Repository: nf-core/eager
Branch: master
Commit: 3f9d64ced5e2
Files: 73
Total size: 774.9 KB

Directory structure:
gitextract_rzjf5t_w/
├── .gitattributes
├── .github/
│   ├── .dockstore.yml
│   ├── CONTRIBUTING.md
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   ├── config.yml
│   │   └── feature_request.md
│   ├── PULL_REQUEST_TEMPLATE/
│   │   └── pull_request_template.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── markdownlint.yml
│   ├── workflows/
│   │   ├── awsfulltest.yml
│   │   ├── awstest.yml
│   │   ├── branch.yml
│   │   ├── ci.yml
│   │   ├── linting.yml
│   │   ├── linting_comment.yml
│   │   ├── push_dockerhub_dev.yml
│   │   └── push_dockerhub_release.yml
│   └── yamllint.yml
├── .gitignore
├── .gitpod.yml
├── .nf-core-lint.yml
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── LICENSE
├── README.md
├── assets/
│   ├── angsd_resources/
│   │   ├── README
│   │   └── getALL.txt
│   ├── email_template.html
│   ├── email_template.txt
│   ├── multiqc_config.yaml
│   ├── nf-core_eager_dummy.txt
│   ├── nf-core_eager_dummy2.txt
│   ├── sendmail_template.txt
│   └── where_are_my_files.txt
├── bin/
│   ├── endorS.py
│   ├── extract_map_reads.py
│   ├── filter_bam_fragment_length.py
│   ├── kraken_parse.py
│   ├── markdown_to_html.py
│   ├── merge_kraken_res.py
│   ├── parse_snp_cov.py
│   ├── print_x_contamination.py
│   └── scrape_software_versions.py
├── conf/
│   ├── base.config
│   ├── benchmarking_human.config
│   ├── benchmarking_vikingfish.config
│   ├── igenomes.config
│   ├── test.config
│   ├── test_direct.config
│   ├── test_full.config
│   ├── test_resources.config
│   ├── test_stresstest_human.config
│   ├── test_tsv_bam.config
│   ├── test_tsv_complex.config
│   ├── test_tsv_fna.config
│   ├── test_tsv_humanbam.config
│   ├── test_tsv_kraken.config
│   └── test_tsv_pretrim.config
├── docs/
│   ├── README.md
│   ├── images/
│   │   ├── README.md
│   │   └── usage/
│   │       └── nfcore-eager_tsv_template.tsv
│   ├── output.md
│   └── usage.md
├── environment.yml
├── lib/
│   ├── Checks.groovy
│   ├── Completion.groovy
│   ├── Headers.groovy
│   ├── NfcoreSchema.groovy
│   └── nfcore_external_java_deps.jar
├── main.nf
├── nextflow.config
└── nextflow_schema.json

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitattributes
================================================
*.config linguist-language=nextflow

================================================
FILE: .github/.dockstore.yml
================================================
# Dockstore config version, not pipeline version
version: 1.2
workflows:
  - subclass: nfl
    primaryDescriptorPath: /nextflow.config
    publish: True

================================================
FILE: .github/CONTRIBUTING.md
================================================
# nf-core/eager: Contributing Guidelines

Hi there!
Many thanks for taking an interest in improving nf-core/eager.

We try to manage the required tasks for nf-core/eager using GitHub issues; you probably came to this page when creating one.
Please use the pre-filled template to save time.

However, don't be put off by this template - other more general issues and suggestions are welcome!
Contributions to the code are even more welcome ;)

> If you need help using or modifying nf-core/eager then the best place to ask is on the nf-core Slack [#eager](https://nfcore.slack.com/channels/eager) channel ([join our Slack here](https://nf-co.re/join/slack)).

## Contribution workflow

If you'd like to write some code for nf-core/eager, the standard workflow is as follows:

1. Check that there isn't already an issue about your idea in the [nf-core/eager issues](https://github.com/nf-core/eager/issues) to avoid duplicating work
    * If there isn't one already, please create one so that others know you're working on this
2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [nf-core/eager repository](https://github.com/nf-core/eager) to your GitHub account
3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions)
4. Use `nf-core schema build .` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10).
5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged

If you're not used to this workflow with git, you can start with some [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests) or even their [excellent `git` resources](https://try.github.io/).

## Tests

When you create a pull request with changes, [GitHub Actions](https://github.com/features/actions) will run automatic tests.
Typically, pull-requests are only fully reviewed when these tests are passing, though of course we can help out before then.

There are typically two types of tests that run:

### Lint tests

`nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to.
To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint ` command.

If any failures or warnings are encountered, please follow the listed URL for more documentation.

### Pipeline tests

Each `nf-core` pipeline should be set up with a minimal set of test-data.
`GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully.
If there are any failures then the automated tests fail.
These tests are run both with the latest available version of `Nextflow` and also the minimum required version that is stated in the pipeline code.

## Patch

:warning: Only in the unlikely and regrettable event of a release happening with a bug.

* On your own fork, make a new branch `patch` based on `upstream/master`.
* Fix the bug, and bump version (X.Y.Z+1).
* A PR should be made on `master` from patch to directly address this particular bug.

## Getting help

For further information/help, please consult the [nf-core/eager documentation](https://nf-co.re/eager/usage) and don't hesitate to get in touch on the nf-core Slack [#eager](https://nfcore.slack.com/channels/eager) channel ([join our Slack here](https://nf-co.re/join/slack)).

## Pipeline contribution conventions

To make the nf-core/eager code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written.

### Adding a new step

If you wish to contribute a new step, please use the following coding standards:

1. Define the corresponding input channel into your new process from the expected previous process channel
2. Write the process block (see below).
3. Define the output channel if needed (see below).
4. Add any new flags/options to `nextflow.config` with a default (see below).
5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build .`).
6. Add sanity checks for all relevant parameters.
7. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`.
8. Do local tests that the new code works properly and as expected.
9. Add a new test command in `.github/workflows/ci.yml`.
10. If applicable, add a [MultiQC](https://multiqc.info/) module.
11. Update MultiQC config `assets/multiqc_config.yaml` so relevant suffixes, name clean up, General Statistics Table column order, and module figures are in the right order.
12. Optional: Add any descriptions of MultiQC report sections and output files to `docs/output.md`.

### Default values

Parameters should be initialised / defined with default values in `nextflow.config` under the `params` scope.
Once there, use `nf-core schema build .` to add to `nextflow_schema.json`.

### Default processes resource requirements

Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generically with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. An nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.

:warning: Note that in nf-core/eager we currently have our own custom process labels, so please check `base.config`!

The process resources can be passed on to the tool dynamically within the process with the `${task.cpus}` and `${task.memory}` variables in the `script:` block.

### Naming schemes

Please use the following naming schemes, to make it easy to understand what is going where.

* initial process channel: `ch_output_from_`
* intermediate and terminal channels: `ch__for_`
* skipped process output: `ch__for_` (this comes out of the bypass statement described below)

### Nextflow version bumping

If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`

### Software version reporting

If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process.

Add to the script block of the process, something like the following:

```bash
--version &> v_.txt 2>&1 || true
```

or

```bash
--help | head -n 1 &> v_.txt 2>&1 || true
```

You then need to edit the script `bin/scrape_software_versions.py` to:

1. Add a Python regex for your tool's `--version` output (as stored in the `v_.txt` file), to ensure the version is reported as a `v` and the version number e.g. `v2.1.1`
2. Add an HTML entry to the `OrderedDict` for formatting in MultiQC.

### Images and figures

For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).

For all internal nf-core/eager documentation images we are using the 'Kalam' font by the Indian Type Foundry and licensed under the Open Font License. It can be downloaded [here](https://fonts.google.com/specimen/Kalam).

## Process Concept

We are providing a highly configurable pipeline, with many options to turn on and off different processes in different combinations. This can make a very complex graph structure that can cause a large amount of duplicated channels coming out of every process to account for each possible combination.

The EAGER pipeline can currently be broken down into the following 'stages', where a stage is a collection of non-terminal, mutually exclusive processes, the output of which is used as input to another module (but not a reporting module!).

* Input
* Convert BAM
* PolyG Clipping
* AdapterRemoval
* Mapping (either `bwa`, `bwamem`, or `circularmapper`)
* BAM Filtering
* Deduplication (either `dedup` or `markduplicates`)
* BAM Trimming
* PMDtools
* Genotyping

Every step can potentially be skipped, therefore the output of a previous stage must be able to be passed to the next stage, if the given stage is not run.

To somewhat simplify this logic, we have implemented the following structure.

The concept is as follows:

* Every 'stage' of the pipeline (i.e. collection of mutually exclusive processes) must always have an if-else statement following it.
* This if-else 'bypass' statement collects and standardises all possible input files into single channel(s) for the next stage.
* Importantly - within the bypass statement, a channel from the previous stage's bypass mixes into these output channels. This additional channel is named `ch_previousstage_for_skipcurrentstage`. This contains the output from the previous stage, i.e. not the modified version from the current stage.
* The bypass statement works as follows:
  * If the current stage is turned on: will mix the previous stage and current stage output and filter for file suffixes unique to the current stage output
  * If the current stage is turned off or skipped: will mix the previous stage and current stage output. However, as there are no files in the output channel from the current stage, no filtering is required and the files in the `ch_XXX_for_skipXXX` channel will be used.

This ensures that the channel input to the next stage is 'homogeneous', i.e. it all comes from the same source (the bypass statement).

An example schematic can be given as follows:

```nextflow
// PREVIOUS STAGE OUTPUT
if (params.run_convertinputbam) {
    ch_input_for_skipconvertbam.mix(ch_output_from_convertbam)
        .filter{ it =~/.*converted.fq/}
        .into { ch_convertbam_for_fastp; ch_convertbam_for_skipfastp }
} else {
    ch_input_for_skipconvertbam
        .into { ch_convertbam_for_fastp; ch_convertbam_for_skipfastp }
}

// SKIPPABLE CURRENT STAGE PROCESS
process fastp {
    publishDir "${params.outdir}/fastp", mode: 'copy'

    when:
    params.run_fastp

    input:
    file fq from ch_convertbam_for_fastp

    output:
    file "*pG.fq" into ch_output_from_fastp

    script:
    """
    echo "I have been fastp'd" > ${fq}
    mv ${fq} ${fq}.pG.fq
    """
}

// NEXT STAGE INPUT PREPARATION
if (params.run_fastp) {
    ch_convertbam_for_skipfastp.mix(ch_output_from_fastp)
        .filter { it =~/.*pG.fq/ }
        .into { ch_fastp_for_adapterremoval; ch_fastp_for_skipadapterremoval }
} else {
    ch_convertbam_for_skipfastp
        .into { ch_fastp_for_adapterremoval; ch_fastp_for_skipadapterremoval }
}
```

================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Report something that is broken or incorrect
labels: bug
---

## Check Documentation

I have checked the following places for your error:

- [ ] [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting)
- [ ] [nf-core/eager pipeline documentation](https://nf-co.re/nf-core/eager/usage)
  - nf-core/eager FAQ/troubleshooting can be found [here](https://nf-co.re/eager/usage#troubleshooting-and-faqs)

## Description of the bug

## Steps to reproduce

Steps to reproduce the behaviour:

1. Command line: `nextflow run ...`
2. See error: _Please provide your error message_

## Expected behaviour

## Log files

Have you provided the following extra information/files:

- [ ] The command used to run the pipeline
- [ ] The `.nextflow.log` file
- [ ] The exact error:

## System

- Hardware:
- Executor:
- OS:
- Version

## Nextflow Installation

- Version:

## Container engine

- Engine:
- version:
- Image tag:

## Additional context

================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: false
contact_links:
  - name: Join nf-core
    url: https://nf-co.re/join
    about: Please join the nf-core community here
  - name: "Slack #eager channel"
    url: https://nfcore.slack.com/channels/eager
    about: Discussion about the nf-core/eager pipeline

================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: Feature request
about: Suggest an idea for the nf-core/eager pipeline
labels: enhancement
---

## Is your feature request related to a problem? Please describe

## Describe the solution you'd like

## Describe alternatives you've considered

## Additional context

================================================
FILE: .github/PULL_REQUEST_TEMPLATE/pull_request_template.md
================================================
Many thanks for contributing to nf-core/eager!

Please fill in the appropriate checklist below (delete whatever is not relevant).
These are the most common things requested on pull requests (PRs).

## PR checklist

- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py`
- [ ] If necessary, also make a PR on the [nf-core/eager branch on the nf-core/test-datasets repo](https://github.com/nf-core/test-datasets/pull/new/nf-core/eager).
- [ ] Make sure your code lints (`nf-core lint .`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).

**Learn more about contributing:** https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md

================================================
FILE: .github/PULL_REQUEST_TEMPLATE.md
================================================
## PR checklist

- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py`
- [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md)
- [ ] If necessary, also make a PR on the nf-core/eager _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository.
- [ ] Make sure your code lints (`nf-core lint .`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).

================================================
FILE: .github/markdownlint.yml
================================================
# Markdownlint configuration file
default: true
line-length: false
no-duplicate-header:
  siblings_only: true
no-inline-html:
  allowed_elements:
    - img
    - p
    - kbd
    - details
    - summary

================================================
FILE: .github/workflows/awsfulltest.yml
================================================
name: nf-core AWS full size tests
# This workflow is triggered on published releases.
# It can be additionally triggered manually with GitHub actions workflow dispatch.
# It runs the -profile 'test_full' on AWS batch

on:
  workflow_run:
    workflows: ["nf-core Docker push (release)"]
    types: [completed]
  workflow_dispatch:

env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }}
  AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }}
  AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }}
  AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}

jobs:
  run-awstest:
    name: Run AWS full tests
    if: github.repository == 'nf-core/eager'
    runs-on: ubuntu-latest
    steps:
      - name: Setup Miniconda
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          python-version: 3.7
      - name: Install awscli
        run: conda install -c conda-forge awscli
      - name: Start AWS batch job
        # Add full size test data (but still relatively small datasets for few samples)
        # on the `test_full.config` test runs with only one set of parameters
        # Then specify `-profile test_full` instead of `-profile test` on the AWS batch command
        run: |
          aws batch submit-job \
            --region eu-west-1 \
            --job-name nf-core-eager \
            --job-queue $AWS_JOB_QUEUE \
            --job-definition $AWS_JOB_DEFINITION \
            --container-overrides '{"command": ["nf-core/eager", "-r '"${GITHUB_SHA}"' -profile test_full --outdir s3://'"${AWS_S3_BUCKET}"'/eager/results-'"${GITHUB_SHA}"' -w s3://'"${AWS_S3_BUCKET}"'/eager/work-'"${GITHUB_SHA}"' -with-tower"], "environment": [{"name": "TOWER_ACCESS_TOKEN", "value": "'"$TOWER_ACCESS_TOKEN"'"}]}'

================================================
FILE: .github/workflows/awstest.yml
================================================
name: nf-core AWS test
# This workflow is triggered on push to the master branch.
# It can be additionally triggered manually with GitHub actions workflow dispatch.
# It runs the -profile 'test' on AWS batch.

on:
  workflow_dispatch:

env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }}
  AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }}
  AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }}
  AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}

jobs:
  run-awstest:
    name: Run AWS tests
    if: github.repository == 'nf-core/eager'
    runs-on: ubuntu-latest
    steps:
      - name: Setup Miniconda
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          python-version: 3.7
      - name: Install awscli
        run: conda install -c conda-forge awscli
      - name: Start AWS batch job
        # For example: adding multiple test runs with different parameters
        # Remember that you can parallelise this by using strategy.matrix
        run: |
          aws batch submit-job \
            --region eu-west-1 \
            --job-name nf-core-eager \
            --job-queue $AWS_JOB_QUEUE \
            --job-definition $AWS_JOB_DEFINITION \
            --container-overrides '{"command": ["nf-core/eager", "-r '"${GITHUB_SHA}"' -profile test_tsv_complex --outdir s3://'"${AWS_S3_BUCKET}"'/eager/results-'"${GITHUB_SHA}"' -w s3://'"${AWS_S3_BUCKET}"'/eager/work-'"${GITHUB_SHA}"' -with-tower"], "environment": [{"name": "TOWER_ACCESS_TOKEN", "value": "'"$TOWER_ACCESS_TOKEN"'"}]}'

================================================
FILE: .github/workflows/branch.yml
================================================
name: nf-core branch protection
# This workflow is triggered on PRs to master branch on the repository
# It fails when someone tries to make a PR against the nf-core `master` branch instead of `dev`
on:
  pull_request_target:
    branches: [master]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # PRs to the nf-core repo master branch are only ok if coming from the nf-core repo `dev` or any `patch` branches
      - name: Check PRs
        if: github.repository == 'nf-core/eager'
        run: |
          { [[ ${{github.event.pull_request.head.repo.full_name }} == nf-core/eager ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]]

      # If the above check failed, post a comment on the PR explaining the failure
      # NOTE - this doesn't currently work if the PR is coming from a fork, due to limitations in GitHub actions secrets
      - name: Post PR comment
        if: failure()
        uses: mshick/add-pr-comment@v1
        with:
          message: |
            ## This PR is against the `master` branch :x:

            * Do not close this PR
            * Click _Edit_ and change the `base` to `dev`
            * This CI test will remain failed until you push a new commit

            ---

            Hi @${{ github.event.pull_request.user.login }},

            It looks like this pull-request has been made against the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `master` branch.
            The `master` branch on nf-core repositories should always contain code from the latest release.
            Because of this, PRs to `master` are only allowed if they come from the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `dev` branch.

            You do not need to close this PR, you can change the target branch to `dev` by clicking the _"Edit"_ button at the top of this page.
            Note that even after this, the test will continue to show as failing until you push a new commit.

            Thanks again for your contribution!
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          allow-repeats: false

================================================
FILE: .github/workflows/ci.yml
================================================
name: nf-core CI
# This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors
on:
  push:
    branches:
      - dev
  pull_request:
  release:
    types: [published]

# Uncomment if we need an edge release of Nextflow again
# env: NXF_EDGE: 1

jobs:
  test:
    name: Run workflow tests
    # Only run on push if this is the nf-core dev branch (merged PRs)
    if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/eager') }}
    runs-on: ubuntu-latest
    env:
      NXF_VER: ${{ matrix.nxf_ver }}
      NXF_ANSI_LOG: false
    strategy:
      matrix:
        # Nextflow versions: check pipeline minimum and current latest
        nxf_ver: ["20.07.1", "22.10.6"]
    steps:
      - name: Check out pipeline code
        uses: actions/checkout@v2
      - name: Install older Java
        uses: actions/setup-java@v4
        with:
          distribution: "temurin" # See 'Supported distributions' for available options
          java-version: "11"
      - name: Check if Dockerfile or Conda environment changed
        uses: technote-space/get-diff-action@v4
        with:
          FILES: |
            Dockerfile
            environment.yml
      - name: Build new docker image
        if: env.MATCHED_FILES
        run: docker build --no-cache . -t nfcore/eager:2.5.3
      - name: Pull docker image
        if: ${{ !env.MATCHED_FILES }}
        run: |
          docker pull nfcore/eager:dev
          docker tag nfcore/eager:dev nfcore/eager:2.5.3
      - name: Install Nextflow
        env:
          CAPSULE_LOG: none
        run: |
          wget -qO- https://github.com/nextflow-io/nextflow/releases/download/v22.10.6/nextflow | bash
          sudo mv nextflow /usr/local/bin/
      - name: HELPTEXT Run with the help flag
        run: |
          nextflow run ${GITHUB_WORKSPACE} --help
      - name: Get test data for cases where we don't use TSV input
        run: |
          git clone --single-branch --branch eager https://github.com/nf-core/test-datasets.git data
      - name: DELAY to try address some odd behaviour with what appears to be a conflict between parallel htslib jobs leading to CI hangs
        run: |
          if [[ $NXF_VER = '' ]]; then sleep 1200; fi
      - name: BASIC Run the basic pipeline with directly supplied single-end FASTQ
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_direct,docker --input 'data/testdata/Mammoth/fastq/*_R1_*.fq.gz' --single_end
      - name: BASIC Run the basic pipeline with directly supplied paired-end FASTQ
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_direct,docker --input 'data/testdata/Mammoth/fastq/*_{R1,R2}_*tengrand.fq.gz'
      - name: BASIC Run the basic pipeline with supplied --input BAM
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_direct,docker --input 'data/testdata/Mammoth/bam/*_R1_*.bam' --bam --single_end
      - name: BASIC Run the basic pipeline with the test profile with, PE/SE, bwa aln
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --save_reference
      - name: REFERENCE Basic workflow, with supplied indices
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --bwa_index 'results/reference_genome/bwa_index/BWAIndex/' --fasta_index 'https://github.com/nf-core/test-datasets/blob/eager/reference/Mammoth/Mammoth_MT_Krause.fasta.fai'
      - name: REFERENCE Run the basic pipeline with FastA reference with `fna` extension
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker
      - name: REFERENCE Test with zipped reference input
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --fasta 'https://github.com/nf-core/test-datasets/raw/eager/reference/Mammoth/Mammoth_MT_Krause.fasta.gz'
      - name: FASTP Test fastp complexity filtering
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --complexity_filter_poly_g
      - name: ADAPTERREMOVAL Test skip paired end collapsing
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --skip_collapse
      - name: ADAPTERREMOVAL Test paired end collapsing but no trimming
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_pretrim,docker --skip_trim
      - name: ADAPTERREMOVAL Run the basic pipeline with paired end data without adapterRemoval
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --skip_adapterremoval
      - name: ADAPTERREMOVAL Run the basic pipeline with preserve5p end option
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --preserve5p
      - name: ADAPTERREMOVAL Run the basic pipeline with merged only option
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --mergedonly
      - name: ADAPTERREMOVAL Run the basic pipeline with preserve5p end and merged reads only options
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --preserve5p --mergedonly
      - name: ADAPTER LIST Run the basic pipeline using an adapter list
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --clip_adapters_list 'https://github.com/nf-core/test-datasets/raw/eager/databases/adapters/adapter-list.txt'
      - name: ADAPTER LIST Run the basic pipeline using an adapter list, skipping adapter removal
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --clip_adapters_list 'https://github.com/nf-core/test-datasets/raw/eager/databases/adapters/adapter-list.txt' --skip_adapterremoval
      - name: POST_AR_FASTQ_TRIMMING Run the basic pipeline post-adapterremoval FASTQ trimming
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_post_ar_trimming
      - name: POST_AR_FASTQ_TRIMMING Run the basic pipeline post-adapterremoval FASTQ trimming, but skip adapterremoval
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_post_ar_trimming --skip_adapterremoval
      - name: MAPPER_CIRCULARMAPPER Test running with CircularMapper
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --mapper 'circularmapper' --circulartarget 'NC_007596.2'
      - name: MAPPER_BWAMEM Test running with BWA Mem
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --mapper 'bwamem' --skip_collapse
      - name: MAPPER_BT2 Test running with BowTie2
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --mapper 'bowtie2' --bt2_alignmode 'local' --bt2_sensitivity 'sensitive' --bt2n 1 --bt2l 16 --bt2_trim5 1 --bt2_trim3 1
      - name: HOST_REMOVAL_FASTQ Run the basic pipeline with output unmapped reads as fastq
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --hostremoval_input_fastq
      - name: BAM_FILTERING Run basic mapping pipeline with mapping quality filtering, and unmapped export
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bam_filtering --bam_mapping_quality_threshold 37 --bam_unmapped_type 'fastq'
      - name: BAM_FILTERING Run basic mapping pipeline with post-mapping length filtering
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --clip_readlength 0 --run_bam_filtering --bam_filter_minreadlength 50
      - name: PRESEQ Run basic mapping pipeline with different preseq mode
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --preseq_mode 'lc_extrap' --preseq_maxextrap 10000 --preseq_bootstrap 10
      - name: DEDUPLICATION Test with dedup
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --dedupper 'dedup' --dedup_all_merged
      - name: BEDTOOLS Test bedtools feature annotation
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bedtools_coverage --anno_file 'https://github.com/nf-core/test-datasets/raw/eager/reference/Mammoth/Mammoth_MT_Krause.gff3'
      - name: MAPDAMAGE2 damage calculation
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --damage_calculation_tool 'mapdamage'
      - name: GENOTYPING_HC Test running GATK HaplotypeCaller
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker --run_genotyping --genotyping_tool 'hc' --gatk_hc_out_mode 'EMIT_ALL_ACTIVE_SITES' --gatk_hc_emitrefconf 'BP_RESOLUTION'
      - name: GENOTYPING_FB Test running FreeBayes
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_genotyping --genotyping_tool 'freebayes'
      - name: GENOTYPING_PC Test running pileupCaller
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --run_genotyping --genotyping_tool 'pileupcaller'
      - name: GENOTYPING_ANGSD Test running ANGSD genotype likelihood calculation
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --run_genotyping --genotyping_tool 'angsd'
      - name: GENOTYPING_BCFTOOLS Test running FreeBayes with bcftools stats turned on
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_genotyping --genotyping_tool 'freebayes' --run_bcftools_stats
      - name: SKIPPING Test checking all skip steps work i.e. input bam, skipping straight to genotyping
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --skip_preseq --skip_damage_calculation --run_genotyping --genotyping_tool 'freebayes'
      - name: TRIMBAM Test bamutils works alone
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_trim_bam
      - name: PMDTOOLS Test PMDtools works alone
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_pmdtools
      - name: GENOTYPING_UG AND MULTIVCFANALYZER Test running GATK UnifiedGenotyper and MultiVCFAnalyzer, additional VCFS
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_genotyping --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer --additional_vcf_files 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/vcf/JK2772_CATCAGTGAGTAGA_L008_R1_001.fastq.gz.tengrand.fq.combined.fq.mapped_rmdup.bam.unifiedgenotyper.vcf.gz' --write_allele_frequencies
      - name: COMPLEX LANE/LIBRARY MERGING Test running lane and library merging prior to GATK UnifiedGenotyper and running MultiVCFAnalyzer
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --run_genotyping --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer
      - name: GENOTYPING_UG ON TRIMMED BAM Test
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_genotyping --run_trim_bam --genotyping_source 'trimmed' --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP'
      - name: BAM_INPUT Run the basic pipeline with the bam input profile, skip AdapterRemoval as no convertBam
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --skip_adapterremoval
      - name: BAM_INPUT Run the basic pipeline with the bam input profile, convert to FASTQ for adapterremoval test and downstream
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --run_convertinputbam
      - name: METAGENOMIC Download MALT database
        run: |
          mkdir -p databases/malt
          readlink -f databases/malt/
          for i in index0.idx ref.db ref.idx ref.inf table0.db table0.idx taxonomy.idx taxonomy.map taxonomy.tre; do wget https://github.com/nf-core/test-datasets/raw/eager/databases/malt/"$i" -P databases/malt/; done
      - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into MALT
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bam_filtering --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/" --malt_sam_output
      - name: METAGENOMIC Run the basic pipeline but low-complexity filtered reads going into MALT
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bam_filtering --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/" --metagenomic_complexity_filter
      - name: MALTEXTRACT Download resource files
        run: |
          mkdir -p databases/maltextract
          for i in ncbi.tre ncbi.map; do wget https://github.com/rhuebler/HOPS/raw/0.33/Resources/"$i" -P databases/maltextract/; done
      - name: MALTEXTRACT Basic with MALT plus MaltExtract
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bam_filtering --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt" --run_maltextract --maltextract_ncbifiles "/home/runner/work/eager/eager/databases/maltextract/" --maltextract_taxon_list 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/maltextract/MaltExtract_list.txt'
      - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into Kraken
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_kraken,docker --run_bam_filtering --bam_unmapped_type 'fastq'
      - name: SNPCAPTURE Run the
basic pipeline with the bam input profile, generating statistics with a SNP capture bed run: | wget https://github.com/nf-core/test-datasets/raw/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz && gunzip 1240K.pos.list_hs37d5.0based.bed.gz nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --snpcapture_bed 1240K.pos.list_hs37d5.0based.bed - name: SEXDETERMINATION Run the basic pipeline with the bam input profile, but don't convert BAM, skip everything but sex determination run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --run_sexdeterrmine - name: NUCLEAR CONTAMINATION Run basic pipeline with bam input profile, but don't convert BAM, skip everything but nuclear contamination estimation run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --run_nuclear_contamination - name: MTNUCRATIO Run basic pipeline with bam input profile, but don't convert BAM, skip everything but mtnucratio run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --skip_preseq --skip_damage_calculation --run_mtnucratio - name: RESCALING Run basic pipeline but with mapDamage rescaling of BAM files. Note this will be slow run: | nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_mapdamage_rescaling --run_genotyping --genotyping_tool hc --genotyping_source 'rescaled' ================================================ FILE: .github/workflows/linting.yml ================================================ name: nf-core linting # This workflow is triggered on pushes and PRs to the repository.
# It runs the `nf-core lint` and markdown lint tests to ensure that the code meets the nf-core guidelines on: push: pull_request: release: types: [published] jobs: Markdown: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - uses: actions/setup-node@v2 - name: Install markdownlint run: npm install -g markdownlint-cli - name: Run Markdownlint run: markdownlint ${GITHUB_WORKSPACE} -c ${GITHUB_WORKSPACE}/.github/markdownlint.yml # If the above check failed, post a comment on the PR explaining the failure - name: Post PR comment if: failure() uses: mshick/add-pr-comment@v1 with: message: | ## Markdown linting is failing To keep the code consistent with lots of contributors, we run automated code consistency checks. To fix this CI test, please run: * Install `markdownlint-cli` * On Mac: `brew install markdownlint-cli` * Everything else: [Install `npm`](https://www.npmjs.com/get-npm) then [install `markdownlint-cli`](https://www.npmjs.com/package/markdownlint-cli) (`npm install -g markdownlint-cli`) * Fix the markdown errors * Automatically: `markdownlint . --config .github/markdownlint.yml --fix` * Manually resolve anything left from `markdownlint . --config .github/markdownlint.yml` Once you push these changes the test should pass, and you can hide this comment :+1: We highly recommend setting up markdownlint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help! Thanks again for your contribution! 
repo-token: ${{ secrets.GITHUB_TOKEN }} allow-repeats: false YAML: runs-on: ubuntu-latest steps: - uses: actions/checkout@v1 - uses: actions/setup-node@v2 - name: Install yaml-lint run: npm install -g yaml-lint - name: Run yaml-lint run: yamllint $(find ${GITHUB_WORKSPACE} -type f -name "*.yml" -o -name "*.yaml") -c .github/yamllint.yml # If the above check failed, post a comment on the PR explaining the failure - name: Post PR comment if: failure() uses: mshick/add-pr-comment@v1 with: message: | ## YAML linting is failing To keep the code consistent with lots of contributors, we run automated code consistency checks. To fix this CI test, please run: * Install `yaml-lint` * [Install `npm`](https://www.npmjs.com/get-npm) then [install `yaml-lint`](https://www.npmjs.com/package/yaml-lint) (`npm install -g yaml-lint`) * Fix the markdown errors * Run the test locally: `yamllint $(find . -type f -name "*.yml" -o -name "*.yaml")` * Fix any reported errors in your YAML files Once you push these changes the test should pass, and you can hide this comment :+1: We highly recommend setting up yaml-lint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help! Thanks again for your contribution! 
repo-token: ${{ secrets.GITHUB_TOKEN }} allow-repeats: false nf-core: runs-on: ubuntu-latest steps: - name: Check out pipeline code uses: actions/checkout@v2 - name: Install Nextflow env: CAPSULE_LOG: none run: | wget -qO- get.nextflow.io | bash sudo mv nextflow /usr/local/bin/ - uses: actions/setup-python@v1 with: python-version: "3.6" architecture: "x64" - name: Install dependencies run: | python -m pip install --upgrade pip pip install nf-core==1.14 - name: Run nf-core lint env: GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }} GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }} run: nf-core -l lint_log.txt lint ${GITHUB_WORKSPACE} --markdown lint_results.md - name: Save PR number if: ${{ always() }} run: echo ${{ github.event.pull_request.number }} > PR_number.txt - name: Upload linting log file artifact if: ${{ always() }} uses: actions/upload-artifact@v2 with: name: linting-logs path: | lint_log.txt lint_results.md PR_number.txt ================================================ FILE: .github/workflows/linting_comment.yml ================================================ name: nf-core linting comment # This workflow is triggered after the linting action is complete # It posts an automated comment to the PR, even if the PR is coming from a fork on: workflow_run: workflows: ["nf-core linting"] jobs: test: runs-on: ubuntu-latest steps: - name: Download lint results uses: dawidd6/action-download-artifact@v2 with: workflow: linting.yml - name: Get PR number id: pr_number run: echo "::set-output name=pr_number::$(cat linting-logs/PR_number.txt)" - name: Post PR comment uses: marocchino/sticky-pull-request-comment@v2 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} number: ${{ steps.pr_number.outputs.pr_number }} path: linting-logs/lint_results.md ================================================ FILE: .github/workflows/push_dockerhub_dev.yml ================================================ name: nf-core 
Docker push (dev) # This builds the docker image and pushes it to DockerHub # Runs on nf-core repo releases and push event to 'dev' branch (PR merges) on: push: branches: - dev jobs: push_dockerhub: name: Push new Docker image to Docker Hub (dev) runs-on: ubuntu-latest # Only run for the nf-core repo, for releases and merged PRs if: ${{ github.repository == 'nf-core/eager' }} env: DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }} DOCKERHUB_PASS: ${{ secrets.DOCKERHUB_PASS }} steps: - name: Check out pipeline code uses: actions/checkout@v2 - name: Build new docker image run: docker build --no-cache . -t nfcore/eager:dev - name: Push Docker image to DockerHub (dev) run: | echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin docker push nfcore/eager:dev ================================================ FILE: .github/workflows/push_dockerhub_release.yml ================================================ name: nf-core Docker push (release) # This builds the docker image and pushes it to DockerHub # Runs on nf-core repo releases and push event to 'dev' branch (PR merges) on: release: types: [published] jobs: push_dockerhub: name: Push new Docker image to Docker Hub (release) runs-on: ubuntu-latest # Only run for the nf-core repo, for releases and merged PRs if: ${{ github.repository == 'nf-core/eager' }} env: DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }} DOCKERHUB_PASS: ${{ secrets.DOCKERHUB_PASS }} steps: - name: Check out pipeline code uses: actions/checkout@v2 - name: Build new docker image run: docker build --no-cache . 
-t nfcore/eager:latest - name: Push Docker image to DockerHub (release) run: | echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin docker push nfcore/eager:latest docker tag nfcore/eager:latest nfcore/eager:${{ github.event.release.tag_name }} docker push nfcore/eager:${{ github.event.release.tag_name }} ================================================ FILE: .github/yamllint.yml ================================================ rules: document-start: disable comments: disable truthy: disable line-length: disable empty-lines: disable ================================================ FILE: .gitignore ================================================ .nextflow* work/ data/ results/ .DS_Store tests/ testing/ testing* *.pyc main_playground.nf .vscode *.code-workspace nf-params.json ================================================ FILE: .gitpod.yml ================================================ image: nfcore/gitpod:latest vscode: extensions: # based on nf-core.nf-core-extensionpack - codezombiech.gitignore # Language support for .gitignore files # - cssho.vscode-svgviewer # SVG viewer - esbenp.prettier-vscode # Markdown/CommonMark linting and style checking for Visual Studio Code - eamodio.gitlens # Quickly glimpse into whom, why, and when a line or code block was changed - EditorConfig.EditorConfig # override user/workspace settings with settings found in .editorconfig files - Gruntfuggly.todo-tree # Display TODO and FIXME in a tree view in the activity bar - mechatroner.rainbow-csv # Highlight columns in csv files in different colors # - nextflow.nextflow # Nextflow syntax highlighting - oderwat.indent-rainbow # Highlight indentation level - streetsidesoftware.code-spell-checker # Spelling checker for source code ================================================ FILE: .nf-core-lint.yml ================================================ files_unchanged: - assets/multiqc_config.yaml - .github/CONTRIBUTING.md - .github/ISSUE_TEMPLATE/bug_report.md 
- docs/README.md - .github/workflows/linting.yml ================================================ FILE: CHANGELOG.md ================================================ # nf-core/eager: Changelog The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html). ## [2.5.3] - 2025-03-18 ### `Added` ### `Fixed` - [#1119](https://github.com/nf-core/eager/issues/1119) - Fix typo in variable of IndelRealigner step of UnifiedGenotyper when generating a targetIntervals file (♥ to @Dog13Golf for reporting, fix by @jfy133). ### `Dependencies` ### `Deprecated` ## [2.5.2] - 2024-06-28 ### `Added` - [#1079](https://github.com/nf-core/eager/issues/1079) - Added the `lanemerging` output directory in the output documentation (♥ to @TessaZei for reporting, fix by @TCLamnidis). ### `Fixed` - [#1037](https://github.com/nf-core/eager/issues/1073) - Fixed post-adapterremoval trimmed FastQC results not being displayed in MultiQC (♥ to @kieren-j-mitchell for reporting, fix by @jfy133 and @TCLamnidis) ### `Dependencies` ### `Deprecated` ## [2.5.1] - 2024-02-21 ### `Added` - [#1037](https://github.com/nf-core/eager/issues/1037) Added an option to deactivate the `-sorted` option of bedtools coverage, in case the feature file is not sorted the same way as the fasta file, albeit with the caveat this will be very slow. (♥ Thanks to @IdoBar for reporting, and contributing.) ### `Fixed` - [#1048](https://github.com/nf-core/eager/issues/1048) `--vcf2genome_outfile` parameter now gets prefixed by the sample_name and suffixed with `.fasta` (i.e. `_.fasta`). This ensures we avoid overwriting the output fasta of one sample with that of another when the option is provided. (♥ Thanks to @MeriamOs for reporting.) - [#1047](https://github.com/nf-core/eager/issues/1047) Changed the row some statistics were reported in the General Stats table. 
The File name collision fixed in 2.5.0 (see #1017) caused these statistics to be reported in the wrong row due to an added suffix. - [#1051](https://github.com/nf-core/eager/issues/1051) An error is now thrown if input BAM files end in `.unmapped.bam`, as this breaks the bam filtering process and empties the bam files in the process. (♥ Thanks to @PCQuilis for reporting.) ### `Dependencies` ### `Deprecated` ## [2.5.0] - Bopfingen - 2023-11-03 ### `Added` - [#1020](https://github.com/nf-core/eager/issues/1020) Added mapDamage2 as an alternative for damage calculation. ### `Fixed` - [#1017](https://github.com/nf-core/eager/issues/1017) Fixed file name collision in niche cases with multiple libraries of multiple UDG treatments. - [#1024](https://github.com/nf-core/eager/issues/1024) `multiqc_general_stats.txt` is now generated even if the table is a beeswarm plot in the report. - [#655](https://github.com/nf-core/eager/issues/655) Updated RG tags for all mappers. RG-id now includes Sample as well as Library ID. Added `LB:` tag with the library ID. - [#1031](https://github.com/nf-core/eager/issues/1031) Always index fasta regardless of mapper. This ensures that DamageProfiler and genotyping processes get submitted when using bowtie2 and not providing a fasta index. 
### `Dependencies` - `multiqc`: 1.14 -> 1.16 ### `Deprecated` ## [2.4.7] - 2023-05-16 ### `Added` ### `Fixed` - [#983](https://github.com/nf-core/eager/issues/983) Bump `pygments` version due to incompatibility with MultiQC dependencies (♥ to @MinLuke for reporting) ### `Dependencies` - `pygments`: 2.9 -> 2.14 - `multiqc`: 1.13 -> 1.14 ### `Deprecated` ## [2.4.6] - 2022-11-14 ### `Added` - [#933](https://github.com/nf-core/eager/issues/933) Added support for customising --seq-length in mapDamage rescaling (♥ to @ashildv for requesting) ### `Fixed` - Changed endors.py license from GPL to MIT (♥ to @aidaanva for fixing) - Removed erroneous R2 in single-end example in input TSV of usage docs (♥ to @aidaanva for fixing) - [#928](https://github.com/nf-core/eager/issues/928) Fixed read group incompatibility by re-adding picard AddOrReplaceReadGroups for MultiVCFAnalyzer (♥ to @aidaanva, @meganemichel for reporting) - Fixed edge case of DamageProfiler occasionally requiring FASTA index (♥ to @asmaa-a-abdelwahab for reporting) - [#834](https://github.com/nf-core/eager/issues/834) Increased significance values in general stats table for Qualimap mean/median coverages (♥ to @neija2611 for reporting) - Fixed parameter documentation for `--snpcapture_bed` regarding on-target SNP stats to state these stats are currently not displayed in MultiQC, only in the Qualimap results (♥ to @meganemichel and @TCLamnidis for reporting) - [#934](https://github.com/nf-core/eager/issues/934) Fixed broken parameter setting in mapDamage2 rescale length (♥ to @ashildv for reporting) ### `Dependencies` - Updated MultiQC to official 1.13 version (rather than alpha) - Added pinned MALT dependency to ensure working version in future versions of eager ### `Deprecated` ## [2.4.5] - 2022-08-02 ### `Added` ### `Fixed` - [#882](https://github.com/nf-core/eager/pull/882) Define DSL1 execution explicitly, as new versions of Nextflow made DSL2 default (♥ to & fix from @Lehmann-Fabian) -
[#879](https://github.com/nf-core/eager/issues/879) Add missing threads parameter for pre-clipping FastQC for single end data that caused insufficient memory in some cases (♥ to @marcel-keller for reporting) - [#880](https://github.com/nf-core/eager/issues/880) Fix failure of endorSpy to be cached or reexecuted on resume (♥ to @KathrinNaegele, @TCLamnidis, & @mahesh-panchal for reporting and debugging) - [#885](https://github.com/nf-core/eager/issues/885) Specify task memory for all tools in get_software_versions to account for incompatibility of java with some SGE clusters causing hanging of the process (♥ to @maxibor for reporting) - [#887](https://github.com/nf-core/eager/issues/887) Clarify what is considered 'ultra-short' reads in the help text of clip_readlength, for when you may wish to turn off length filtering during AdapterRemoval (♥ to @TCLamnidis for reporting) - [#889](https://github.com/nf-core/eager/issues/889) Remove/update parameters from benchmarking test profiles (♥ to @TCLamnidis for reporting) - [#895](https://github.com/nf-core/eager/issues/895) Output documentation typo fix and added location of output docs in pipeline summary (♥ to @RodrigoBarquera for reporting) - [#897](https://github.com/nf-core/eager/issues/897) Fix pipeline crash if no Kraken2 results generated (♥ to @alexandregilardet for reporting) - [#899](https://github.com/nf-core/eager/issues/897) Fix pipeline crash for circulargenerator if reference file does not end in .fasta (♥ to @scarlhoff for reporting) - Fixed some missing default values in the nextflow parameter schema JSON - [#789](https://github.com/nf-core/eager/issues/789) Substantial speed and memory optimisation of the `extract_map_reads.py` script (♥ to @ivelsko for reporting, @maxibor for optimisation) - Fix staging of input bams for genotyping_pileupcaller process. Downstream changes from changes introduced when fixing endorspy caching.
- Made slight correction on metro map diagram regarding input data to SexDeterrmine (only BAM trimming output files) ### `Dependencies` - Updated MultiQC to latest stable alpha version on bioconda, correcting the previously nonsensical AdapterRemoval plots (♥ to @NiemannJ for fixing in MultiQC) ### `Deprecated` ## [2.4.4] - 2022-04-08 ### `Added` ### `Fixed` - Fixed some auxiliary files (adapter list, snpcapture/pileupcaller/sexdeterrmine BED files, and pileupCaller SNP file, PMD reference mask) in some cases only being used against one sample (❤ to @meganemichel for reporting, fix by @jfy133) ### `Dependencies` ### `Deprecated` ## [2.4.3] - 2022-03-24 ### `Added` ### `Fixed` - [#828](https://github.com/nf-core/eager/issues/828) Improved error message if required metagenomic screening parameters not set correctly - [#836](https://github.com/nf-core/eager/issues/836) Remove deprecated parameters from test profiles - [#838](https://github.com/nf-core/eager/issues/838) Fix --snpcapture_bed files not being picked up by Nextflow (❤ to @meganemichel for reporting) - [#843](https://github.com/nf-core/eager/issues/843) Re-add direct piping of AdapterRemovalFixPrefix to pigz - [#844](https://github.com/nf-core/eager/issues/844) Fixed reference masking prior to pmdtools - [#845](https://github.com/nf-core/eager/issues/845) Updates parameter documentation to specify `-s` preseq parameter also applies to lc_extrap - [#851](https://github.com/nf-core/eager/issues/851) Fixes a file-name clash during additional_library_merge, post-BAM trimming of different UDG treated libraries of a sample (❤ to @alexandregilardet for reporting) - Renamed a range of MultiQC general stats table headers to improve clarity, documentation has been updated accordingly - [#857](https://github.com/nf-core/eager/issues/857) Corrected samtools fastq flag to _retain_ read-pair information when converting off-target BAM files to fastq in paired-end mapping (❤ to @alexhbnr for reporting) -
[#866](https://github.com/nf-core/eager/issues/866) Fixed a typo in the indexing step of BWA mem when not-collapsing (❤ to @alexhbnr for reporting) - Corrected tutorials to reflect updated BAM trimming flags (❤ to @marcel-keller for reporting and correcting) ### `Dependencies` - [#829](https://github.com/nf-core/eager/issues/829) Bumped sequencetools: 1.4.0.5 -> 1.5.2 - Bumped MultiQC: 1.11 -> 1.12 (for run-time optimisation and tool citation information) ### `Deprecated` ## [2.4.2] - 2022-01-24 ### `Added` ### `Fixed` - [#824](https://github.com/nf-core/eager/issues/824) Fixes large memory footprint of bedtools coverage calculation. - [#822](https://github.com/nf-core/eager/issues/822) Fixed post-adapterremoval trimmed files not being lane-merged and included in downstream analyses - Fixed a couple of software version reporting commands ### `Dependencies` ### `Deprecated` ## [2.4.1] - 2021-11-30 ### `Added` - [#805](https://github.com/nf-core/eager/issues/805) Changes to bam_trim options to allow flexible trimming by library strandedness (in addition to UDG treatment). (@TCLamnidis) - [#808](https://github.com/nf-core/eager/issues/808) Retain read group information across bam merges. Sample set to sample name (rather than library name) in bwa output 'RG' readgroup tag. (@TCLamnidis) - Map and base quality filters prior to genotyping with pileupcaller can now be specified. 
(@TCLamnidis) - [#774](https://github.com/nf-core/eager/issues/774) Added support for multi-threaded Bowtie2 build reference genome indexing (@jfy133) - [#804](https://github.com/nf-core/eager/issues/804) Improved output documentation description to add how 'cluster factor' is calculated (thanks to @meganemichel) ### `Fixed` - [#803](https://github.com/nf-core/eager/issues/803) Fixed mistake in metro-map diagram (`samtools index` is now correctly `samtools faidx`) (@jfy133) ### `Dependencies` ### `Deprecated` ## [2.4.0] - Wangen - 2021-09-14 ### `Added` - [#317](https://github.com/nf-core/eager/issues/317) Added bcftools stats for general genotyping statistics of VCF files - [#651](https://github.com/nf-core/eager/issues/651) - Adds removal of adapters specified in an AdapterRemoval adapter list file - [#642](https://github.com/nf-core/eager/issues/642) and [#431](https://github.com/nf-core/eager/issues/431) adds post-adapter removal barcode/fastq trimming - [#769](https://github.com/nf-core/eager/issues/769) - Adds lc_extrap mode to preseq (suggested by @roberta-davidson) ### `Fixed` - Fixed some missing or incorrectly reported software versions - [#771](https://github.com/nf-core/eager/issues/771) Remove legacy code - Improved output documentation for MultiQC general stats table (thanks to @KathrinNaegele and @esalmela) - Improved output documentation for BowTie2 (thanks to @isinaltinkaya) - [#612](https://github.com/nf-core/eager/issues/612) Updated BAM trimming defaults to 0 to ensure no unwanted trimming when mixing half-UDG with no-UDG (thanks to @scarlhoff) - [#722](https://github.com/nf-core/eager/issues/722) Updated BWA mapping parameters to latest recommendations - primarily alnn back to 0.01 and alno to 2 as per Oliva et al.
2021 (10.1093/bib/bbab076) - Updated workflow diagrams to reflect latest functionality - [#787](https://github.com/nf-core/eager/issues/787) Adds memory specification flags for the GATK UnifiedGenotyper and HaplotypeCaller steps (thanks to @nylander) - Fixed issue where MultiVCFAnalyzer would not pick up newly generated VCF files, when specifying additional VCF files. - [#790](https://github.com/nf-core/eager/issues/790) Fixed kraken2 report file-name collision when sample names have `.` in them - [#792](https://github.com/nf-core/eager/issues/792) Fixed java error messages for AdapterRemovalFixPrefix being hidden in output - [#794](https://github.com/nf-core/eager/issues/794) Aligned default test profile with nf-core standards (`test_tsv` is now `test`) ### `Dependencies` - Bumped python: 3.7.3 -> 3.9.4 - Bumped markdown: 3.2.2 -> 3.3.4 - Bumped pymdown-extensions: 7.1 -> 8.2 - Bumped pygments: 2.6.1 -> 2.9.0 - Bumped adapterremoval: 2.3.1 -> 2.3.2 - Bumped picard: 2.22.9 -> 2.26.0 - Bumped samtools 1.9 -> 1.12 - Bumped angsd: 0.933 -> 0.935 - Bumped gatk4: 4.1.7.0 -> 4.2.0.0 - Bumped multiqc: 1.10.1 -> 1.11 - Bumped bedtools 2.29.2 -> 2.30.0 - Bumped libiconv: 1.15 -> 1.16 - Bumped preseq: 2.0.3 -> 3.1.2 - Bumped bamutil: 1.0.14 -> 1.0.15 - Bumped pysam: 0.15.4 -> 0.16.0 - Bumped kraken2: 2.1.1 -> 2.1.2 - Bumped pandas: 1.0.4 -> 1.2.4 - Bumped freebayes: 1.3.2 -> 1.3.5 - Bumped biopython: 1.76 -> 1.79 - Bumped xopen: 0.9.0 -> 1.1.0 - Bumped bowtie2: 2.4.2 -> 2.4.4 - Bumped mapdamage2: 2.2.0 -> 2.2.1 - Bumped bbmap: 38.87 -> 38.92 - Added bcftools: 1.12 ### `Deprecated` ## [2.3.5] - 2021-06-03 ### `Added` - [#722](https://github.com/nf-core/eager/issues/722) - Adds bwa `-o` flag for more flexibility in bwa parameters - [#736](https://github.com/nf-core/eager/issues/736) - Add printing of multiqc run report location on successful completion - New logo that is more visible when a user is using darkmode on GitHub or nf-core website!
### `Fixed` - [#723](https://github.com/nf-core/eager/issues/723) - Fixes empty fields in TSV resulting in uninformative error - Updated template to nf-core/tools 1.14 - [#688](https://github.com/nf-core/eager/issues/688) - Clarified the pipeline is not just for humans and microbes, but also plants and animals, and also for modern DNA - [#751](https://github.com/nf-core/eager/pull/751) - Added missing label to mtnucratio - General code cleanup and standardisation of parameters with no default setting - [#750](https://github.com/nf-core/eager/issues/750) - Fixed piped commands requesting the same number of CPUs at each command step - [#757](https://github.com/nf-core/eager/issues/757) - Removed confusing 'Data Type' variable from MultiQC workflow summary (not consistent with TSV input) - [#759](https://github.com/nf-core/eager/pull/759) - Fixed malformed software scraping regex that resulted in N/A in MultiQC report - [#761](https://github.com/nf-core/eager/pull/759) - Fixed issues related to instability of samtools filtering related CI tests ### `Dependencies` ### `Deprecated` ## [2.3.4] - 2021-05-05 ### `Added` - [#729](https://github.com/nf-core/eager/issues/729) - Added Bowtie2 flag `--maxins` for PE mapping modern DNA mapping contexts ### `Fixed` - Corrected explanation of the "--min_adap_overlap" parameter for AdapterRemoval in the docs - [#725](https://github.com/nf-core/eager/pull/725) - `bwa_index` doc update - Re-adds gzip piping to AdapterRemovalFixPrefix to speed up process after reports of being very slow - Updated DamageProfiler citation from bioRxiv to publication ### `Dependencies` - Removed pinning of `tbb` (upstream bug in bioconda fixed) - Bumped `pigz` to 2.6 to fix rare stall bug when compressing data after AdapterRemoval - Bumped Bowtie2 to 2.4.2 to fix issues with `tbb` version ### `Deprecated` ## [2.3.3] - 2021-04-08 ### `Added` - [#349](https://github.com/nf-core/eager/issues/349) - Added option enabling platypus formatted output of pmdtools 
misincorporation frequencies. ### `Fixed` - [#719](https://github.com/nf-core/eager/pull/719) - Fix filename for bam output of `mapdamage_rescaling` - [#707](https://github.com/nf-core/eager/pull/707) - Fix typo in UnifiedGenotyper IndelRealigner command - Fixed some Java tools not following process memory specifications - Updated template to nf-core/tools 1.13.2 - [#711](https://github.com/nf-core/eager/pull/711) - Fix conditional execution preventing multivcfanalyzer from running - [#714](https://github.com/nf-core/eager/issues/714) - Fixes bug in nuc contamination by upgrading to latest MultiQC v1.10.1 bugfix release ### `Dependencies` ### `Deprecated` ## [2.3.2] - 2021-03-16 ### `Added` - [#687](https://github.com/nf-core/eager/pull/687) - Adds Kraken2 unique kmer counting report - [#676](https://github.com/nf-core/eager/issues/676) - Refactor help message / summary message formatting to automatic versions using nf-core library - [#682](https://github.com/nf-core/eager/issues/682) - Add AdapterRemoval `--qualitymax` flag to allow FASTQ Phred score range max more than 41 ### `Fixed` - [#666](https://github.com/nf-core/eager/issues/666) - Fixed input file staging for `print_nuclear_contamination` - [#631](https://github.com/nf-core/eager/issues/631) - Update minimum Nextflow version to 20.07.1, due to unfortunate bug in Nextflow 20.04.1 causing eager to crash if patch pulled - Made MultiQC crash behaviour stricter when dealing with large datasets, as reported by @ashildv - [#652](https://github.com/nf-core/eager/issues/652) - Added note to documentation that when using `--skip_collapse` this will use _paired-end_ alignment mode with mappers when using PE data - [#626](https://github.com/nf-core/eager/issues/626) - Add additional checks to ensure pipeline will give useful error if cells of a TSV column are empty -
[#673](https://github.com/nf-core/eager/pull/673) - Fix Kraken database loading when loading from directory instead of compressed file - [#688](https://github.com/nf-core/eager/issues/668) - Allow pipeline to complete, even if Qualimap crashes due to an empty or corrupt BAM file for one sample/library - [#683](https://github.com/nf-core/eager/pull/683) - Sets `--igenomes_ignore` to true by default, as rarely used by users currently and makes resolving configs less complex - Added exit code `140` to re-tryable exit code list to account for certain scheduler wall-time limit fails - [#672](https://github.com/nf-core/eager/issues/672) - Removed java parameter from picard tools which could cause memory issues - [#679](https://github.com/nf-core/eager/issues/679) - Refactor within-process bash conditions to groovy/nextflow, due to incompatibility with some servers environments - [#690](https://github.com/nf-core/eager/pull/690) - Fixed ANGSD output mode for beagle by setting `-doMajorMinor 1` as default in that case - [#693](https://github.com/nf-core/eager/issues/693) - Fixed broken TSV input validation for the Colour Chemistry column - [#695](https://github.com/nf-core/eager/issues/695) - Fixed incorrect `-profile` order in tutorials (originally written reversed due to [nextflow bug](https://github.com/nextflow-io/nextflow/issues/1792)) - [#653](https://github.com/nf-core/eager/issues/653) - Fixed file collision errors with sexdeterrmine for two same-named libraries with different strandedness ### `Dependencies` - Bumped MultiQC to 1.10 for improved functionality - Bumped HOPS to 0.35 for MultiQC 1.10 compatibility ### `Deprecated` ## [2.3.1] - 2021-01-14 ### `Added` ### `Fixed` - [#654](https://github.com/nf-core/eager/issues/654) - Fixed some values in JSON schema (used in launch GUI) not passing validation checks during run - [#655](https://github.com/nf-core/eager/issues/655) - Updated read groups for all mappers to allow proper GATK validation - Fixed issue with 
Docker container not being pullable by Nextflow due to version-number inconsistencies ### `Dependencies` ### `Deprecated` ## [2.3.0] - Aalen - 2021-01-11 ### `Added` - [#640](https://github.com/nf-core/eager/issues/640) - Added a pre-metagenomic screening filtering of low-sequence complexity reads with `bbduk` - [#583](https://github.com/nf-core/eager/issues/583) - Added `mapDamage2` rescaling of BAM files to remove damage - Updated usage (merging files) and workflow images reflecting new functionality. ### `Fixed` - Removed leftover old DockerHub push CI commands. - [#627](https://github.com/nf-core/eager/issues/627) - Added de Barros Damgaard citation to README - [#630](https://github.com/nf-core/eager/pull/630) - Better handling of Qualimap memory requirements and error strategy. - Fixed some incomplete schema options to ensure users supply valid input values - [#638](https://github.com/nf-core/eager/issues/638#issuecomment-748877567) Fixed inverted circularfilter filtering (previously filtering would happen by default, not when requested by user as originally recorded in documentation) - [DeDup:](https://github.com/apeltzer/DeDup/commit/07d47868f10a6830da8c9161caa3755d9da155bf) Fixed Null Pointer Bug in DeDup by updating to 0.12.8 version - [#650](https://github.com/nf-core/eager/pull/650) - Increased memory given to FastQC for larger files by making it multithreaded ### `Dependencies` - Update: DeDup v0.12.7 to v0.12.8 ### `Deprecated` ## [2.2.2] - 2020-12-09 ### `Added` - Added large scale 'stress-test' profile for AWS (using de Barros Damgaard et al. 2018's 137 ancient human genomes). - This will now be run automatically for every release. All processed data will be available on the nf-core website: - You can run this yourself using `-profile test_full` ### `Fixed` - Fixed AWS full test profile. 
- [#587](https://github.com/nf-core/eager/issues/587) - Re-implemented AdapterRemovalFixPrefix for DeDup compatibility of including singletons - [#602](https://github.com/nf-core/eager/issues/602) - Added the newly available GATK 3.5 conda package. - [#610](https://github.com/nf-core/eager/issues/610) - Create bwa_index channel when specifying circularmapper as mapper - Updated template to nf-core/tools 1.12.1 - General documentation improvements ### `Deprecated` - Flag `--gatk_ug_jar` has now been removed as GATK 3.5 is now available within the nf-core/eager software environment. ## [2.2.1] - 2020-10-20 ### `Fixed` - [#591](https://github.com/nf-core/eager/issues/591) - Fixed offset underlines in lane merging diagram in docs - [#592](https://github.com/nf-core/eager/issues/592) - Fixed issue where supplying Bowtie2 index reported missing bwamem_index error - [#590](https://github.com/nf-core/eager/issues/592) - Removed redundant dockstore.yml from root - [#596](https://github.com/nf-core/eager/issues/596) - Add workaround for issue regarding gzipped FASTAs and pre-built indices - [#589](https://github.com/nf-core/eager/issues/582) - Updated template to nf-core/tools 1.11 - [#582](https://github.com/nf-core/eager/issues/582) - Clarify memory limit issue on FAQ ## [2.2.0] - Ulm - 2020-10-20 ### `Added` - **Major** Automated cloud tests with large-scale data on [AWS](https://aws.amazon.com/) - **Major** Re-wrote input logic to accept a TSV 'map' file in addition to direct paths to FASTQ files - **Major** Added JSON Schema, enabling web GUI for configuration of pipeline available [here](https://nf-co.re/launch?pipeline=eager&release=2.2.0) - **Major** Lane and library merging implemented - When using TSV input, one library with multiple _lanes_ will be merged together before mapping - Strip FASTQ will also produce a lane merged 'raw' but 'stripped' FASTQ file - When using TSV input, one sample with multiple (same treatment) libraries will be merged together -
Important: direct FASTQ paths will not have this functionality. TSV is required. - [#40](https://github.com/nf-core/eager/issues/40) - Added the pileupCaller genotyper from [sequenceTools](https://github.com/stschiff/sequenceTools) - Added validation check and clearer error message when `--fasta_index` is provided and filepath does not end in `.fai`. - Improved error messages - Added ability for automated emails using `mailutils` to also send MultiQC reports - General documentation additions, cleaning, and updated figures with CC-BY license - Added large 'full size' dataset test-profiles for ancient fish and human contexts - [#257](https://github.com/nf-core/eager/issues/257) - Added the bowtie2 aligner as option for mapping, following Poullet and Orlando 2020 doi: [10.3389/fevo.2020.00105](https://doi.org/10.3389/fevo.2020.00105) - [#451](https://github.com/nf-core/eager/issues/451) - Adds ANGSD genotype likelihood calculations as an alternative to typical 'genotypers' - [#566](https://github.com/nf-core/eager/issues/466) - Add tutorials on how to set up nf-core/eager for different contexts - Nuclear contamination results are now shown in the MultiQC report - Tutorial on how to use profiles for reproducible science (i.e. parameter sharing between different groups) - [#522](https://github.com/nf-core/eager/issues/522) - Added post-mapping length filter to assist in more realistic endogenous DNA calculations - [#512](https://github.com/nf-core/eager/issues/512) - Added flexible trimming of BAMs by library type. 'half' and 'none' UDG libraries can now be trimmed differentially within a single eager run.
- Added a `.dockstore.yml` config file for automatic workflow registration with [dockstore.org](https://dockstore.org/) - Updated template to nf-core/tools 1.10.2 - [#544](https://github.com/nf-core/eager/pull/544) - Add script to perform bam filtering on fragment length - [#456](https://github.com/nf-core/eager/pull/546) - Bumps the base (default) runtime of all processes to 4 hours, and set shorter time limits for test profiles (1 hour) - [#552](https://github.com/nf-core/eager/issues/552) - Adds optional creation of MALT SAM files alongside RMA6 files - Added eigenstrat snp coverage statistics to MultiQC report. Process results are published in `genotyping/*_eigenstrat_coverage.txt`. ### `Fixed` - [#368](https://github.com/nf-core/eager/issues/368) - Fixed the profile `test` to contain a parameter for `--paired_end` - Mini bugfix for typo in line 1260+1261 - [#374](https://github.com/nf-core/eager/issues/374) - Fixed output documentation rendering not containing images - [#379](https://github.com/nf-core/eager/issues/378) - Fixed insufficient memory requirements for FASTQC edge case - [#390](https://github.com/nf-core/eager/issues/390) - Renamed clipped/merged output directory to be more descriptive - [#398](https://github.com/nf-core/eager/issues/498) - Stopped incompatible FASTA indexes being accepted - [#400](https://github.com/nf-core/eager/issues/400) - Set correct recommended bwa mapping parameters from [Schubert et al. 2012](https://doi.org/10.1186/1471-2164-13-178) - [#410](https://github.com/nf-core/eager/issues/410) - Fixed nf-core/configs not being loaded properly - [#473](https://github.com/nf-core/eager/issues/473) - Fixed bug in sexdet_process on AWS - [#444](https://github.com/nf-core/eager/issues/444) - Provide option for preserving realigned bam + index - Fixed deduplication output logic. 
Will now pass along only the post-rmdup bams if duplicate removal is not skipped, instead of both the post-rmdup and pre-rmdup bams - [#497](https://github.com/nf-core/eager/issues/497) - Simplifies number of parameters required to run bam filtering - [#501](https://github.com/nf-core/eager/issues/501) - Adds additional validation checks for MALT/MaltExtract database input files - [#508](https://github.com/nf-core/eager/issues/508) - Made Markduplicates default dedupper due to narrower context specificity of dedup - [#516](https://github.com/nf-core/eager/issues/516) - Made bedtools not report out of memory exit code when warning of inconsistent FASTA/Bed entry names - [#504](https://github.com/nf-core/eager/issues/504) - Removed uninformative sexdeterrmine-snps plot from MultiQC report. - Nuclear contamination is now reported with the correct library names. - [#531](https://github.com/nf-core/eager/pull/531) - Renamed 'FASTQ stripping' to 'host removal' - Merged all tutorials and FAQs into `usage.md` for display on [nf-co.re](https://www.nf-co.re) - Corrected header of nuclear contamination table (`nuclear_contamination.txt`). - Fixed a bug with `nSNPs` definition in `print_x_contamination.py`. Number of SNPs now correctly reported - `print_x_contamination.py` now correctly converts all NA values to "N/A" - Increased amount of memory MultiQC by default uses, to account for very large nf-core/eager runs (e.g. 
>1000 samples) ### `Dependencies` - Added sequenceTools (1.4.0.6) that adds the ability to do genotyping with the 'pileupCaller' - Latest version of DeDup (0.12.6) which now reports mapped reads after deduplication - [#560](https://github.com/nf-core/eager/issues/560) Latest version of Dedup (0.12.7), which now correctly reports deduplication statistics based on calculations of mapped reads only (prior denominator was total reads of BAM file) - Latest version of ANGSD (0.933) which doesn't seg fault when running contamination on BAMs with insufficient reads - Latest version of MultiQC (1.9) with support for lots of extra tools in the pipeline (MALT, SexDetERRmine, DamageProfiler, MultiVCFAnalyzer) - Latest versions of Pygments (7.1), Pymdown-Extensions (2.6.1) and Markdown (3.2.2) for documentation output - Latest version of Picard (2.22.9) - Latest version of GATK4 (4.1.7.0) - Latest version of sequenceTools (1.4.0.6) - Latest version of fastP (0.20.1) - Latest version of Kraken2 (2.0.9beta) - Latest version of FreeBayes (1.3.2) - Latest version of xopen (0.9.0) - Added Bowtie 2 (2.4.1) - Latest version of Sex.DetERRmine (1.1.2) - Latest version of endorS.py (0.4) ## [2.1.0] - Ravensburg - 2020-03-05 ### `Added` - Added Support for automated tests using [GitHub Actions](https://github.com/features/actions), replacing travis - [#40](https://github.com/nf-core/eager/issues/40), [#231](https://github.com/nf-core/eager/issues/231) - Added genotyping capability through GATK UnifiedGenotyper (v3.5), GATK HaplotypeCaller (v4.1) and FreeBayes - Added MultiVCFAnalyzer module - [#240](https://github.com/nf-core/eager/issues/240) - Added human sex determination module - [#226](https://github.com/nf-core/eager/issues/226) - Added `--preserve5p` function for AdapterRemoval - [#212](https://github.com/nf-core/eager/issues/212) - Added ability to use only merged reads downstream from AdapterRemoval - [#265](https://github.com/nf-core/eager/issues/265) - Adjusted full markdown 
linting in Travis CI - [#247](https://github.com/nf-core/eager/issues/247) - Added nuclear contamination with angsd - [#258](https://github.com/nf-core/eager/issues/258) - Added ability to report bedtools stats to features (e.g. depth/breadth of annotated genes) - [#249](https://github.com/nf-core/eager/issues/249) - Added metagenomic classification of unmapped reads with MALT and aDNA authentication with MaltExtract - [#302](https://github.com/nf-core/eager/issues/302) - Added mitochondrial to nuclear ratio calculation - [#302](https://github.com/nf-core/eager/issues/302) - Added VCF2Genome for consensus sequence generation - Fancy new logo from [ZandraFagernas](https://github.com/ZandraFagernas) - [#286](https://github.com/nf-core/eager/issues/286) - Adds pipeline-specific profiles (loaded from nf-core configs) - [#310](https://github.com/nf-core/eager/issues/310) - Generalises base.config - [#326](https://github.com/nf-core/eager/pull/326) - Add Biopython and [xopen](https://github.com/marcelm/xopen/) dependencies - [#336](https://github.com/nf-core/eager/issues/336) - Change default Y-axis maximum value of DamageProfiler to 30% to match popular (but slower) mapDamage, and allow user to set their own value. - [#352](https://github.com/nf-core/eager/pull/352) - Add social preview image - [#355](https://github.com/nf-core/eager/pull/355) - Add Kraken2 metagenomics classifier - [#90](https://github.com/nf-core/eager/issues/90) - Added endogenous DNA calculator (original repository: [https://github.com/aidaanva/endorS.py/](https://github.com/aidaanva/endorS.py/)) ### `Fixed` - [#227](https://github.com/nf-core/eager/issues/227) - Large re-write of input/output process logic to allow maximum flexibility. Originally to address [#227](https://github.com/nf-core/eager/issues/227), but further expanded - Fixed Travis-Ci.org to Travis-Ci.com migration issues - [#266](https://github.com/nf-core/eager/issues/266) - Added sanity checks for input filetypes (i.e. 
only BAM files can be supplied if `--bam`) - [#237](https://github.com/nf-core/eager/issues/237) - Fixed and Updated script scrape_software_versions - [#322](https://github.com/nf-core/eager/pull/322) - Move extract map reads fastq compression to pigz - [#327](https://github.com/nf-core/eager/pull/327) - Speed up strip_input_fastq process and make it more robust - [#342](https://github.com/nf-core/eager/pull/342) - Updated to match nf-core tools 1.8 linting guidelines - [#339](https://github.com/nf-core/eager/issues/339) - Converted unnecessary zcat + gzip to just cat for a performance boost - [#344](https://github.com/nf-core/eager/issues/344) - Fixed pipeline still trying to run when using old nextflow version ### `Dependencies` - adapterremoval=2.2.2 upgraded to 2.3.1 - adapterremovalfixprefix=0.0.4 upgraded to 0.0.5 - damageprofiler=0.4.3 upgraded to 0.4.9 - angsd=0.923 upgraded to 0.931 - gatk4=4.1.2.0 upgraded to 4.1.4.1 - mtnucratio=0.5 upgraded to 0.6 - conda-forge::markdown=3.1.1 upgraded to 3.2.1 - bioconda::fastqc=0.11.8 upgraded to 0.11.9 - bioconda::picard=2.21.4 upgraded to 2.22.0 - bioconda::bedtools=2.29.0 upgraded to 2.29.2 - pysam=0.15.3 upgraded to 0.15.4 - conda-forge::pandas=1.0.0 upgraded to 1.0.1 - bioconda::freebayes=1.3.1 upgraded to 1.3.2 - conda-forge::biopython=1.75 upgraded to 1.76 ## [2.0.7] - 2019-06-10 ### `Added` - [#189](https://github.com/nf-core/eager/pull/189) - Outputting unmapped reads in a fastq files with the --strip_input_fastq flag - [#186](https://github.com/nf-core/eager/pull/186) - Make FastQC skipping [possible](https://github.com/nf-core/eager/issues/182) - Merged in [nf-core/tools](https://github.com/nf-core/tools) release V1.6 template changes - A lot more automated tests using Travis CI - Don't ignore DamageProfiler errors any more - [#220](https://github.com/nf-core/eager/pull/220) - Added post-mapping filtering statistics module and corresponding MultiQC statistics 
[#217](https://github.com/nf-core/eager/issues/217) ### `Fixed` - [#152](https://github.com/nf-core/eager/pull/152) - DamageProfiler errors [won't crash entire pipeline any more](https://github.com/nf-core/eager/issues/171) - [#176](https://github.com/nf-core/eager/pull/176) - Increase runtime for DamageProfiler on [large reference genomes](https://github.com/nf-core/eager/issues/173) - [#172](https://github.com/nf-core/eager/pull/152) - DamageProfiler errors [won't crash entire pipeline any more](https://github.com/nf-core/eager/issues/171) - [#174](https://github.com/nf-core/eager/pull/190) - Publish DeDup files [properly](https://github.com/nf-core/eager/issues/183) - [#196](https://github.com/nf-core/eager/pull/196) - Fix reference [issues](https://github.com/nf-core/eager/issues/150) - [#196](https://github.com/nf-core/eager/pull/196) - Fix issues with PE data being mapped incompletely - [#200](https://github.com/nf-core/eager/pull/200) - Fix minor issue with some [typos](https://github.com/nf-core/eager/pull/196) - [#210](https://github.com/nf-core/eager/pull/210) - Fix PMDTools [encoding issue](https://github.com/pontussk/PMDtools/issues/6) from `samtools calmd` generated files by running through `samtools view` first - [#221](https://github.com/nf-core/eager/pull/221) - Fix BWA Index [not being reused by multiple samples](https://github.com/nf-core/eager/issues/219) ### `Dependencies` - Added DeDup v0.12.5 (json support) - Added mtnucratio v0.5 (json support) - Updated Picard 2.18.27 -> 2.20.2 - Updated GATK 4.1.0.0 -> 4.1.2.0 - Updated damageprofiler 0.4.4 -> 0.4.5 - Updated r-rmarkdown 1.11 -> 1.12 - Updated fastp 0.19.7 -> 0.20.0 - Updated qualimap 2.2.2b -> 2.2.2c ## [2.0.6] - 2019-03-05 ### `Added` - [#152](https://github.com/nf-core/eager/pull/152) - Clarified `--complexity_filter` flag to be specifically for poly G trimming.
- [#155](https://github.com/nf-core/eager/pull/155) - Added [Dedup log to output folders](https://github.com/nf-core/eager/issues/154) - [#159](https://github.com/nf-core/eager/pull/159) - Added Possibility to skip AdapterRemoval, skip merging, skip trimming fixing [#64](https://github.com/nf-core/eager/issues/64),[#137](https://github.com/nf-core/eager/issues/137) - thanks to @maxibor, @jfy133 ### `Fixed` - [#151](https://github.com/nf-core/eager/pull/151) - Fixed [post-deduplication step errors](https://github.com/nf-core/eager/issues/128) - [#147](https://github.com/nf-core/eager/pull/147) - Fix Samtools Index for [large references](https://github.com/nf-core/eager/issues/146) - [#145](https://github.com/nf-core/eager/pull/145) - Added Picard Memory Handling [fix](https://github.com/nf-core/eager/issues/144) ### `Dependencies` - Picard Tools 2.18.23 -> 2.18.27 - GATK 4.0.12.0 -> 4.1.0.0 - FastP 0.19.6 -> 0.19.7 ## [2.0.5] - 2019-01-28 ### `Added` - [#127](https://github.com/nf-core/eager/pull/127) - Added a second test case for testing the pipeline properly - [#129](https://github.com/nf-core/eager/pull/129) - Support BAM files as [input format](https://github.com/nf-core/eager/issues/41) - [#131](https://github.com/nf-core/eager/pull/131) - Support different [reference genome file extensions](https://github.com/nf-core/eager/issues/130) ### `Fixed` - [#128](https://github.com/nf-core/eager/issues/128) - Fixed reference genome handling errors ### `Dependencies` - Picard Tools 2.18.21 -> 2.18.23 - R-Markdown 1.10 -> 1.11 - FastP 0.19.5 -> 0.19.6 ## [2.0.4] - 2019-01-09 ### `Added` - [#111](https://github.com/nf-core/eager/pull/110) - Allow [Zipped FastA reference input](https://github.com/nf-core/eager/issues/91) - [#113](https://github.com/nf-core/eager/pull/113) - All files are now staged via channels, which is considered best practice by Nextflow - [#114](https://github.com/nf-core/eager/pull/113) - Add proper runtime defaults for multiple processes - 
[#118](https://github.com/nf-core/eager/pull/118) - Add [centralized configs handling](https://github.com/nf-core/configs) - [#115](https://github.com/nf-core/eager/pull/115) - Add DamageProfiler MultiQC support - [#122](https://github.com/nf-core/eager/pull/122) - Add pulling from Dockerhub again ### `Fixed` - [#110](https://github.com/nf-core/eager/pull/110) - Fix for [MultiQC Missing Second FastQC report](https://github.com/nf-core/eager/issues/107) - [#112](https://github.com/nf-core/eager/pull/112) - Remove [redundant UDG options](https://github.com/nf-core/eager/issues/89) ## [2.0.3] - 2018-12-12 ### `Added` - [#80](https://github.com/nf-core/eager/pull/80) - BWA Index file handling - [#77](https://github.com/nf-core/eager/pull/77) - Lots of documentation updates by [@jfy133](https://github.com/jfy133) - [#81](https://github.com/nf-core/eager/pull/81) - Renaming of certain BAM options - [#92](https://github.com/nf-core/eager/issues/92) - Complete restructure of BAM options ### `Fixed` - [#84](https://github.com/nf-core/eager/pull/85) - Fix for [Samtools index issues](https://github.com/nf-core/eager/issues/84) - [#96](https://github.com/nf-core/eager/issues/96) - Fix for [MarkDuplicates issues](https://github.com/nf-core/eager/issues/96) found by [@nilesh-tawari](https://github.com/nilesh-tawari) ### Other - Added Slack button to repository readme ## [2.0.2] - 2018-11-03 ### `Changed` - [#70](https://github.com/nf-core/eager/issues/70) - Uninitialized `readPaths` warning removed ### `Added` - [#73](https://github.com/nf-core/eager/pull/73) - Travis CI Testing of Conda Environment added ### `Fixed` - [#72](https://github.com/nf-core/eager/issues/72) - iconv Issue with R in conda environment ## [2.0.1] - 2018-11-02 ### `Fixed` - [#69](https://github.com/nf-core/eager/issues/67) - FastQC issues with conda environments ## [2.0.0] "Kaufbeuren" - 2018-10-17 Initial release of nf-core/eager: ### `Added` - FastQC read quality control - (Optional) Read complexity 
filtering with FastP - Read merging and clipping using AdapterRemoval v2 - Mapping using BWA / BWA Mem or CircularMapper - Library Complexity Estimation with Preseq - Conversion and Filtering of BAM files using Samtools - Damage assessment via DamageProfiler, additional filtering using PMDTools - Duplication removal via DeDup - BAM Clipping with BamUtil for UDGhalf protocols - QualiMap BAM quality control analysis Furthermore, this already creates an interactive report using MultiQC, which will be upgraded in V2.1 "Ulm" to contain more aDNA specific metrics. ================================================ FILE: CODE_OF_CONDUCT.md ================================================ # Code of Conduct at nf-core (v1.0) ## Our Pledge In the interest of fostering an open, collaborative, and welcoming environment, we as contributors and maintainers of nf-core, pledge to make participation in our projects and community a harassment-free experience for everyone, regardless of: - Age - Body size - Familial status - Gender identity and expression - Geographical location - Level of experience - Nationality and national origins - Native language - Physical and neurological ability - Race or ethnicity - Religion - Sexual identity and orientation - Socioeconomic status Please note that the list above is alphabetised and is therefore not ranked in any order of preference or importance. ## Preamble > Note: This Code of Conduct (CoC) has been drafted by the nf-core Safety Officer and been edited after input from members of the nf-core team and others. "We", in this document, refers to the Safety Officer and members of the nf-core core team, both of whom are deemed to be members of the nf-core community and are therefore required to abide by this Code of Conduct. This document will be amended periodically to keep it up-to-date, and in case of any dispute, the most current version will apply.
An up-to-date list of members of the nf-core core team can be found [here](https://nf-co.re/about). Our current safety officer is Renuka Kudva. nf-core is a young and growing community that welcomes contributions from anyone with a shared vision for [Open Science Policies](https://www.fosteropenscience.eu/taxonomy/term/8). Open science policies encompass inclusive behaviours and we strive to build and maintain a safe and inclusive environment for all individuals. We have therefore adopted this code of conduct (CoC), which we require all members of our community and attendees in nf-core events to adhere to in all our workspaces at all times. Workspaces include but are not limited to Slack, meetings on Zoom, Jitsi, YouTube live etc. Our CoC will be strictly enforced and the nf-core team reserve the right to exclude participants who do not comply with our guidelines from our workspaces and future nf-core activities. We ask all members of our community to help maintain a supportive and productive workspace and to avoid behaviours that can make individuals feel unsafe or unwelcome. Please help us maintain and uphold this CoC. Questions, concerns or ideas on what we can include? Contact safety [at] nf-co [dot] re ## Our Responsibilities The safety officer is responsible for clarifying the standards of acceptable behavior and is expected to take appropriate and fair corrective action in response to any instances of unacceptable behaviour. The safety officer in consultation with the nf-core core team has the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. Members of the core team or the safety officer who violate the CoC will be required to recuse themselves pending investigation.
They will not have access to any reports of the violations and will be subject to the same actions as others in violation of the CoC. ## When and where does this Code of Conduct apply? Participation in the nf-core community is contingent on following these guidelines in all our workspaces and events. This includes but is not limited to the following listed alphabetically and therefore in no order of preference: - Communicating with an official project email address. - Communicating with community members within the nf-core Slack channel. - Participating in hackathons organised by nf-core (both online and in-person events). - Participating in collaborative work on GitHub, Google Suite, community calls, mentorship meetings, email correspondence. - Participating in workshops, training, and seminar series organised by nf-core (both online and in-person events). This applies to events hosted on web-based platforms such as Zoom, Jitsi, YouTube live etc. - Representing nf-core on social media. This includes both official and personal accounts. ## nf-core cares 😊 nf-core's CoC and expectations of respectful behaviours for all participants (including organisers and the nf-core team) include but are not limited to the following (listed in alphabetical order): - Ask for consent before sharing another community member’s personal information (including photographs) on social media. - Be respectful of differing viewpoints and experiences. We are all here to learn from one another and a difference in opinion can present a good learning opportunity. - Celebrate your accomplishments at events! (Get creative with your use of emojis 🎉 🥳 💯 🙌 !) - Demonstrate empathy towards other community members. (We don’t all have the same amount of time to dedicate to nf-core. If tasks are pending, don’t hesitate to gently remind members of your team. If you are leading a task, ask for help if you feel overwhelmed.) - Engage with and enquire after others.
(This is especially important given the geographically remote nature of the nf-core community, so let’s do this the best we can) - Focus on what is best for the team and the community. (When in doubt, ask) - Graciously accept constructive criticism, yet be unafraid to question, deliberate, and learn. - Introduce yourself to members of the community. (We’ve all been outsiders and we know that talking to strangers can be hard for some, but remember we’re interested in getting to know you and your visions for open science!) - Show appreciation and **provide clear feedback**. (This is especially important because we don’t see each other in person and it can be harder to interpret subtleties. Also remember that not everyone understands a certain language to the same extent as you do, so **be clear in your communications to be kind.**) - Take breaks when you feel like you need them. - Use welcoming and inclusive language. (Participants are encouraged to display their chosen pronouns on Zoom or in communication on Slack.) ## nf-core frowns on 😕 The following behaviours from any participants within the nf-core community (including the organisers) will be considered unacceptable under this code of conduct. Engaging or advocating for any of the following could result in expulsion from nf-core workspaces. - Deliberate intimidation, stalking or following and sustained disruption of communication among participants of the community. This includes hijacking shared screens through actions such as using the annotate tool in conferencing software such as Zoom. - “Doxing” i.e. posting (or threatening to post) another person’s personal identifying information online. - Spamming or trolling of individuals on social media. - Use of sexual or discriminatory imagery, comments, or jokes and unwelcome sexual attention.
- Verbal and text comments that reinforce social structures of domination related to gender, gender identity and expression, sexual orientation, ability, physical appearance, body size, race, age, religion or work experience. ### Online Trolling The majority of nf-core interactions and events are held online. Unfortunately, holding events online comes with the added issue of online trolling. This is unacceptable, reports of such behaviour will be taken very seriously, and perpetrators will be excluded from activities immediately. All community members are required to ask members of the group they are working within for explicit consent prior to taking screenshots of individuals during video calls. ## Procedures for Reporting CoC violations If someone makes you feel uncomfortable through their behaviours or actions, report it as soon as possible. You can reach out to members of the [nf-core core team](https://nf-co.re/about) and they will forward your concerns to the safety officer(s). Issues directly concerning members of the core team will be dealt with by other members of the core team and the safety manager, and possible conflicts of interest will be taken into account. nf-core is also in discussions about having an ombudsperson, and details will be shared in due course. All reports will be handled with utmost discretion and confidentiality.
## Attribution and Acknowledgements - The [Contributor Covenant, version 1.4](http://contributor-covenant.org/version/1/4) - The [OpenCon 2017 Code of Conduct](http://www.opencon2017.org/code_of_conduct) (CC BY 4.0 OpenCon organisers, SPARC and Right to Research Coalition) - The [eLife innovation sprint 2020 Code of Conduct](https://sprint.elifesciences.org/code-of-conduct/) - The [Mozilla Community Participation Guidelines v3.1](https://www.mozilla.org/en-US/about/governance/policies/participation/) (version 3.1, CC BY-SA 3.0 Mozilla) ## Changelog ### v1.0 - March 12th, 2021 - Complete rewrite from original [Contributor Covenant](http://contributor-covenant.org/) CoC. ================================================ FILE: Dockerfile ================================================ FROM nfcore/base:1.14 LABEL authors="The nf-core/eager community" \ description="Docker image containing all software requirements for the nf-core/eager pipeline" # Install the conda environment COPY environment.yml / RUN conda env create --quiet -f /environment.yml && conda clean -a # Add conda installation dir to PATH (instead of doing 'conda activate') ENV PATH /opt/conda/envs/nf-core-eager-2.5.3/bin:$PATH # Dump the details of the installed packages to a file for posterity RUN conda env export --name nf-core-eager-2.5.3 > nf-core-eager-2.5.3.yml ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) The nf-core/eager community Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission 
notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # ![nf-core/eager](docs/images/nf-core_eager_logo_outline_drop.png) **A fully reproducible and state-of-the-art ancient DNA analysis pipeline**. [![GitHub Actions CI Status](https://github.com/nf-core/eager/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/eager/actions) [![GitHub Actions Linting Status](https://github.com/nf-core/eager/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/eager/actions) [![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A520.07.1-brightgreen.svg)](https://www.nextflow.io/) [![nf-core](https://img.shields.io/badge/nf--core-pipeline-brightgreen.svg)](https://nf-co.re/) [![DOI](https://zenodo.org/badge/135918251.svg)](https://zenodo.org/badge/latestdoi/135918251) [![Published in PeerJ](https://img.shields.io/badge/peerj-published-%2300B2FF)](https://peerj.com/articles/10947/) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](https://bioconda.github.io/) [![Docker](https://img.shields.io/docker/automated/nfcore/eager.svg)](https://hub.docker.com/r/nfcore/eager) ![Singularity Container available](https://img.shields.io/badge/singularity-available-7E4C74.svg) [![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23eager-4A154B?logo=slack)](https://nfcore.slack.com/channels/eager) >[!IMPORTANT] > 
nf-core/eager versions 2.* are only compatible with Nextflow versions up to 22.10.6! ## Introduction **nf-core/eager** is a scalable and reproducible bioinformatics best-practice processing pipeline for genomic NGS data, with a focus on ancient DNA (aDNA) data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes. The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. The pipeline pre-processes raw data from FASTQ inputs, or takes preprocessed BAM inputs. It can align reads and perform extensive general NGS and aDNA-specific quality control on the results. It comes with Docker and Singularity containers, as well as a Conda environment, making installation trivial and results highly reproducible.

nf-core/eager schematic workflow ## Quick Start 1. Install [`nextflow`](https://nf-co.re/usage/installation) (`>=20.07.1` && `<=22.10.6`) 2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/), [`Podman`](https://podman.io/), [`Shifter`](https://nersc.gitlab.io/development/shifter/how-to-use/) or [`Charliecloud`](https://hpc.github.io/charliecloud/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_ 3. Download the pipeline and test it on a minimal dataset with a single command: ```bash nextflow run nf-core/eager -profile test, ``` > Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile ` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment. 4. Start running your own analysis! ```bash nextflow run nf-core/eager -profile --input '*_R{1,2}.fastq.gz' --fasta '.fasta' ``` 5. Once your run has completed successfully, clean up the intermediate files. ```bash nextflow clean -f -k ``` See [usage docs](https://nf-co.re/eager/usage) for all of the available options when running the pipeline. **N.B.** You can see an overview of the run in the MultiQC report located at `./results/MultiQC/multiqc_report.html` Modifications to the default pipeline are easily made using various options as described in the documentation. 
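
As a rough illustration of the Nextflow version range given in step 1, the following Python sketch checks whether a version string falls between 20.07.1 and 22.10.6. It is not part of the pipeline: the names `parse_version` and `is_supported_nextflow` are hypothetical, and the sketch assumes a plain `major.minor.patch` string (anything with a suffix such as `-edge` is rejected rather than guessed at).

```python
import re

# Bounds taken from the Quick Start text: >=20.07.1 and <=22.10.6.
MIN_NEXTFLOW = (20, 7, 1)
MAX_NEXTFLOW = (22, 10, 6)


def parse_version(version):
    """Turn a string such as '22.10.6' into a tuple of ints, e.g. (22, 10, 6).

    Hypothetical helper: only plain 'major.minor.patch' strings are accepted.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognised Nextflow version: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_supported_nextflow(version):
    """Return True if the version lies within the range stated in the Quick Start."""
    return MIN_NEXTFLOW <= parse_version(version) <= MAX_NEXTFLOW
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically (for example, `"20.10.0"` sorting before `"20.7.1"` as text). A call such as `is_supported_nextflow("22.10.6")` returns `True`, while `is_supported_nextflow("23.04.0")` returns `False`.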
## Pipeline Summary ### Default Steps By default the pipeline currently performs the following: * Create reference genome indices for mapping (`bwa`, `samtools`, and `picard`) * Sequencing quality control (`FastQC`) * Sequencing adapter removal, paired-end data merging (`AdapterRemoval`) * Read mapping to reference (`bwa aln`, `bwa mem`, `CircularMapper`, or `bowtie2`) * Post-mapping processing, statistics and conversion to bam (`samtools`) * Ancient DNA C-to-T damage pattern visualisation (`DamageProfiler` or `mapDamage`) * PCR duplicate removal (`DeDup` or `MarkDuplicates`) * Post-mapping statistics and BAM quality control (`Qualimap`) * Library Complexity Estimation (`preseq`) * Overall pipeline statistics summaries (`MultiQC`) ### Additional Steps Additional functionality provided by the pipeline currently includes: #### Input * Automatic merging of complex sequencing setups (e.g. multiple lanes, sequencing configurations, library types) #### Preprocessing * Illumina two-coloured sequencer poly-G tail removal (`fastp`) * Post-AdapterRemoval trimming of FASTQ files prior to mapping (`fastp`) * Automatic conversion of unmapped reads to FASTQ (`samtools`) * Host DNA (mapped reads) stripping from input FASTQ files (for sensitive samples) #### aDNA Damage manipulation * Damage removal/clipping for UDG+/UDG-half treatment protocols (`BamUtil`) * Damaged reads extraction and assessment (`PMDTools`) * Nuclear DNA contamination estimation of human samples (`angsd`) #### Genotyping * Creation of VCF genotyping files (`GATK UnifiedGenotyper`, `GATK HaplotypeCaller` and `FreeBayes`) * Creation of EIGENSTRAT genotyping files (`pileupCaller`) * Creation of Genotype Likelihood files (`angsd`) * Consensus sequence FASTA creation (`VCF2Genome`) * SNP Table generation (`MultiVCFAnalyzer`) #### Biological Information * Mitochondrial to Nuclear read ratio calculation (`MtNucRatioCalculator`) * Statistical sex determination of human individuals (`Sex.DetERRmine`) #### 
Metagenomic Screening * Low-sequenced complexity filtering (`BBduk`) * Taxonomic binner with alignment (`MALT`) * Taxonomic binner without alignment (`Kraken2`) * aDNA characteristic screening of taxonomically binned data from MALT (`MaltExtract`) #### Functionality Overview A graphical overview of suggested routes through the pipeline depending on context can be seen below.

nf-core/eager metro map ## Documentation The nf-core/eager pipeline comes with documentation about the pipeline: [usage](https://nf-co.re/eager/usage) and [output](https://nf-co.re/eager/output). 1. [Nextflow installation](https://nf-co.re/usage/installation) 2. Pipeline configuration * [Pipeline installation](https://nf-co.re/usage/local_installation) * [Adding your own system config](https://nf-co.re/usage/adding_own_config) * [Reference genomes](https://nf-co.re/usage/reference_genomes) 3. [Running the pipeline](https://nf-co.re/eager/usage) * This includes tutorials, FAQs, and troubleshooting instructions 4. [Output and how to interpret the results](https://nf-co.re/eager/output) ## Credits This pipeline was mostly written by Alexander Peltzer ([apeltzer](https://github.com/apeltzer)) and [James A. Fellows Yates](https://github.com/jfy133), with contributions from [Stephen Clayton](https://github.com/sc13-bioinf), [Thiseas C. Lamnidis](https://github.com/TCLamnidis), [Maxime Borry](https://github.com/maxibor), [Zandra Fagernäs](https://github.com/ZandraFagernas), [Aida Andrades Valtueña](https://github.com/aidaanva) and [Maxime Garcia](https://github.com/MaxUlysse) and the nf-core community. We thank the following people for their extensive assistance in the development of this pipeline: ## Authors (alphabetical) * [Aida Andrades Valtueña](https://github.com/aidaanva) * [Alexander Peltzer](https://github.com/apeltzer) * [James A. Fellows Yates](https://github.com/jfy133) * [Judith Neukamm](https://github.com/JudithNeukamm) * [Maxime Borry](https://github.com/maxibor) * [Maxime Garcia](https://github.com/MaxUlysse) * [Stephen Clayton](https://github.com/sc13-bioinf) * [Thiseas C. Lamnidis](https://github.com/TCLamnidis) * [Zandra Fagernäs](https://github.com/ZandraFagernas) ## Additional Contributors (alphabetical) Those who have provided conceptual guidance, suggestions, bug reports etc. 
* [Alex Hübner](https://github.com/alexhbnr) * [Alexandre Gilardet](https://github.com/alexandregilardet) * Arielle Munters * [Åshild Vågene](https://github.com/ashildv) * [Asmaa Ali](https://github.com/asmaa-a-abdelwahab) * [Charles Plessy](https://github.com/charles-plessy) * [Elina Salmela](https://github.com/esalmela) * [Fabian Lehmann](https://github.com/Lehmann-Fabian) * [He Yu](https://github.com/paulayu) * [Hester van Schalkwyk](https://github.com/hesterjvs) * [Ido Bar](https://github.com/IdoBar) * [Irina Velsko](https://github.com/ivelsko) * [Işın Altınkaya](https://github.com/isinaltinkaya) * [Johan Nylander](https://github.com/nylander) * [Jonas Niemann](https://github.com/NiemannJ) * [Katerine Eaton](https://github.com/ktmeaton) * [Kathrin Nägele](https://github.com/KathrinNaegele) * [Kevin Lord](https://github.com/lordkev) * [Laura Lacher](https://github.com/neija2611) * [Luc Venturini](https://github.com/lucventurini) * [Mahesh Binzer-Panchal](https://github.com/mahesh-panchal) * [Marcel Keller](https://github.com/marcel-keller) * [Megan Michel](https://github.com/meganemichel) * [Pierre Lindenbaum](https://github.com/lindenb) * [Pontus Skoglund](https://github.com/pontussk) * [Raphael Eisenhofer](https://github.com/EisenRa) * [Roberta Davidson](https://github.com/roberta-davidson) * [Rodrigo Barquera](https://github.com/RodrigoBarquera) * [Selina Carlhoff](https://github.com/scarlhoff) * [Torsten Günter](https://bitbucket.org/tguenther) If you've contributed and you're missing in here, please let us know and we will add you in of course! ## Contributions and Support If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md). For further information or help, don't hesitate to get in touch on the [Slack `#eager` channel](https://nfcore.slack.com/channels/eager) (you can join with [this invite](https://nf-co.re/join/slack)). 
## Citations If you use `nf-core/eager` for your analysis, please cite the `eager` publication as follows: > Fellows Yates JA, Lamnidis TC, Borry M, Valtueña Andrades A, Fagernäs Z, Clayton S, Garcia MU, Neukamm J, Peltzer A. 2021. Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager. PeerJ 9:e10947. DOI: [10.7717/peerj.10947](https://doi.org/10.7717/peerj.10947). You can cite the eager Zenodo record for a specific version using the following [doi: 10.5281/zenodo.3698082](https://zenodo.org/badge/latestdoi/135918251) You can cite the `nf-core` publication as follows: > **The nf-core framework for community-curated bioinformatics pipelines.** > > Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. > > _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). In addition, references of tools and data used in this pipeline are as follows: * **EAGER v1, CircularMapper, DeDup** Peltzer, A., Jäger, G., Herbig, A., Seitz, A., Kniep, C., Krause, J., & Nieselt, K. (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17(1), 1–14. [https://doi.org/10.1186/s13059-016-0918-z](https://doi.org/10.1186/s13059-016-0918-z). Download: [https://github.com/apeltzer/EAGER-GUI](https://github.com/apeltzer/EAGER-GUI) and [https://github.com/apeltzer/EAGER-CLI](https://github.com/apeltzer/EAGER-CLI) * **FastQC** Download: [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) * **AdapterRemoval v2** Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9, 88. [https://doi.org/10.1186/s13104-016-1900-2](https://doi.org/10.1186/s13104-016-1900-2). 
Download: [https://github.com/MikkelSchubert/adapterremoval](https://github.com/MikkelSchubert/adapterremoval) * **bwa** Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760. [https://doi.org/10.1093/bioinformatics/btp324](https://doi.org/10.1093/bioinformatics/btp324). Download: [http://bio-bwa.sourceforge.net/bwa.shtml](http://bio-bwa.sourceforge.net/bwa.shtml) * **SAMtools** Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. [https://doi.org/10.1093/bioinformatics/btp352](https://doi.org/10.1093/bioinformatics/btp352). Download: [http://www.htslib.org/](http://www.htslib.org/) * **DamageProfiler** Neukamm, J., Peltzer, A., & Nieselt, K. (2020). DamageProfiler: Fast damage pattern calculation for ancient DNA. In Bioinformatics (btab190). [https://doi.org/10.1093/bioinformatics/btab190](https://doi.org/10.1093/bioinformatics/btab190). Download: [https://github.com/Integrative-Transcriptomics/DamageProfiler](https://github.com/Integrative-Transcriptomics/DamageProfiler) * **QualiMap** Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292–294. [https://doi.org/10.1093/bioinformatics/btv566](https://doi.org/10.1093/bioinformatics/btv566). Download: [http://qualimap.bioinfo.cipf.es/](http://qualimap.bioinfo.cipf.es/) * **preseq** Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4), 325–327. [https://doi.org/10.1038/nmeth.2375](https://doi.org/10.1038/nmeth.2375). Download: [http://smithlabresearch.org/software/preseq/](http://smithlabresearch.org/software/preseq/) * **PMDTools** Skoglund, P., Northoff, B. H., Shunkov, M. V., Derevianko, A. 
P., Pääbo, S., Krause, J., & Jakobsson, M. (2014). Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal. Proceedings of the National Academy of Sciences of the United States of America, 111(6), 2229–2234. [https://doi.org/10.1073/pnas.1318934111](https://doi.org/10.1073/pnas.1318934111). Download: [https://github.com/pontussk/PMDtools](https://github.com/pontussk/PMDtools) * **MultiQC** Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. [https://doi.org/10.1093/bioinformatics/btw354](https://doi.org/10.1093/bioinformatics/btw354). Download: [https://multiqc.info/](https://multiqc.info/) * **BamUtils** Jun, G., Wing, M. K., Abecasis, G. R., & Kang, H. M. (2015). An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data. Genome Research, 25(6), 918–925. [https://doi.org/10.1101/gr.176552.114](https://doi.org/10.1101/gr.176552.114). Download: [https://genome.sph.umich.edu/wiki/BamUtil](https://genome.sph.umich.edu/wiki/BamUtil) * **FastP** Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. [https://doi.org/10.1093/bioinformatics/bty560](https://doi.org/10.1093/bioinformatics/bty560). Download: [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp) * **GATK 3.5** DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., … Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. [https://doi.org/10.1038/ng.806](https://doi.org/10.1038/ng.806). Download: [https://console.cloud.google.com/storage/browser/gatk](https://console.cloud.google.com/storage/browser/gatk) * **GATK 4.X** - no citation available yet. 
Download: [https://github.com/broadinstitute/gatk/releases](https://github.com/broadinstitute/gatk/releases) * **VCF2Genome** - Alexander Herbig and Alex Peltzer (unpublished). Download: [https://github.com/apeltzer/VCF2Genome](https://github.com/apeltzer/VCF2Genome) * **MultiVCFAnalyzer** Bos, K.I. et al., 2014. Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis. Nature, 514(7523), pp.494–497. Available at: [http://dx.doi.org/10.1038/nature13591](http://dx.doi.org/10.1038/nature13591). Download: [https://github.com/alexherbig/MultiVCFAnalyzer](https://github.com/alexherbig/MultiVCFAnalyzer) * **MTNucRatioCalculator** Alex Peltzer (Unpublished). Download: [https://github.com/apeltzer/MTNucRatioCalculator](https://github.com/apeltzer/MTNucRatioCalculator) * **Sex.DetERRmine.py** Lamnidis, T.C. et al., 2018. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nature communications, 9(1), p.5018. Available at: [http://dx.doi.org/10.1038/s41467-018-07483-5](http://dx.doi.org/10.1038/s41467-018-07483-5). Download: [https://github.com/TCLamnidis/Sex.DetERRmine.git](https://github.com/TCLamnidis/Sex.DetERRmine.git) * **ANGSD** Korneliussen, T.S., Albrechtsen, A. & Nielsen, R., 2014. ANGSD: Analysis of Next Generation Sequencing Data. BMC bioinformatics, 15, p.356. Available at: [http://dx.doi.org/10.1186/s12859-014-0356-4](http://dx.doi.org/10.1186/s12859-014-0356-4). Download: [https://github.com/ANGSD/angsd](https://github.com/ANGSD/angsd) * **bedtools** Quinlan, A.R. & Hall, I.M., 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), pp.841–842. Available at: [http://dx.doi.org/10.1093/bioinformatics/btq033](http://dx.doi.org/10.1093/bioinformatics/btq033). Download: [https://github.com/arq5x/bedtools2/releases](https://github.com/arq5x/bedtools2/) * **MALT**. 
Download: [https://software-ab.informatik.uni-tuebingen.de/download/malt/welcome.html](https://software-ab.informatik.uni-tuebingen.de/download/malt/welcome.html) * Vågene, Å.J. et al., 2018. Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature ecology & evolution, 2(3), pp.520–528. Available at: [http://dx.doi.org/10.1038/s41559-017-0446-6](http://dx.doi.org/10.1038/s41559-017-0446-6). * Herbig, A. et al., 2016. MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, p.050559. Available at: [http://biorxiv.org/content/early/2016/04/27/050559](http://biorxiv.org/content/early/2016/04/27/050559). * **MaltExtract** Huebler, R. et al., 2019. HOPS: Automated detection and authentication of pathogen DNA in archaeological remains. bioRxiv, p.534198. Available at: [https://www.biorxiv.org/content/10.1101/534198v1?rss=1](https://www.biorxiv.org/content/10.1101/534198v1?rss=1). Download: [https://github.com/rhuebler/MaltExtract](https://github.com/rhuebler/MaltExtract) * **Kraken2** Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. Available at: [https://doi.org/10.1186/s13059-019-1891-0](https://doi.org/10.1186/s13059-019-1891-0). Download: [https://ccb.jhu.edu/software/kraken2/](https://ccb.jhu.edu/software/kraken2/) * **endorS.py** Aida Andrades Valtueña (Unpublished). Download: [https://github.com/aidaanva/endorS.py](https://github.com/aidaanva/endorS.py) * **Bowtie2** Langmead, B. and Salzberg, S. L. 2012 Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), p. 357–359. doi: [10.1038/nmeth.1923](https://dx.doi.org/10.1038/nmeth.1923). * **sequenceTools** Stephan Schiffels (Unpublished). Download: [https://github.com/stschiff/sequenceTools](https://github.com/stschiff/sequenceTools) * **EigenstratDatabaseTools** Thiseas C. Lamnidis (Unpublished). 
Download: [https://github.com/TCLamnidis/EigenStratDatabaseTools.git](https://github.com/TCLamnidis/EigenStratDatabaseTools.git) * **mapDamage** Jónsson, H., et al 2013. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics, 29(13), 1682–1684. [https://doi.org/10.1093/bioinformatics/btt193](https://doi.org/10.1093/bioinformatics/btt193) * **BBduk** Brian Bushnell (Unpublished). Download: [https://sourceforge.net/projects/bbmap/](https://sourceforge.net/projects/bbmap/) ## Data References This repository uses test data from the following studies: * Fellows Yates, J. A. et al. (2017) ‘Central European Woolly Mammoth Population Dynamics: Insights from Late Pleistocene Mitochondrial Genomes’, Scientific reports, 7(1), p. 17714. [doi: 10.1038/s41598-017-17723-1](https://doi.org/10.1038/s41598-017-17723-1). * Gamba, C. et al. (2014) ‘Genome flux and stasis in a five millennium transect of European prehistory’, Nature communications, 5, p. 5257. [doi: 10.1038/ncomms6257](https://doi.org/10.1038/ncomms6257). * Star, B. et al. (2017) ‘Ancient DNA reveals the Arctic origin of Viking Age cod from Haithabu, Germany’, Proceedings of the National Academy of Sciences of the United States of America, 114(34), pp. 9152–9157. [doi: 10.1073/pnas.1710186114](https://doi.org/10.1073/pnas.1710186114). * de Barros Damgaard, P. et al. (2018). '137 ancient human genomes from across the Eurasian steppes.', Nature, 557(7705), 369–374. [doi: 10.1038/s41586-018-0094-2](https://doi.org/10.1038/s41586-018-0094-2) ================================================ FILE: assets/angsd_resources/README ================================================ **These files are originally part of angsd (release 0.931). 
They have been added here for convenience.** This file describes how the 'hapmap' and mappability files used by angsd are generated ##download wget http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/2010-08_phaseII+III/allele_freqs_chrX_CEU_r28_nr.b36_fwd.txt.gz wget http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/2010-08_phaseII+III/allele_freqs_chr21_CEU_r28_nr.b36_fwd.txt.gz #with the md5sum a105316eaa2ebbdb3f8d62a9cb10a2d5 allele_freqs_chr21_CEU_r28_nr.b36_fwd.txt.gz 5a0f920951ce2ded4afe2f10227110ac allele_freqs_chrX_CEU_r28_nr.b36_fwd.txt.gz ##create dummy bed file to use the liftover tools gunzip -c allele_freqs_chrX_CEU_r28_nr.b36_fwd.txt.gz| awk '{print $2" "$3-1" "$3" "$11" "$12" "$4" "$14}'|sed 1d >allele.txt ##do the liftover liftOver allele.txt /opt/liftover/hg18ToHg19.over.chain.gz hit nohit ##now remove invariable sites, and redundant columns cut -f1,3 --complement hit |grep -v -P "\t1.0"|grep -v -P "\t0\t"|gzip -c >HapMapchrX.gz ##create dummy bed file to use the liftover tools gunzip -c allele_freqs_chr21_CEU_r28_nr.b36_fwd.txt| awk '{print $2" "$3-1" "$3" "$11" "$12" "$4" "$14}'|sed 1d >allele.txt ##do the liftover liftOver allele.txt /opt/liftover/hg18ToHg19.over.chain.gz hit nohit ##now remove invariable sites, and redundant columns cut -f1,3 --complement hit |grep -v -P "\t1.0"|grep -v -P "\t0\t"|gzip -c >HapMapchr21.gz ####### ##download 100kmer mappability wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign100mer.bigWig #md5sum a1b1a8c99431fedf6a3b4baef028cca4 wgEncodeCrgMapabilityAlign100mer.bigWig ##download convert program wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph ##convert ./bigWigToBedGraph wgEncodeCrgMapabilityAlign100mer.bigWig chrX -chrom=chrX ./bigWigToBedGraph wgEncodeCrgMapabilityAlign100mer.bigWig chr21 -chrom=chr21 ##only keep unique regions and discard the chr* column grep -P "\t1$" chr21 |cut -f2-3 |gzip -c >chr21.unique.gz grep -P 
"\t1$" chrX |cut -f2-3 |gzip -c >chrX.unique.gz ================================================ FILE: assets/angsd_resources/getALL.txt ================================================ F="ASW CEU CHB CHD GIH JPT LWK MEX MKK TSI YRI" for f in $F do echo $f wget http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/2010-08_phaseII+III/allele_freqs_chrX_${f}_r28_nr.b36_fwd.txt.gz done cat allele*.gz >allele_freqs_chrX_ALL_r28_nr.b36_fwd.txt.gz gunzip -c allele_freqs_chrX_ALL_r28_nr.b36_fwd.txt.gz| awk '{print $2" "$3-1" "$3" "$11" "$12" "$4" "$14}'|grep -v pos >allele.txt /opt/liftover/liftOver allele.txt /opt/liftover/hg18ToHg19.over.chain.gz hit nohit cut -f1,3 --complement hit |grep -v -P "\t1.0"|grep -v -P "\t0\t"|gzip -c >HapMapALL.gz ================================================ FILE: assets/email_template.html ================================================ nf-core/eager Pipeline Report

nf-core/eager v${version}

Run Name: $runName

<% if (!success){ out << """

nf-core/eager execution completed unsuccessfully!

The exit status of the task that caused the workflow execution to fail was: $exitStatus.

The full error message was:

${errorReport}
""" } else { out << """
nf-core/eager execution completed successfully!
""" } %>

The workflow was completed at $dateComplete (duration: $duration)

The command used to launch the workflow was as follows:

$commandLine

Pipeline Configuration:

<% out << summary.collect{ k,v -> "" }.join("\n") %>
$k
$v

nf-core/eager

https://github.com/nf-core/eager

================================================ FILE: assets/email_template.txt ================================================ ---------------------------------------------------- ,--./,-. ___ __ __ __ ___ /,-._.--~\\ |\\ | |__ __ / ` / \\ |__) |__ } { | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-, `._,._,' nf-core/eager v${version} ---------------------------------------------------- Run Name: $runName <% if (success){ out << "## nf-core/eager execution completed successfully! ##" } else { out << """#################################################### ## nf-core/eager execution completed unsuccessfully! ## #################################################### The exit status of the task that caused the workflow execution to fail was: $exitStatus. The full error message was: ${errorReport} """ } %> The workflow was completed at $dateComplete (duration: $duration) The command used to launch the workflow was as follows: $commandLine Pipeline Configuration: ----------------------- <% out << summary.collect{ k,v -> " - $k: $v" }.join("\n") %> -- nf-core/eager https://github.com/nf-core/eager ================================================ FILE: assets/multiqc_config.yaml ================================================ custom_logo: "nf-core_eager_logo_outline_drop.png" custom_logo_url: https://github.com/nf-core/eager/ custom_logo_title: "nf-core/eager" report_comment: > This report has been generated by the nf-core/eager analysis pipeline. For information about how to interpret these results, please see the documentation. 
run_modules: - adapterRemoval - bowtie2 - custom_content - damageprofiler - dedup - fastp - fastqc - gatk - kraken - malt - mapdamage - mtnucratio - multivcfanalyzer - picard - preseq - qualimap - samtools - sexdeterrmine - hops - bcftools extra_fn_clean_exts: - "_fastp" - ".pe.settings" - ".se.settings" - ".settings" - ".pe.combined" - ".se.truncated" - ".mapped" - ".mapped_rmdup" - ".mapped_rmdup_stats" - "_libmerged_rg_rmdup" - "_libmerged_rg_rmdup_stats" - "_postfilterflagstat.stats" - "_flagstat.stat" - ".filtered" - ".filtered_rmdup" - ".filtered_rmdup_stats" - "_libmerged_rg_add" - "_libmerged_rg_add_stats" - "_rmdup" - ".unmapped" - ".fastq.gz" - ".fastq" - ".fq.gz" - ".fq" - ".bam" - ".kreport" - ".unifiedgenotyper" - ".trimmed_stats" - "_libmerged" - "_bt2" - type: "regex" pattern: "_udg(half|none|full)" top_modules: - "fastqc": name: "FastQC (pre-Trimming)" path_filters: - "*_raw_fastqc.zip" - "fastp" - "adapterRemoval" - "fastqc": name: "FastQC (post-Trimming)" path_filters: - "*.truncated_fastqc.zip" - "*.combined*_fastqc.zip" - "*_postartrimmed_fastqc.zip" - "bowtie2": path_filters: - "*_bt2.log" - "malt" - "hops" - "kraken" - "samtools": name: "Samtools Flagstat (pre-samtools filter)" path_filters: - "*_flagstat.stats" - "samtools": name: "Samtools Flagstat (post-samtools filter)" path_filters: - "*_postfilterflagstat.stats" - "dedup" - "picard" - "preseq": path_filters: - "*.preseq" - "damageprofiler" - "mapdamage" - "mtnucratio" - "qualimap" - "sexdeterrmine" - "bcftools" - "multivcfanalyzer": path_filters: - "*MultiVCFAnalyzer.json" qualimap_config: general_stats_coverage: - 1 - 2 - 3 - 4 - 5 remove_sections: - sexdeterrmine-snps table_columns_visible: FastQC (pre-Trimming): percent_duplicates: False percent_gc: True avg_sequence_length: True fastp: pct_duplication: False after_filtering_gc_content: True pct_surviving: False Adapter Removal: aligned_total: False percent_aligned: True FastQC (post-Trimming): avg_sequence_length: True 
percent_duplicates: False total_sequences: True percent_gc: True bowtie2: overall_alignment_rate: True MALT: Taxonomic assignment success: False Assig. Taxonomy: False Mappability: True Total reads: False Num. of queries: False Kraken: "% Unclassified": True "% Top 5": False Samtools Flagstat (pre-samtools filter): flagstat_total: True mapped_passed: True Samtools Flagstat (post-samtools filter): mapped_passed: True DeDup: dup_rate: False clusterfactor: True mapped_after_dedup: True Picard: PERCENT_DUPLICATION: True DamageProfiler: 5 Prime1: True 5 Prime2: True 3 Prime1: False 3 Prime2: False mean_readlength: True median: True mapDamage: 5 Prime1: True 5 Prime2: True 3 Prime1: False 3 Prime2: False mtnucratio: mt_nuc_ratio: True QualiMap: mapped_reads: True mean_coverage: True 1_x_pc: True 5_x_pc: True percentage_aligned: False median_insert_size: False MultiVCFAnalyzer: Heterozygous SNP alleles (percent): True endorSpy: endogenous_dna: True endogenous_dna_post: True nuclear_contamination: Num_SNPs: True Method1_MOM_estimate: False Method1_MOM_SE: False Method1_ML_estimate: True Method1_ML_SE: True Method2_MOM_estimate: False Method2_MOM_SE: False Method2_ML_estimate: False Method2_ML_SE: False snp_coverage: Covered_Snps: True Total_Snps: False table_columns_placement: FastQC (pre-Trimming): total_sequences: 100 avg_sequence_length: 110 percent_gc: 120 fastp: after_filtering_gc_content: 200 Adapter Removal: percent_aligned: 300 FastQC (post-Trimming): total_sequences: 400 avg_sequence_length: 410 percent_gc: 420 Bowtie 2 / HiSAT2: overall_alignment_rate: 450 MALT: Num. of queries: 430 Total reads: 440 Mappability: 450 Assig. 
Taxonomy: 460 Taxonomic assignment success: 470 Kraken: "% Unclassified": 480 Samtools Flagstat (pre-samtools filter): flagstat_total: 551 mapped_passed: 552 Samtools Flagstat (post-samtools filter): flagstat_total: 600 mapped_passed: 620 endorSpy: endogenous_dna: 610 endogenous_dna_post: 640 nuclear_contamination: Num_SNPs: 1100 Method1_MOM_estimate: 1110 Method1_MOM_SE: 1120 Method1_ML_estimate: 1130 Method1_ML_SE: 1140 Method2_MOM_estimate: 1150 Method2_MOM_SE: 1160 Method2_ML_estimate: 1170 Method2_ML_SE: 1180 snp_coverage: Covered_Snps: 1050 Total_Snps: 1060 DeDup: mapped_after_dedup: 620 clusterfactor: 630 Picard: PERCENT_DUPLICATION: 650 DamageProfiler: 5 Prime1: 700 5 Prime2: 710 3 Prime1: 720 3 Prime2: 730 mean_readlength: 740 median: 750 mapDamage: 5 Prime1: 760 5 Prime2: 765 3 Prime1: 770 3 Prime2: 775 mtnucratio: mtreads: 780 mt_cov_avg: 785 mt_nuc_ratio: 790 QualiMap: mapped_reads: 800 mean_coverage: 805 median_coverage: 810 1_x_pc: 820 2_x_pc: 830 3_x_pc: 840 4_x_pc: 850 5_x_pc: 860 avg_gc: 870 sexdeterrmine: RateX: 1000 RateY: 1010 MultiVCFAnalyzer: Heterozygous SNP alleles (percent): 1200 read_count_multiplier: 1 read_count_prefix: "" read_count_desc: "" ancient_read_count_prefix: "" ancient_read_count_desc: "" ancient_read_count_multiplier: 1 decimalPoint_format: "." thousandsSep_format: "," report_section_order: software_versions: order: -1000 nf-core-eager-summary: order: -1001 export_plots: true table_columns_name: FastQC (pre-Trimming): total_sequences: "Nr. Input Reads" avg_sequence_length: "Length Input Reads" percent_gc: "% GC Input Reads" percent_duplicates: "% Dups Input Reads" percent_fails: "% Failed Input Reads" FastQC (post-Trimming): total_sequences: "Nr. Processed Reads" avg_sequence_length: "Length Processed Reads" percent_gc: "% GC Processed Reads" percent_duplicates: "% Dups Processed Reads" percent_fails: "%Failed Processed Reads" Samtools Flagstat (pre-samtools filter): flagstat_total: "Nr. 
Reads Into Mapping" mapped_passed: "Nr. Mapped Reads" Samtools Flagstat (post-samtools filter): flagstat_total: "Nr. Mapped Reads Post-Filter" mapped_passed: "Nr. Mapped Reads Passed Post-Filter" Endogenous DNA Post (%): endogenous_dna_post (%): "Endogenous DNA Post-Filter (%)" Picard: PERCENT_DUPLICATION: "% Dup. Mapped Reads" DamageProfiler: mean_readlength: "Mean Length Mapped Reads" median_readlength: "Median Length Mapped Reads" QualiMap: mapped_reads: "Nr. Dedup. Mapped Reads" total_reads: "Nr. Dedup. Total Reads" avg_gc: "% GC Dedup. Mapped Reads" Bcftools Stats: number_of_records: "Nr. Overall Variants" number_of_SNPs: "Nr. SNPs" number_of_indels: "Nr. InDels" MALT: Mappability: "% Metagenomic Mappability" SexDetErrmine: RateErrX: "SexDet Err X Chr" RateErrY: "SexDet Err Y Chr" RateX: "SexDet Rate X Chr" RateY: "SexDet Rate Y Chr" custom_table_header_config: general_stats_table: median_coverage: format: "{:,.3f}" mean_coverage: format: "{:,.3f}" ================================================ FILE: assets/nf-core_eager_dummy.txt ================================================ This is a dummy file for when we need a 'fake' file to satisfy all nextflow channel inputs being filled, even if we actually only use one. ================================================ FILE: assets/nf-core_eager_dummy2.txt ================================================ This is a second dummy file for when we need a 'fake' file to satisfy all nextflow channel inputs being filled, even if we actually only use one. 
================================================ FILE: assets/sendmail_template.txt ================================================ To: $email Subject: $subject Mime-Version: 1.0 Content-Type: multipart/related;boundary="nfcoremimeboundary" --nfcoremimeboundary Content-Type: text/html; charset=utf-8 $email_html --nfcoremimeboundary Content-Type: image/png;name="nf-core-eager_logo.png" Content-Transfer-Encoding: base64 Content-ID: Content-Disposition: inline; filename="nf-core-eager_logo.png" <% out << new File("$projectDir/assets/nf-core-eager_logo.png"). bytes. encodeBase64(). toString(). tokenize( '\n' )*. toList()*. collate( 76 )*. collect { it.join() }. flatten(). join( '\n' ) %> <% if (mqcFile){ def mqcFileObj = new File("$mqcFile") if (mqcFileObj.length() < mqcMaxSize){ out << """ --nfcoremimeboundary Content-Type: text/html; name=\"multiqc_report\" Content-Transfer-Encoding: base64 Content-ID: Content-Disposition: attachment; filename=\"${mqcFileObj.getName()}\" ${mqcFileObj. bytes. encodeBase64(). toString(). tokenize( '\n' )*. toList()*. collate( 76 )*. collect { it.join() }. flatten(). join( '\n' )} """ }} %> --nfcoremimeboundary-- ================================================ FILE: assets/where_are_my_files.txt ================================================ ===================== Where are my files? ===================== By default, the nfcore/eager pipeline does not save large intermediate files to the results directory. This is to try to conserve disk space. These files can be found in the pipeline `work` directory if needed. Alternatively, re-run the pipeline using `-resume` in addition to one of the below command-line options and they will be copied into the results directory: `--saveReference` Save any downloaded or generated reference genome files to your results folder. These can then be used for future pipeline runs, reducing processing times. 
----------------------------------- Setting defaults in a config file ----------------------------------- If you would always like these files to be saved without having to specify this on the command line, you can save the following to your personal configuration file (e.g. `~/.nextflow/config`): params.saveReference = true For more help, see the following documentation: https://github.com/nf-core/eager/blob/master/docs/usage.md https://www.nextflow.io/docs/latest/getstarted.html https://www.nextflow.io/docs/latest/config.html ================================================ FILE: bin/endorS.py ================================================ #!/usr/bin/env python3 # Written by Aida Andrades Valtueña and released under MIT license. # See git repository (https://github.com/aidaanva/endorS.py) for full license text. """Script to calculate the endogenous DNA in a sample from samtools flag stats. It can accept up to two files: pre-quality and post-quality filtering. We recommend using both files, but you can also use only the pre-quality filtering file. """ import re import sys import json import argparse import textwrap parser = argparse.ArgumentParser(prog='endorS.py', usage='python %(prog)s [-h] [--version] .stats [.stats]', formatter_class=argparse.RawDescriptionHelpFormatter, description=textwrap.dedent('''\ author: Aida Andrades Valtueña (aida.andrades[at]gmail.com) description: %(prog)s calculates endogenous DNA from samtools flagstat files and prints it to screen. Use --output flag to write results to a file ''')) parser.add_argument('samtoolsfiles', metavar='.stats', type=str, nargs='+', help='output of samtools flagstat in a txt file (at least one required). If two files are supplied, the mapped reads of the second file are divided by the total reads in the first, since it assumes that they are related to the same sample.
Useful after BAM filtering') parser.add_argument('-v','--version', action='version', version='%(prog)s 0.4') parser.add_argument('--output', '-o', nargs='?', help='specify a file format for an output file. Options: for a MultiQC json output. Default: none') parser.add_argument('--name', '-n', nargs='?', help='specify name for the output file. Default: extracted from the first samtools flagstat file provided') args = parser.parse_args() #Open the samtools flag stats pre-quality filtering: try: with open(args.samtoolsfiles[0], 'r') as pre: contentsPre = pre.read() #Extract number of total reads totalReads = float((re.findall(r'^([0-9]+) \+ [0-9]+ in total',contentsPre))[0]) #Extract number of mapped reads pre-quality filtering: mappedPre = float((re.findall(r'([0-9]+) \+ [0-9]+ mapped ',contentsPre))[0]) #Calculation of endogenous DNA pre-quality filtering: if totalReads == 0.0: endogenousPre = 0.000000 print("WARNING: no reads in the fastq input, Endogenous DNA raw (%) set to 0.000000") elif mappedPre == 0.0: endogenousPre = 0.000000 print("WARNING: no mapped reads, Endogenous DNA raw (%) set to 0.000000") else: endogenousPre = float("{0:.6f}".format(round((mappedPre / totalReads * 100), 6))) except: print("Incorrect input, please provide at least a samtools flag stats as input\nRun:\npython endorS.py --help \nfor more information on how to run this script") sys.exit() #Check if the samtools stats post-quality filtering have been provided: try: #Open the samtools flag stats post-quality filtering: with open(args.samtoolsfiles[1], 'r') as post: contentsPost = post.read() #Extract number of mapped reads post-quality filtering: mappedPost = float((re.findall(r'([0-9]+) \+ [0-9]+ mapped',contentsPost))[0]) #Calculation of endogenous DNA post-quality filtering: if totalReads == 0.0: endogenousPost = 0.000000 print("WARNING: no reads in the fastq input, Endogenous DNA modified (%) set to 0.000000") elif mappedPost == 0.0: endogenousPost = 0.000000 print("WARNING: no 
mapped reads, Endogenous DNA modified (%) set to 0.000000") else: endogenousPost = float("{0:.6f}".format(round((mappedPost / totalReads * 100),6))) except: print("Only one samtools flagstat file provided") #Set the number of reads post-quality filtering to "NA" if the second #samtools flagstat file is not provided: mappedPost = "NA" #Setting the name depending on the --name flag: if args.name is not None: name = args.name else: #Set up the name based on the first samtools flagstats: name= str(((args.samtoolsfiles[0].rsplit(".",1)[0]).rsplit("/"))[-1]) #print(name) if mappedPost == "NA": #Creating the json file jsonOutput={ "id": "endorSpy", "plot_type": "generalstats", "pconfig": { "endogenous_dna": { "max": 100, "min": 0, "title": "Endogenous DNA (%)", "format": '{:,.2f}'} }, "data": { name : { "endogenous_dna": endogenousPre} } } else: #Creating the json file jsonOutput={ "id": "endorSpy", "plot_type": "generalstats", "pconfig": { "endogenous_dna": { "max": 100, "min": 0, "title": "Endogenous DNA (%)", "format": '{:,.2f}'}, "endogenous_dna_post": { "max": 100, "min": 0, "title": "Endogenous DNA Post (%)", "format": '{:,.2f}'} }, "data": { name : { "endogenous_dna": endogenousPre, "endogenous_dna_post": endogenousPost} }, } #Checking whether the --output argument was given: if args.output is not None: #Creating a file named after the name variable: #Writing the json output: fileName = name + "_endogenous_dna_mqc.json" #print(fileName) with open(fileName, "w+") as outfile: json.dump(jsonOutput, outfile) print(fileName,"has been generated") else: if mappedPost == "NA": print("Endogenous DNA (%):",endogenousPre) else: print("Endogenous DNA raw (%):",endogenousPre) print("Endogenous DNA modified (%):",endogenousPost) ================================================ FILE: bin/extract_map_reads.py ================================================ #!/usr/bin/env python3 # Written by Maxime Borry and released under the MIT license.
# See git repository (https://github.com/nf-core/eager) for full license text. import argparse import pysam from xopen import xopen import logging import os from pathlib import Path def _get_args(): """This function parses and returns arguments passed in""" parser = argparse.ArgumentParser( prog="extract_mapped_reads", formatter_class=argparse.RawDescriptionHelpFormatter, description="Remove reads mapped in bam file from fastq files", ) parser.add_argument("bam_file", help="path to bam file") parser.add_argument("fwd", help="path to forward fastq file") parser.add_argument( "-merged", dest="merged", default=False, action="store_true", help="specify if bam file was created from merged fastq files", ) parser.add_argument( "-rev", dest="rev", default=None, help="path to reverse fastq file" ) parser.add_argument( "-of", dest="out_fwd", default=None, help="path to forward output fastq file" ) parser.add_argument( "-or", dest="out_rev", default=None, help="path to reverse output fastq file" ) parser.add_argument( "-m", dest="mode", default="remove", help="Read removal mode: remove reads (remove) or replace sequence by N (replace).
Default = remove", ) parser.add_argument( "-t", dest="threads", default=4, help="Number of parallel threads" ) args = parser.parse_args() bam = args.bam_file in_fwd = args.fwd merged = args.merged in_rev = args.rev out_fwd = args.out_fwd out_rev = args.out_rev mode = args.mode threads = int(args.threads) return (bam, in_fwd, merged, in_rev, out_fwd, out_rev, mode, threads) def extract_mapped(bamfile, merged): """Get the names of mapped reads Args: bamfile(str): path to bam alignment file merged(bool): True if bam file was created from merged fastq files Returns: mapped_reads(set): set of mapped read names (str) """ if bamfile.endswith(".bam") or bamfile.endswith(".gz"): read_mode = "rb" else: read_mode = "r" mapped_reads = set() bamfile = pysam.AlignmentFile(bamfile, mode=read_mode) for read in bamfile.fetch(): if not read.is_unmapped: if merged: if read.query_name.startswith("M_"): mapped_reads.add(read.query_name[2:]) elif read.query_name.startswith("MT_"): mapped_reads.add(read.query_name[3:]) else: mapped_reads.add(read.query_name) else: mapped_reads.add(read.query_name) return mapped_reads def read_write_fq(fq_in, fq_out, mapped_reads, mode, write_mode, proc): """ Read and write fastq file with mapped reads removed Args: fq_in(str): path to input fastq file fq_out(str): path to output fastq file mapped_reads(set): set of mapped reads name (str) mode(str): read removal mode (remove or replace) write_mode(str): write mode (w or wb) proc(int): number of parallel processes """ if write_mode == "w": cm = open(fq_out, write_mode) elif write_mode == "wb": cm = xopen(fq_out, mode=write_mode, threads=proc) with pysam.FastxFile(fq_in) as fh: with cm as fh_out: for read in fh: try: if read.name in mapped_reads: if mode == "replace": read.sequence = "N" * len(read.sequence) read = str(read) + "\n" if write_mode == "w": fh_out.write(read) elif write_mode == "wb": fh_out.write(read.encode()) else: read = str(read) + "\n"
if write_mode == "w": fh_out.write(read) elif write_mode == "wb": fh_out.write(read.encode()) except Exception as e: logging.error(f"Problem with {str(read)}") logging.error(e) def check_remove_mode(mode): if mode.lower() not in ["replace", "remove"]: logging.info(f"Mode must be {' or '.join(['replace', 'remove'])}") return mode.lower() if __name__ == "__main__": BAM, IN_FWD, MERGED, IN_REV, OUT_FWD, OUT_REV, MODE, PROC = _get_args() logging.basicConfig(level=logging.INFO, format="%(message)s") if OUT_FWD is None: out_fwd = os.path.join(os.getcwd(), Path(IN_FWD).stem + ".r1.fq.gz") else: out_fwd = OUT_FWD if out_fwd.endswith(".gz"): write_mode = "wb" else: write_mode = "w" remove_mode = check_remove_mode(MODE) # FORWARD OR SE FILE logging.info(f"- Extracting mapped reads from {BAM}") mapped_reads = extract_mapped(BAM, merged=MERGED) logging.info(f"- Checking forward fq file {IN_FWD}") read_write_fq( fq_in=IN_FWD, fq_out=out_fwd, mapped_reads=mapped_reads, mode=remove_mode, write_mode=write_mode, proc=PROC, ) logging.info(f"- Cleaned forward FastQ file written to {out_fwd}") # REVERSE FILE if IN_REV: if OUT_REV is None: out_rev = os.path.join(os.getcwd(), Path(IN_REV).stem + ".r2.fq.gz") else: out_rev = OUT_REV logging.info(f"- Checking reverse fq file {IN_REV}") read_write_fq( fq_in=IN_REV, fq_out=out_rev, mapped_reads=mapped_reads, mode=remove_mode, write_mode=write_mode, proc=PROC, ) logging.info(f"- Cleaned reverse FastQ file written to {out_rev}") ================================================ FILE: bin/filter_bam_fragment_length.py ================================================ #!/usr/bin/env python3 # Written by Maxime Borry and released under the MIT license. # See git repository (https://github.com/nf-core/eager) for full license text.
import argparse import pysam def get_args(): """This function parses and returns arguments passed in""" parser = argparse.ArgumentParser( prog="bam_filter", description="Filter bam on fragment length" ) parser.add_argument("bam", help="Bam alignment file") parser.add_argument( "-l", dest="fraglen", default=35, type=int, help="Minimum fragment length. Default = 35", ) parser.add_argument( "-a", dest="all", default=False, action="store_true", help="Include all reads, even unmapped", ) parser.add_argument( "-o", dest="output", default=None, help="Output bam basename. Default = {bam_basename}.filtered.bam", ) args = parser.parse_args() bam = args.bam fraglen = args.fraglen allreads = args.all outfile = args.output return (bam, fraglen, allreads, outfile) def getBasename(file_name): if ("/") in file_name: basename = file_name.split("/")[-1].split(".")[0] else: basename = file_name.split(".")[0] return basename def filter_bam(infile, outfile, fraglen, allreads): """Write bam to file Args: infile (str): Path to input bam outfile (str): Basename of output bam fraglen(int): Minimum fragment length to keep allreads(bool): Apply on all reads, not only mapped """ bamfile = pysam.AlignmentFile(infile, "rb") bamwrite = pysam.AlignmentFile(outfile + ".filtered.bam", "wb", template=bamfile) for read in bamfile.fetch(until_eof=True): if allreads: if read.query_length >= fraglen: bamwrite.write(read) else: if not read.is_unmapped and read.query_length >= fraglen: bamwrite.write(read) if __name__ == "__main__": BAM, FRAGLEN, ALLREADS, OUTFILE = get_args() BAMFILE = pysam.AlignmentFile(BAM, "rb") if OUTFILE is None: OUTFILE = getBasename(BAM) filter_bam(BAM, OUTFILE, FRAGLEN, ALLREADS) ================================================ FILE: bin/kraken_parse.py ================================================ #!/usr/bin/env python # Written by Maxime Borry and released under the MIT license. # See git repository (https://github.com/nf-core/eager) for full license text.
import argparse import csv def _get_args(): '''This function parses and return arguments passed in''' parser = argparse.ArgumentParser( prog='kraken_parse', formatter_class=argparse.RawDescriptionHelpFormatter, description='Parsing kraken') parser.add_argument('krakenReport', help="path to kraken report file") parser.add_argument( '-c', dest="count", default=50, help="Minimum number of hits on clade to report it. Default = 50") parser.add_argument( '-or', dest="readout", default=None, help="Read count output file. Default = .read_kraken_parsed.csv") parser.add_argument( '-ok', dest="kmerout", default=None, help="Kmer Output file. Default = .kmer_kraken_parsed.csv") args = parser.parse_args() infile = args.krakenReport countlim = int(args.count) readout = args.readout kmerout = args.kmerout return(infile, countlim, readout, kmerout) def _get_basename(file_name): if ("/") in file_name: basename = file_name.split("/")[-1].split(".")[0] else: basename = file_name.split(".")[0] return(basename) def parse_kraken(infile, countlim): ''' INPUT: infile (str): path to kraken report file countlim (int): lowest count threshold to report hit OUTPUT: resdict (dict): key=taxid, value=readCount ''' with open(infile, 'r') as f: read_dict = {} kmer_dict = {} csvreader = csv.reader(f, delimiter='\t') for line in csvreader: reads = int(line[1]) if reads >= countlim: taxid = line[6] kmer = line[3] unique_kmer = line[4] try: kmer_duplicity = float(kmer)/float(unique_kmer) except ZeroDivisionError: kmer_duplicity = 0 read_dict[taxid] = reads kmer_dict[taxid] = kmer_duplicity return(read_dict, kmer_dict) def write_output(resdict, infile, outfile): with open(outfile, 'w') as f: basename = _get_basename(infile) f.write(f"TAXID,{basename}\n") for akey in resdict.keys(): f.write(f"{akey},{resdict[akey]}\n") if __name__ == '__main__': INFILE, COUNTLIM, readout, kmerout = _get_args() if not readout: read_outfile = _get_basename(INFILE)+".read_kraken_parsed.csv" else: read_outfile = readout if 
not kmerout: kmer_outfile = _get_basename(INFILE)+".kmer_kraken_parsed.csv" else: kmer_outfile = kmerout read_dict, kmer_dict = parse_kraken(infile=INFILE, countlim=COUNTLIM) write_output(resdict=read_dict, infile=INFILE, outfile=read_outfile) write_output(resdict=kmer_dict, infile=INFILE, outfile=kmer_outfile) ================================================ FILE: bin/markdown_to_html.py ================================================ #!/usr/bin/env python from __future__ import print_function import argparse import markdown import os import sys import io def convert_markdown(in_fn): input_md = io.open(in_fn, mode="r", encoding="utf-8").read() html = markdown.markdown( "[TOC]\n" + input_md, extensions=["pymdownx.extra", "pymdownx.b64", "pymdownx.highlight", "pymdownx.emoji", "pymdownx.tilde", "toc"], extension_configs={ "pymdownx.b64": {"base_path": os.path.dirname(in_fn)}, "pymdownx.highlight": {"noclasses": True}, "toc": {"title": "Table of Contents"}, }, ) return html def wrap_html(contents): header = """
""" footer = """
""" return header + contents + footer def parse_args(args=None): parser = argparse.ArgumentParser() parser.add_argument("mdfile", type=argparse.FileType("r"), nargs="?", help="File to convert. Defaults to stdin.") parser.add_argument( "-o", "--out", type=argparse.FileType("w"), default=sys.stdout, help="Output file name. Defaults to stdout." ) return parser.parse_args(args) def main(args=None): args = parse_args(args) converted_md = convert_markdown(args.mdfile.name) html = wrap_html(converted_md) args.out.write(html) if __name__ == "__main__": sys.exit(main()) ================================================ FILE: bin/merge_kraken_res.py ================================================ #!/usr/bin/env python # Written by Maxime Borry and released under the MIT license. # See git repository (https://github.com/nf-core/eager) for full license text. import argparse import os import pandas as pd import numpy as np def _get_args(): '''This function parses and return arguments passed in''' parser = argparse.ArgumentParser( prog='merge_kraken_res', formatter_class=argparse.RawDescriptionHelpFormatter, description='Merging csv count files in one table') parser.add_argument( '-or', dest="readout", default="kraken_read_count_table.csv", help="Read count output file. Default = kraken_read_count_table.csv") parser.add_argument( '-ok', dest="kmerout", default="kraken_kmer_unicity_table.csv", help="Kmer unicity output file. 
Default = kraken_kmer_unicity_table.csv") args = parser.parse_args() readout = args.readout kmerout = args.kmerout return(readout, kmerout) def get_csv(): tmp = [i for i in os.listdir() if ".csv" in i] kmer = [i for i in tmp if '.kmer_' in i] read = [i for i in tmp if '.read_' in i] return(read, kmer) def _get_basename(file_name): if ("/") in file_name: basename = file_name.split("/")[-1].split(".")[0] else: basename = file_name.split(".")[0] return(basename) def merge_csv(all_csv): df = pd.read_csv(all_csv[0], index_col=0) for i in range(1, len(all_csv)): df_tmp = pd.read_csv(all_csv[i], index_col=0) df = pd.merge(left=df, right=df_tmp, on='TAXID', how='outer') df.fillna(0, inplace=True) return(df) def write_csv(pd_dataframe, outfile): pd_dataframe.to_csv(outfile) if __name__ == "__main__": READOUT, KMEROUT = _get_args() reads, kmers = get_csv() read_df = merge_csv(reads) kmer_df = merge_csv(kmers) write_csv(read_df, READOUT) write_csv(kmer_df, KMEROUT) ================================================ FILE: bin/parse_snp_cov.py ================================================ #!/usr/bin/env python3 # Written by Thiseas C. Lamnidis and released under the MIT license. # See git repository (https://github.com/nf-core/eager) for full license text. 
import sys, json from collections import OrderedDict jsonOut = OrderedDict() data = OrderedDict() in_path = sys.argv[1] with open(in_path, 'r') as infile: for line in infile: fields = line.strip().split() sample_id = fields[0] covered_snps = fields[1] total_snps = fields[2] if sample_id[0] == "#": continue data[sample_id] = {"Covered_Snps":covered_snps, "Total_Snps":total_snps} jsonOut = {"plot_type": "generalstats", "id": "snp_coverage", "pconfig": { "Covered_Snps" : {"title" : "#SNPs Covered"}, "Total_Snps" : {"title": "#SNPs Total"} }, "data" : data } out_prefix = in_path[:-len('.txt')] if in_path.endswith('.txt') else in_path with open(out_prefix+'_mqc.json', 'w') as outfile: json.dump(jsonOut, outfile) ================================================ FILE: bin/print_x_contamination.py ================================================ #!/usr/bin/env python3 # Written by Thiseas C. Lamnidis and released under the MIT license. # See git repository (https://github.com/nf-core/eager) for full license text. import sys, re, json from collections import OrderedDict jsonOut=OrderedDict() data=OrderedDict() ## Function to convert a set of elements into floating point numbers, when possible, else leave them be. def make_float(x): # print (x) output=[None for i in range(len(x))] ## If value for an estimate/error is -nan, replace with "N/A". JSON does not accept NaN as a valid field. for i in range(len(x)): if x[i] == "-nan" or x[i] == "nan": output[i]="N/A" continue try: output[i]=float(x[i]) except: output[i]=x[i] return(tuple(output)) Input_files=sys.argv[1:] output = open("nuclear_contamination.txt", 'w') print ("Individual", "Num_SNPs", "Method1_MOM_estimate", "Method1_MOM_SE", "Method1_ML_estimate", "Method1_ML_SE", "Method2_MOM_estimate", "Method2_MOM_SE", "Method2_ML_estimate", "Method2_ML_SE", sep="\t", file=output) for fn in Input_files: ## For each file, reset the values to "N/A" so they don't carry over from the last file.
mom1, err_mom1= "N/A","N/A" ml1, err_ml1="N/A","N/A" mom2, err_mom2= "N/A","N/A" ml2, err_ml2="N/A","N/A" nSNPs="0" with open(fn, 'r') as f: Estimates={} Ind=re.sub(r'\.X\.contamination\.out$', '', fn).split("/")[-1] for line in f: fields=line.strip().split() if line.strip()[0:19] == "We have nSNP sites:": nSNPs=fields[4].rstrip(",") elif line.strip()[0:7] == "Method1" and line.strip()[9:16] == 'new_llh': mom1=fields[3].split(":")[1] err_mom1=fields[4].split(":")[1] ml1=fields[5].split(":")[1] err_ml1=fields[6].split(":")[1] ## Sometimes angsd fails to run method 2, and the error is printed directly after the SE for ML. When that happens, exclude the first word in the error from the output. (Method 2 values in jsonOut will be shown as N/A) if err_ml1.endswith("contamination"): err_ml1 = err_ml1[:-13] elif line.strip()[0:7] == "Method2" and line.strip()[9:16] == 'new_llh': mom2=fields[3].split(":")[1] err_mom2=fields[4].split(":")[1] ml2=fields[5].split(":")[1] err_ml2=fields[6].split(":")[1] ## Convert estimates and errors to floating point numbers (ml1, err_ml1, mom1, err_mom1, ml2, err_ml2, mom2, err_mom2) = make_float((ml1, err_ml1, mom1, err_mom1, ml2, err_ml2, mom2, err_mom2)) data[Ind]={ "Num_SNPs" : int(nSNPs), "Method1_MOM_estimate" : mom1, "Method1_MOM_SE" : err_mom1, "Method1_ML_estimate" : ml1, "Method1_ML_SE" : err_ml1, "Method2_MOM_estimate" : mom2, "Method2_MOM_SE" : err_mom2, "Method2_ML_estimate" : ml2, "Method2_ML_SE" : err_ml2 } print (Ind, nSNPs, mom1, err_mom1, ml1, err_ml1, mom2, err_mom2, ml2, err_ml2, sep="\t", file=output) jsonOut = {"plot_type": "generalstats", "id": "nuclear_contamination", "pconfig": { "Num_SNPs" : {"title" : "Number of SNPs"}, "Method1_MOM_estimate" : {"title": "Contamination Estimate (Method1_MOM)"}, "Method1_MOM_SE" : {"title": "Estimate Error (Method1_MOM)"}, "Method1_ML_estimate" : {"title": "Contamination Estimate (Method1_ML)"}, "Method1_ML_SE" : {"title": "Estimate Error (Method1_ML)"}, "Method2_MOM_estimate" : {"title":
"Contamination Estimate (Method2_MOM)"}, "Method2_MOM_SE" : {"title": "Estimate Error (Method2_MOM)"}, "Method2_ML_estimate" : {"title": "Contamination Estimate (Method2_ML)"}, "Method2_ML_SE" : {"title": "Estimate Error (Method2_ML)"} }, "data" : data } with open('nuclear_contamination_mqc.json', 'w') as outfile: json.dump(jsonOut, outfile) ================================================ FILE: bin/scrape_software_versions.py ================================================ #!/usr/bin/env python from __future__ import print_function from collections import OrderedDict import re regexes = { "nf-core/eager": ["v_pipeline.txt", r"(\S+)"], "Nextflow": ["v_nextflow.txt", r"(\S+)"], "FastQC": ["v_fastqc.txt", r"FastQC v(\S+)"], "MultiQC": ["v_multiqc.txt", r"multiqc, version (\S+)"], 'AdapterRemoval':['v_adapterremoval.txt', r"AdapterRemoval ver. (\S+)"], 'Picard MarkDuplicates': ['v_markduplicates.txt', r"Version:(\S+)"], 'Samtools': ['v_samtools.txt', r"samtools (\S+)"], 'Preseq': ['v_preseq.txt', r"Version: (\S+)"], 'BWA': ['v_bwa.txt', r"Version: (\S+)"], 'Bowtie2': ['v_bowtie2.txt', r"bowtie2-([0-9]+\.[0-9]+\.[0-9]+) -fdebug"], 'Qualimap': ['v_qualimap.txt', r"QualiMap v.(\S+)"], 'GATK HaplotypeCaller': ['v_gatk.txt', r"The Genome Analysis Toolkit \(GATK\) v(\S+)"], 'GATK UnifiedGenotyper': ['v_gatk3.txt', r"(\S+)"], 'bamUtil' : ['v_bamutil.txt', r"Version: (\S+);"], 'fastP': ['v_fastp.txt', r"([\d\.]+)"], 'DamageProfiler' : ['v_damageprofiler.txt', r"DamageProfiler v(\S+)"], 'angsd':['v_angsd.txt',r"version: (\S+)"], 'bedtools':['v_bedtools.txt',r"bedtools v(\S+)"], 'circulargenerator':['v_circulargenerator.txt',r"CircularGeneratorv(\S+)"], 'DeDup':['v_dedup.txt',r"DeDup v(\S+)"], 'freebayes':['v_freebayes.txt',r"v([0-9]\S+)"], 'sequenceTools':['v_sequencetools.txt',r"(\S+)"], 'maltextract':['v_maltextract.txt', r"version(\S+)"], 'malt':['v_malt.txt',r"version (\S+)"], 'multivcfanalyzer':['v_multivcfanalyzer.txt', r"MultiVCFAnalyzer - (\S+)"], 
'pmdtools':['v_pmdtools.txt',r"pmdtools v(\S+)"], 'sexdeterrmine':['v_sexdeterrmine.txt',r"(\S+)"], 'MTNucRatioCalculator':['v_mtnucratiocalculator.txt',r"Version: (\S+)"], 'VCF2genome':['v_vcf2genome.txt', r"VCF2Genome \(v. ([0-9].[0-9]+) "], 'endorS.py':['v_endorSpy.txt', r"endorS.py (\S+)"], 'kraken':['v_kraken.txt', r"Kraken version (\S+)"], 'eigenstrat_snp_coverage':['v_eigenstrat_snp_coverage.txt',r"(\S+)"], 'mapDamage2':['v_mapdamage.txt',r"(\S+)"], 'bbduk':['v_bbduk.txt',r"(.*)"], 'bcftools':['v_bcftools.txt',r"(\S+)"] } results = OrderedDict() results["nf-core/eager"] = 'N/A' results["Nextflow"] = 'N/A' results["FastQC"] = 'N/A' results["MultiQC"] = 'N/A' results['AdapterRemoval'] = 'N/A' results['fastP'] = 'N/A' results['BWA'] = 'N/A' results['Bowtie2'] = 'N/A' results['circulargenerator'] = 'N/A' results['Samtools'] = 'N/A' results['endorS.py'] = 'N/A' results['DeDup'] = 'N/A' results['Picard MarkDuplicates'] = 'N/A' results['Qualimap'] = 'N/A' results['Preseq'] = 'N/A' results['GATK HaplotypeCaller'] = 'N/A' results['GATK UnifiedGenotyper'] = 'N/A' results['freebayes'] = 'N/A' results['sequenceTools'] = 'N/A' results['VCF2genome'] = 'N/A' results['MTNucRatioCalculator'] = 'N/A' results['bedtools'] = 'N/A' results['DamageProfiler'] = 'N/A' results['bamUtil'] = 'N/A' results['pmdtools'] = 'N/A' results['angsd'] = 'N/A' results['sexdeterrmine'] = 'N/A' results['multivcfanalyzer'] = 'N/A' results['malt'] = 'N/A' results['kraken'] = 'N/A' results['maltextract'] = 'N/A' results['eigenstrat_snp_coverage'] = 'N/A' results['mapDamage2'] = 'N/A' results['bbduk'] = 'N/A' results['bcftools'] = 'N/A' # Search each file using its regex for k, v in regexes.items(): try: with open(v[0]) as x: versions = x.read() match = re.search(v[1], versions) if match: results[k] = "v{}".format(match.group(1)) except IOError: results[k] = False # Remove software set to false in results for k in list(results): if not results[k]: del results[k] # Dump to YAML print( """ id: 
'software_versions' section_name: 'nf-core/eager Software Versions' section_href: 'https://github.com/nf-core/eager' plot_type: 'html' description: 'are collected at run time from the software output.' data: |
""" ) for k, v in results.items(): print("
{}
{}
".format(k, v)) print("
") # Write out software versions as a tab-separated file: with open("software_versions.csv", "w") as f: for k, v in results.items(): f.write("{}\t{}\n".format(k, v)) ================================================ FILE: conf/base.config ================================================ /* * ------------------------------------------------- * nf-core/eager Nextflow base config file * ------------------------------------------------- * A 'blank slate' config file, appropriate for general * use on most high performance compute environments. * Assumes that all software is installed and available * on the PATH. Runs in `local` mode - all jobs will be * run on the logged in environment. */ process { cpus = { check_max( 1 * task.attempt, 'cpus' ) } memory = { check_max( 7.GB * task.attempt, 'memory' ) } time = { check_max( 24.h * task.attempt, 'time' ) } errorStrategy = { task.exitStatus in [143,137,104,134,139, 140] ? 'retry' : 'finish' } maxRetries = 3 maxErrors = '-1' // Process-specific resource requirements // NOTE - Only one of the labels below is used in the fastqc process in the main script. // If possible, it would be nice to keep the same label naming convention when // adding in your processes.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors // Generic resource requirements - s(ingle)c(ore)/m(ulti)c(ore) withLabel:'sc_tiny'{ cpus = { check_max( 1, 'cpus' ) } memory = { check_max( 1.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withLabel:'sc_small'{ cpus = { check_max( 1, 'cpus' ) } memory = { check_max( 4.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withLabel:'sc_medium'{ cpus = { check_max( 1, 'cpus' ) } memory = { check_max( 8.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withLabel:'mc_small'{ cpus = { check_max( 2 * task.attempt, 'cpus' ) } memory = { check_max( 4.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withLabel:'mc_medium' { cpus = { check_max( 4 * task.attempt, 'cpus' ) } memory = { check_max( 8.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withLabel:'mc_large'{ cpus = { check_max( 8 * task.attempt, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withLabel:'mc_huge'{ cpus = { check_max( 32 * task.attempt, 'cpus' ) } memory = { check_max( 256.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } // Process-specific resource requirements (others leave at default, e.g. Fastqc) withName:get_software_versions { cache = false } withName:qualimap{ errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : task.exitStatus in [255] ? 'ignore' : 'finish' } } withName:preseq { errorStrategy = 'ignore' } withName:damageprofiler { errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : 'finish' } } // Add 1 retry for certain java tools as not enough heap space java errors gives exit code 1 withName: dedup { errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 
'retry' : 'finish' } } withName: markduplicates { errorStrategy = { task.exitStatus in [143,137, 140] ? 'retry' : 'finish' } } // Also retry on exit code 1, as a Java 'not enough heap space' error gives exit code 1 withName: malt { errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : 'finish' } } // other process specific exit statuses withName: nuclear_contamination { errorStrategy = { task.exitStatus in [143,137,104,134,139, 140] ? 'ignore' : 'retry' } } } params { // Defaults only, expecting to be overwritten max_memory = 128.GB max_cpus = 16 max_time = 240.h igenomes_base = 's3://ngi-igenomes/igenomes/' } ================================================ FILE: conf/benchmarking_human.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. Use as follows: * nextflow run nf-core/eager -profile test,docker (or singularity, or conda) */ params { config_profile_name = 'nf-core/eager benchmarking - human profile' config_profile_description = "A 'fullsized' benchmarking profile for deepish Human sequencing aDNA data" //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_human.tsv' // Genome reference fasta = 'https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz' run_bam_filtering = true bam_unmapped_type = 'discard' bam_mapping_quality_threshold = 30 dedupper = 'markduplicates' run_trim_bam = true bamutils_clip_double_stranded_none_udg_left = 1 bamutils_clip_double_stranded_none_udg_right = 1 // JAR will need to be downloaded first!
run_genotyping = true genotyping_tool = 'ug' genotyping_source = 'trimmed' gatk_call_conf = 20 run_sexdeterrmine = true sexdeterrmine_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_HG19.0based.bed.gz' run_nuclear_contamination = true contamination_chrom_name = 'chrX' run_mtnucratio = true } process { withName:'makeBWAIndex'{ time = { check_max( 4.h * task.attempt, 'time' ) } } withName:'adapter_removal'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 2.h * task.attempt, 'time' ) } } withName:'bwa'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withName:'markDup'{ cpus = { check_max( 16, 'cpus' ) } memory = { check_max( 64.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withName:'damageprofiler'{ cpus = 1 memory = { check_max( 8.GB * task.attempt, 'memory' ) } time = { check_max( 2.h * task.attempt, 'time' ) } } } ================================================ FILE: conf/benchmarking_vikingfish.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test,docker (or singularity, or conda) */ params { config_profile_name = 'nf-core/eager benchmarking - Viking Fish profile' config_profile_description = "A 'fullsized' benchmarking profile for deepish sequencing aDNA data" //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv' // Genome reference fasta = 's3://nf-core-awsmegatests/eager/ENA_Data_Fish/GCF_902167405.1_gadMor3.0_genomic.fna.gz' bwaalnn = 0.04 bwaalnl = 1024 run_bam_filtering = true bam_unmapped_type = 'discard' bam_mapping_quality_threshold = 25 run_genotyping = true genotyping_tool = 'hc' genotyping_source = 'raw' gatk_ploidy = 2 } process { withName:'adapter_removal'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 2.h * task.attempt, 'time' ) } } withName:'bwa'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 8.h * task.attempt, 'time' ) } } withName:'dedup'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withName:'genotyping_hc'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 8.h * task.attempt, 'time' ) } } } ================================================ FILE: conf/igenomes.config ================================================ /* * ------------------------------------------------- * Nextflow config file for iGenomes paths * ------------------------------------------------- * Defines reference genomes, using iGenome paths * Can be used by any config that customises the base * path using $params.igenomes_base / --igenomes_base */ params { // illumina iGenomes reference file paths genomes { 'GRCh37' { fasta =
"${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt" mito_name = "MT" macs_gsize = "2.7e9" blacklist = "${projectDir}/assets/blacklists/GRCh37-blacklist.bed" } 'GRCh38' { fasta = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed" mito_name = "chrM" macs_gsize = "2.7e9" blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed" } 'GRCm38' { fasta = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BismarkIndex/" gtf = 
"${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/README.txt" mito_name = "MT" macs_gsize = "1.87e9" blacklist = "${projectDir}/assets/blacklists/GRCm38-blacklist.bed" } 'TAIR10' { fasta = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/README.txt" mito_name = "Mt" } 'EB2' { fasta = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/README.txt" } 'UMD3.1' { fasta = 
"${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/README.txt" mito_name = "MT" } 'WBcel235' { fasta = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.bed" mito_name = "MtDNA" macs_gsize = "9e7" } 'CanFam3.1' { fasta = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BismarkIndex/" gtf = 
"${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/README.txt" mito_name = "MT" } 'GRCz10' { fasta = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.bed" mito_name = "MT" } 'BDGP6' { fasta = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.bed" mito_name = "M" macs_gsize = "1.2e8" } 'EquCab2' { fasta = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/Bowtie2Index/" star = 
"${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/README.txt" mito_name = "MT" } 'EB1' { fasta = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/README.txt" } 'Galgal4' { fasta = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.bed" mito_name = "MT" } 'Gm01' { fasta = 
"${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/README.txt" } 'Mmul_1' { fasta = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/README.txt" mito_name = "MT" } 'IRGSP-1.0' { fasta = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BismarkIndex/" gtf = 
"${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.bed" mito_name = "Mt" } 'CHIMP2.1.4' { fasta = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/README.txt" mito_name = "MT" } 'Rnor_6.0' { fasta = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.bed" mito_name = "MT" } 'R64-1-1' { fasta = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BWAIndex/genome.fa" bowtie2 = 
"${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed" mito_name = "MT" macs_gsize = "1.2e7" } 'EF2' { fasta = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/README.txt" mito_name = "MT" macs_gsize = "1.21e7" } 'Sbi1' { fasta = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.bed" readme = 
"${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/README.txt" } 'Sscrofa10.2' { fasta = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/README.txt" mito_name = "MT" } 'AGPv3' { fasta = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.bed" mito_name = "Mt" } 'hg38' { fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf" bed12 = 
"${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed" mito_name = "chrM" macs_gsize = "2.7e9" blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed" } 'hg19' { fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/README.txt" mito_name = "chrM" macs_gsize = "2.7e9" blacklist = "${projectDir}/assets/blacklists/hg19-blacklist.bed" } 'mm10' { fasta = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/README.txt" mito_name = "chrM" macs_gsize = "1.87e9" blacklist = "${projectDir}/assets/blacklists/mm10-blacklist.bed" } 'bosTau8' { fasta = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BWAIndex/genome.fa" bowtie2 = 
"${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.bed" mito_name = "chrM" } 'ce10' { fasta = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/README.txt" mito_name = "chrM" macs_gsize = "9e7" } 'canFam3' { fasta = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/README.txt" mito_name = "chrM" } 'danRer10' { fasta = 
"${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.bed" mito_name = "chrM" macs_gsize = "1.37e9" } 'dm6' { fasta = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.bed" mito_name = "chrM" macs_gsize = "1.2e8" } 'equCab2' { fasta = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.bed" readme = 
"${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/README.txt" mito_name = "chrM" } 'galGal4' { fasta = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/README.txt" mito_name = "chrM" } 'panTro4' { fasta = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/README.txt" mito_name = "chrM" } 'rn6' { fasta = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/STARIndex/" bismark = 
"${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.bed" mito_name = "chrM" } 'sacCer3' { fasta = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BismarkIndex/" readme = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Annotation/README.txt" mito_name = "chrM" macs_gsize = "1.2e7" } 'susScr3' { fasta = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/WholeGenomeFasta/genome.fa" bwa = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BWAIndex/genome.fa" bowtie2 = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/Bowtie2Index/" star = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/STARIndex/" bismark = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BismarkIndex/" gtf = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.gtf" bed12 = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.bed" readme = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/README.txt" mito_name = "chrM" } } } ================================================ FILE: conf/test.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test, docker (or singularity, or conda) */ includeConfig 'test_resources.config' params { config_profile_name = 'Test profile' config_profile_description = 'Minimal test dataset to check pipeline function' // Limit resources so that this can run on GitHub Actions max_cpus = 2 max_memory = 6.GB max_time = 48.h genome = false //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq.tsv' // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta' } ================================================ FILE: conf/test_direct.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test, */ includeConfig 'test_resources.config' params { config_profile_name = 'Test profile' config_profile_description = 'Minimal test dataset to check pipeline function' // Limit resources so that this can run on GitHub Actions max_cpus = 2 max_memory = 6.GB max_time = 48.h genome = false //Input data single_end = false // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta' // Ignore `--input` as otherwise the parameter validation will throw an error schema_ignore_params = 'genomes,input_paths,input' } ================================================ FILE: conf/test_full.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running full-size tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a full size pipeline test. 
Use as follows: * nextflow run nf-core/eager -profile test_full, */ params { config_profile_name = 'Full test profile for nf-core/eager' config_profile_description = 'Full test dataset to check nf-core/eager function' // Input data for full size test input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv' // Genome reference fasta = 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gadus_morhua/representative/GCF_902167405.1_gadMor3.0/GCF_902167405.1_gadMor3.0_genomic.fna.gz' bwaalnn = 0.04 bwaalnl = 1024 run_bam_filtering = true bam_unmapped_type = 'discard' bam_mapping_quality_threshold = 25 run_genotyping = true genotyping_tool = 'hc' genotyping_source = 'raw' gatk_ploidy = 2 } process { withName:'adapter_removal'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 2.h * task.attempt, 'time' ) } } withName:'bwa'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 8.h * task.attempt, 'time' ) } } withName:'dedup'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withName:'genotyping_hc'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 8.h * task.attempt, 'time' ) } } // Ignore `--input` as otherwise the parameter validation will throw an error schema_ignore_params = 'genomes,input_paths,input' } ================================================ FILE: conf/test_resources.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines the base computing resources used across all CI tests (primarily the * time limit) */ process { withLabel:'sc_tiny'{ cpus = { check_max( 
1, 'cpus' ) } memory = { check_max( 1.GB * task.attempt, 'memory' ) } time = { check_max( 10.m * task.attempt, 'time' ) } } withLabel:'sc_small'{ cpus = { check_max( 1, 'cpus' ) } memory = { check_max( 4.GB * task.attempt, 'memory' ) } time = { check_max( 10.m * task.attempt, 'time' ) } } withLabel:'sc_medium'{ cpus = { check_max( 1, 'cpus' ) } memory = { check_max( 8.GB * task.attempt, 'memory' ) } time = { check_max( 10.m * task.attempt, 'time' ) } } withLabel:'mc_small'{ cpus = { check_max( 2 * task.attempt, 'cpus' ) } memory = { check_max( 4.GB * task.attempt, 'memory' ) } time = { check_max( 10.m * task.attempt, 'time' ) } } withLabel:'mc_medium' { cpus = { check_max( 4 * task.attempt, 'cpus' ) } memory = { check_max( 8.GB * task.attempt, 'memory' ) } time = { check_max( 10.m * task.attempt, 'time' ) } } withLabel:'mc_large'{ cpus = { check_max( 8 * task.attempt, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 10.m * task.attempt, 'time' ) } } withLabel:'mc_huge'{ cpus = { check_max( 32 * task.attempt, 'cpus' ) } memory = { check_max( 256.GB * task.attempt, 'memory' ) } time = { check_max( 10.m * task.attempt, 'time' ) } } withName:'mapdamage_rescaling'{ time = { check_max( 20.m * task.attempt, 'time' ) } } } ================================================ FILE: conf/test_stresstest_human.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test, docker (or singularity, or conda) */ params { config_profile_name = 'nf-core/eager stresstess - human profile' config_profile_description = "A large-scale benchmarking profile AWS stress-testing of large sample number study" //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/human_stresstest.tsv' // Genome reference fasta = 'https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz' save_reference = true email = 'james@nf-co.re' run_mtnucratio = true mtnucratio_header = 'ChrM' run_bam_filtering = true bam_unmapped_type = 'discard' bam_mapping_quality_threshold = 30 dedupper = 'markduplicates' run_sexdeterrmine = true sexdeterrmine_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_HG19.0based.bed.gz' run_nuclear_contamination = true contamination_chrom_name = 'chrX' run_mtnucratio = true } process { errorStrategy = 'retry' maxRetries = 5 withName:'makeBWAIndex'{ time = { check_max( 48.h * task.attempt, 'time' ) } } withName:'adapter_removal'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 48.h * task.attempt, 'time' ) } } withName:'bwa'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 48.h * task.attempt, 'time' ) } } withName:'markduplicates'{ errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 
'retry' : 'finish' } cpus = { check_max( 16, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 48.h * task.attempt, 'time' ) } } withName:'damageprofiler'{ cpus = 1 memory = { check_max( 8.GB * task.attempt, 'memory' ) } time = { check_max( 48.h * task.attempt, 'time' ) } } } ================================================ FILE: conf/test_tsv_bam.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. Use as follows: * nextflow run nf-core/eager -profile test, docker (or singularity, or conda) */ includeConfig 'test_resources.config' params { config_profile_name = 'Test profile' config_profile_description = 'Minimal test dataset to check pipeline function' // Limit resources so that this can run on Travis max_cpus = 2 max_memory = 6.GB max_time = 48.h genome = false //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_bam.tsv' // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta' } ================================================ FILE: conf/test_tsv_complex.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test, docker (or singularity, or conda) */ includeConfig 'test_resources.config' params { config_profile_name = 'Test profile' config_profile_description = 'Minimal test dataset to check pipeline function' // Limit resources so that this can run on GitHub Actions max_cpus = 2 max_memory = 6.GB max_time = 48.h genome = false //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq_multilane_multilib.tsv' // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta' } ================================================ FILE: conf/test_tsv_fna.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test, docker (or singularity, or conda) */ includeConfig 'test_resources.config' params { config_profile_name = 'Test profile' config_profile_description = 'Minimal test dataset to check pipeline function' // Limit resources so that this can run on Travis max_cpus = 2 max_memory = 6.GB max_time = 48.h genome = false //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq.tsv' // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fna' } ================================================ FILE: conf/test_tsv_humanbam.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test, docker (or singularity, or conda) */ includeConfig 'test_resources.config' params { config_profile_name = 'Test profile' config_profile_description = 'Minimal test dataset to check pipeline function' // Limit resources so that this can run on Travis max_cpus = 2 max_memory = 6.GB max_time = 48.h genome = false //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Human/human_design_bam.tsv' // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta' sexdeterrmine_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz' // Genotyping pileupcaller_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz' pileupcaller_snpfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K_covered_in_JK2067_downsampled_s0.1.numeric_chromosomes.snp' } ================================================ FILE: conf/test_tsv_kraken.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test, docker (or singularity, or conda) */ includeConfig 'test_resources.config' params { config_profile_name = 'Test profile kraken' config_profile_description = 'Minimal test dataset to check pipeline function with kraken metagenomic profiler' // Limit resources so that this can run on Travis max_cpus = 2 max_memory = 6.GB max_time = 48.h genome = false //Input data metagenomic_tool = 'kraken' run_metagenomic_screening = true input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq.tsv' // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta' database = 'https://github.com/nf-core/test-datasets/raw/eager/databases/kraken/eager_test.tar.gz' } ================================================ FILE: conf/test_tsv_pretrim.config ================================================ /* * ------------------------------------------------- * Nextflow config file for running tests * ------------------------------------------------- * Defines bundled input files and everything required * to run a fast and simple test. 
Use as follows: * nextflow run nf-core/eager -profile test, docker (or singularity, or conda) */ includeConfig 'test_resources.config' params { config_profile_name = 'Test profile' config_profile_description = 'Minimal test dataset to check pipeline function' // Limit resources so that this can run on Travis max_cpus = 2 max_memory = 6.GB max_time = 48.h genome = false //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq_pretrim.tsv' // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta' } ================================================ FILE: docs/README.md ================================================ # nf-core/eager: Documentation The nf-core/eager documentation is split into the following pages: * [Usage](usage.md) * An overview of how the pipeline works, how to run it and a description of all of the different command-line flags. * Also includes: FAQ, Troubleshooting and Tutorials * [Output](output.md) * An overview of the different results produced by the pipeline and how to interpret them. You can find a lot more documentation about installing, configuring and running nf-core pipelines on the website: [https://nf-co.re](https://nf-co.re). 
Additional pages are: * [Installation](https://nf-co.re/usage/installation) * Pipeline configuration * [Local installation](https://nf-co.re/usage/local_installation) * [Adding your own system config](https://nf-co.re/usage/adding_own_config) * [Reference genomes](https://nf-co.re/usage/reference_genomes) * [Contribution Guidelines](../.github/CONTRIBUTING.md) * Basic contribution & behaviour guidelines * Checklists and guidelines for people who would like to contribute code ================================================ FILE: docs/images/README.md ================================================ # Documentation Images Information The font used for all documentation images is Kalam by Indian Type Foundry and is released under the [Open Font License](https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL) Originally downloaded from [Google Fonts](https://fonts.google.com/specimen/Kalam?sidebar.open&selection.family=Kalam:wght@300;400;700) ================================================ FILE: docs/images/usage/nfcore-eager_tsv_template.tsv ================================================ Sample_Name Library_ID Lane Colour_Chemistry SeqType Organism Strandedness UDG_Treatment R1 R2 BAM ================================================ FILE: docs/output.md ================================================ # nf-core/eager: Output ## Introduction The output of nf-core/eager primarily consists of the following main components: output alignment files (e.g. VCF, BAM or FASTQ files), and summary statistics of the whole run presented in a [`MultiQC`](https://multiqc.info) report. Intermediate files and module-specific statistics files are also retained depending on your particular run configuration. 
## Directory Structure

The default directory structure of nf-core/eager is as follows:

```bash
results/
├── MultiQC/
├── /
├── /
├── /
├── pipeline_info/
└── reference_genome/
work/
```

* The parent directory `` is the parent directory of the run, either the directory the pipeline was run from or as specified by the `--outdir` flag. The default name of the output directory (unless otherwise specified) will be `./results/`.

### Primary Output Directories

These directories are the ones you will use on a day-to-day basis and are those which you should familiarise yourself with.

* The `MultiQC` directory is the most important directory and contains the main summary report of the run in HTML format, which can be viewed in a web-browser of your choice. The sub-directory contains the MultiQC collected data used to build the HTML report. The Report allows you to get an overview of the sequencing and mapping quality as well as aDNA metrics (see the [MultiQC Report](#multiqc-report) section for more detail).
* A `` directory contains the (cleaned-up) output from a particular software module. This is the second most important set of directories. This contains output files such as FASTQ, BAM, statistics, and/or plot files of a specific module (see the [Output Files](#output-files) section for more detail). The latter two are only needed when you need finer detail about that particular part of the pipeline.

### Secondary Output Directories

These are less important directories which are used less often, normally in the context of bug-reporting.

* `pipeline_info/`: [Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
  * Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
  * Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`.
  * Documentation for interpretation of results in HTML format: `results_description.html`.
* `reference_genome/` contains either text files describing the location of specified reference genomes, and if not already supplied when running the pipeline, auxiliary indexing files. This is often useful when re-running other samples using the same reference genome, but is otherwise often not important.
* The `work/` directory contains all the `nextflow` processing directories. This is where `nextflow` actually does all the work, but in an efficient programmatic procedure that is not intuitive to human readers. Because of this, the directory is often not important to a user, as all the useful output files are linked to the module directories (see above). Otherwise, this directory may be useful when bug-reporting.

> :warning: Note that `work/` will be created wherever you are running the `nextflow run` command from, unless you specify the location with `-w`, i.e. it will not by default be in `outdir`!

## MultiQC Report

In this section we will run through the output of each **default** module as reported in a MultiQC output. This can be viewed by opening the HTML file in your `/MultiQC/` directory in a web browser.

The section will also provide some basic tips on how to interpret the plots and values, although we highly recommend reading the READMEs or original papers of the tools used in the pipeline.
A list of references can be seen on the [nf-core/eager github repository](https://github.com/nf-core/eager/).

For more information about how to use MultiQC reports, see [http://multiqc.info](http://multiqc.info)

### General Stats Table

#### Background

This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than from stats produced by a specific module. It shows whatever each module considers to be the 'most important' values to display. However, the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks.

#### Table

This table will report per-file, per-library, or per-sample statistics depending on which stage along the pipeline you have gone through. MultiQC will try to collapse the rows as far as possible if the log files have the same name. However, separate libraries will be displayed separately, for example when using DamageProfiler with TSV input and merging of samples is performed (which would be reported at the qualimap level).

If you're only interested in a single part of the results (e.g. qualimap) you can use the `Configure Columns` button to remove columns, and the corresponding rows will not be displayed, resulting in a more compact table.

Each column name is supplied by the module, so you may see similar column names. When unsure, hovering over the column name will allow you to see which module it is derived from.

The possible columns displayed by default are as follows (note you may see additional columns depending on what other modules you activate):

* **Sample Name** This is the log file name without file suffix(s). This will depend on the module outputs.
* **Nr. Input Reads** This is from Pre-AdapterRemoval FastQC. Represents the number of raw reads in your untrimmed and (paired end) unmerged FASTQ file.
Each row should be approximately equal to the number of reads you requested to be sequenced, divided by the number of FASTQ files you received for that library.
* **Length Input Reads** This is from Pre-AdapterRemoval FastQC. This is the average read length in your untrimmed and (paired end) unmerged FASTQ file and should represent the number of cycles of your sequencing chemistry.
* **% GC Input Reads** This is from Pre-AdapterRemoval FastQC. This is the average GC content in percent of all the reads in your untrimmed and (paired end) unmerged FASTQ file.
* **GC content** This is from FastP. This is the average GC of all reads in your untrimmed and unmerged FASTQ file after poly-G tail trimming. If you have lots of tails, this value should drop from the pre-AdapterRemoval FastQC %GC column.
* **% Trimmed** This is from AdapterRemoval. It is the percentage of reads which had an adapter sequence removed from the end of the read.
* **Nr. Processed Reads** This is from Post-AdapterRemoval FastQC. Represents the number of preprocessed reads in your adapter trimmed (paired end) merged FASTQ file. The loss between this number and the Pre-AdapterRemoval FastQC can give you an idea of the quality of trimming and merging.
* **% GC Processed Reads** This is from Post-AdapterRemoval FastQC. Represents the average GC of all preprocessed reads in your adapter trimmed (paired end) merged FASTQ file.
* **Length Processed Reads** This is from Post-AdapterRemoval FastQC. This is the average read length in your trimmed and (paired end) merged FASTQ file and should represent the 'realistic' average lengths of your DNA molecules.
* **% Aligned** This is from bowtie2. It reports the percentage of input reads that mapped to your reference genome. This number will likely be similar to Endogenous DNA % (see below).
* **% Metagenomic Mappability** This is from MALT. It reports the percentage of the off-target reads (from mapping) that could map to your MALT metagenomic database.
This can often be low for aDNA due to short reads and database bias.
* **% Unclassified** This is from Kraken. It reports the percentage of reads that could not be aligned and taxonomically assigned against your Kraken metagenomic database. This can often be high for aDNA due to short reads and database bias.
* **Nr. Reads Into Mapping** This is from Samtools. This is the raw number of preprocessed reads that went into mapping.
* **Nr. Mapped Reads** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _prior to_ map quality filtering.
* **Endogenous DNA (%)** This is from the endorS.py tool. It displays the percentage of mapped reads over the total reads that went into mapping (i.e. the percentage DNA content of the library that matches the reference). Assuming a perfect ancient sample with no modern contamination, this would be the amount of true ancient DNA in the sample. However, this value _most likely_ includes contamination and will not entirely be the true 'endogenous' content.
* **Nr. Mapped Reads Post-Filter** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _after_ map quality filtering (note the column name does not distinguish itself from the pre-filtering column, but the post-filter column is always second).
* **Endogenous DNA Post-Filter (%)** This is from the endorS.py tool. It displays the percentage of mapped reads _after_ BAM filtering (i.e. for mapping quality and/or BAM-level length filtering) over the total reads that went into mapping (i.e. the percentage DNA content of the library that matches the reference). This column will only be displayed if BAM filtering is turned on. It is based on the original mapping for total reads, and on mapped reads as calculated from the post-filtering BAM.
* **ClusterFactor** This is from **DeDup only**. This is a value representing how many duplicates in the library exist for each unique read.
This ratio is calculated as `reads_before_deduplication / reads_after_deduplication`. It can be converted to %Dups by calculating `1 - (1 / CF)`. A cluster factor close to one indicates a highly complex library that could be sequenced further. Generally, with a value of more than 2 you will not be gaining much more information by sequencing deeper.
* **% Dup. Mapped Reads** This is from **Picard's markDuplicates only**. It represents the percentage of reads in your library that were exact duplicates of other reads in your library. The lower the better, as a high duplication rate means lots of sequencing of the same information (and therefore is not time or cost effective).
* **X Prime Y>Z N base** These columns are from DamageProfiler or mapDamage. The prime numbers represent which end of the reads the damage is referring to. The Y>Z is the type of substitution (C>T is the true damage, G>A is the complementary). You should see for no- and half-UDG treatment a decrease in frequency from the 1st to 2nd base.
* **Mean Length Mapped Reads** This is from DamageProfiler. This is the mean length of all de-duplicated mapped reads. Ancient DNA normally will have a mean between 30-75, however this can vary.
* **Median Length Mapped Reads** This is from DamageProfiler. This is the median length of all de-duplicated mapped reads. Ancient DNA normally will have a median between 30-75, however this can vary.
* **Nr. Dedup. Mapped Reads** This is from Qualimap. This is the total number of _deduplicated_ reads that mapped to your reference genome. This is the **best** number to report for final mapped reads in final publications.
* **Mean/Median Coverage** This is from Qualimap. This is the mean/median number of times a base on your reference genome was covered by a read (i.e. depth coverage). This average includes bases with 0 reads covering that position.
* **>= 1X** to **>= 5X** These are from Qualimap. This is the percentage of the genome covered at that particular depth coverage.
* **% GC Dedup. Mapped Reads** This is the mean GC content in percent of all mapped reads post-deduplication. This should normally be close to the GC content of your reference genome.
* **MT to Nuclear Ratio** This is from MTtoNucRatio. This reports the number of reads aligned to a mitochondrial entry in your reference FASTA relative to all other entries. This will typically be high but will vary depending on tissue type.
* **SexDet Rate X Chr** This is from Sex.DetERRmine. This is the relative depth of coverage on the X-chromosome.
* **SexDet Rate Y Chr** This is from Sex.DetERRmine. This is the relative depth of coverage on the Y-chromosome.
* **#SNPs Covered** This is from eigenstrat\_snp\_coverage. The number of called SNPs after genotyping with pileupcaller.
* **#SNPs Total** This is from eigenstrat\_snp\_coverage. The maximum number of covered SNPs, i.e. the number of SNPs in the .snp file provided to pileupcaller with `--pileupcaller_snpfile`.
* **Number of SNPs** This is from ANGSD. The number of SNPs left after removing sites with no data in a 5 base pair surrounding region.
* **Contamination Estimate (Method1_ML)** This is from the nuclear contamination function of ANGSD. The Maximum Likelihood contamination estimate according to Method 1. The estimates using Method of Moments and/or those based on Method 2 can be unhidden through the "Configure Columns" button.
* **Estimate Error (Method1_ML)** This is from ANGSD. The standard error of the Method1 Maximum Likelihood estimate. The errors associated with Method of Moments and/or Method2 estimates can be unhidden through the "Configure Columns" button.
* **% Hets** This is from MultiVCFAnalyzer. This reports the number of SNPs on an assumed haploid organism that have two possible alleles. A high percentage may indicate cross-mapping from a related species.

For other non-default columns (activated under 'Configure Columns'), hover over the column name for further descriptions.
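
The arithmetic behind the ClusterFactor and Endogenous DNA (%) columns can be sketched in a few lines of Python. This is only an illustration of the formulas quoted in the column descriptions; it is not code from the pipeline, and the function names are hypothetical.

```python
def cluster_factor(reads_before_dedup: int, reads_after_dedup: int) -> float:
    """ClusterFactor = reads_before_deduplication / reads_after_deduplication."""
    if reads_after_dedup <= 0:
        raise ValueError("reads_after_dedup must be a positive number")
    return reads_before_dedup / reads_after_dedup


def duplication_fraction(cf: float) -> float:
    """Convert a ClusterFactor (CF) into a duplication fraction: 1 - (1 / CF)."""
    if cf < 1:
        raise ValueError("a cluster factor cannot be smaller than 1")
    return 1 - (1 / cf)


def endogenous_percent(mapped_reads: int, reads_into_mapping: int) -> float:
    """Percentage of the reads that went into mapping which mapped to the reference."""
    if reads_into_mapping <= 0:
        raise ValueError("reads_into_mapping must be a positive number")
    return 100 * mapped_reads / reads_into_mapping


if __name__ == "__main__":
    # Hypothetical library: 200 mapped reads collapse to 100 unique reads.
    cf = cluster_factor(200, 100)
    print(f"ClusterFactor: {cf:.2f}")
    print(f"Duplication:   {duplication_fraction(cf):.0%}")
    # Hypothetical library: 50 of 1000 reads mapped to the reference.
    print(f"Endogenous:    {endogenous_percent(50, 1000):.1f}%")
```

With these made-up numbers, a ClusterFactor of 2 corresponds to half of the mapped reads being duplicates, which is the point at which the column description suggests deeper sequencing yields little new information.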
### FastQC

#### Background

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your raw reads. It provides information about the quality score distribution across your reads and the per base sequence content (%T/A/G/C) as sequenced. You also get information about adapter contamination and other overrepresented sequences.

You will receive output for each supplied FASTQ file.

When dealing with ancient DNA data, the MultiQC plots for FastQC will often show lots of 'warning' or 'failed' samples. You can generally disregard this sort of information, as we are dealing with very degraded and metagenomic samples which have artefacts that violate the FastQC 'quality definitions', while still being valid data for aDNA researchers. Instead, you will _normally_ be looking for 'global' patterns across all samples of a sequencing run to check for library construction or sequencing failures. The decision on whether an individual sample has 'failed' or not should be made by the user after checking all the plots themselves (e.g. if the sample is consistently an outlier to all others in the run).

For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).

> **NB:** The FastQC (pre-Trimming) plots displayed in the MultiQC report show _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the FastQC (post-Trimming) section. You should expect that most of the artefacts are removed after AdapterRemoval.
> :warning: If you turned on `--post_ar_fastq_trimming`, your 'post-Trimming' report shows the statistics _after_ this trimming. There is no separate report for the post-AdapterRemoval trimming.

#### Sequence Counts

This shows a barplot with the overall number of sequences (x-axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired-end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself, unless you had your library sequenced over multiple lanes, in which case it should be what you requested divided by the number of lanes it was split over.

A section of the bar will also show an approximate estimate of the fraction of the total number of reads that are duplicates of another. These can derive from over-amplification of the library, or lots of single adapters. This can be checked later with the Deduplication check. A good library and sequencing run should have very low numbers of duplicate reads.

#### Sequence Quality Histograms

This line plot represents the Phred scores across each base pair of all the reads. The x-axis is the base position across each read, and the y-axis is the average base-calling score (Phred-scaled) of the nucleotides across all reads. Again, this is per FASTQ file (i.e. forward/reverse and/or lanes separately). The background colours represent approximate ranges of quality, with the green section being acceptable quality, orange dubious and red bad.

You will often see that the first 5 or so bases have slightly lower quality than the rest of the read, as this corresponds to the calibration steps of the machine. The bulk of the read should then stay ~35. Do not worry if the last 10-20 bases of reads have lower quality base calls than the middle of the read, as the sequencing reagents start to deplete during these cycles (e.g. making nucleotide fluorescence weaker). Furthermore, the reverse reads of sequencing data will often be even lower at the ends than forward reads for the same reason.
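To make the Phred scale used in this plot more concrete, here is a minimal Python sketch. It assumes the common Phred+33 FASTQ encoding, and the function names are hypothetical helpers for illustration only, not part of FastQC or nf-core/eager.

```python
def phred_to_error_probability(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality_string(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def mean_quality_per_position(quality_strings):
    """Average Phred score at each read position; reads may differ in length."""
    totals, counts = [], []
    for quality in quality_strings:
        for i, q in enumerate(decode_quality_string(quality)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33
print(mean_quality_per_position(["II#", "I#"]))
print(phred_to_error_probability(30))
```

A score of 30 corresponds to an expected error rate of 1 in 1000 base calls, which is why a run that stays around 35 for the bulk of the read is considered good.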

Things to watch out for:

* all positions having Phred scores less than 27
* a sharp drop-off of quality early in the read
* for paired-end data, if either R1 or R2 is significantly lower quality across the whole read compared to the complementary read.

#### Per Sequence Quality Scores

This is a further summary of the previous plot. This is a histogram of the _overall_ read quality (compared to per-base, above). The x-axis is the mean read-quality score (summarising all the bases of the read in a single value), and the y-axis is the number of reads with this Phred score. You should see a peak with the majority of your reads between 27-35.

Things to watch out for:

* bi-modal peaks, which suggest artefacts in some of the sequencing cycles
* all peaks being in orange or red sections, which suggests an overall bad sequencing run (possibly due to a faulty flow-cell).

#### Per Base Sequencing Content

This is a heatmap which shows the average percentage of C, G, T, and A nucleotides across ~4bp bins across all reads.

You expect to see the whole heatmap as a relatively equal block of colour (normally black), representing an equal mix of A, C, T, G colours (see legend).

Things to watch out for:

* If you see a particular colour becoming more prominent this suggests there is an over-representation of those bases at that base-pair range across all reads (e.g. 20-24bp). This could happen if you have lots of PCR duplicates, or poly-G tails from Illumina NextSeq/NovaSeq 2-colour chemistry data (where no fluorescence can mean either G or 'no-call').

> If you see poly-G tails, we recommend turning on FastP poly-G trimming with EAGER. See the 'running' documentation page for details.

#### Per Sequence GC Content

This line graph shows the percentage of reads (y-axis) with a given average percent GC content (x-axis). In 'isolate' samples (i.e. where the majority of the reads should be from the host species of the sample), this should be represented by a sharp peak around the average percent GC content of the reference genome. In metagenomic contexts this should be a wide flat distribution with a mean around 50%, however this can be highly different for other types of data.
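As a rough illustration of what goes into this plot, the sketch below computes the percent GC of each read and tallies reads per rounded GC value. This is a simplified Python sketch with hypothetical function names; FastQC's own binning and its handling of ambiguous bases may differ.

```python
from collections import Counter


def gc_percent(sequence):
    """Percent of called bases (A/C/G/T) in a read that are G or C; None if no called bases."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return None
    gc_count = sum(1 for base in called if base in "GC")
    return 100.0 * gc_count / len(called)


def gc_histogram(reads):
    """Number of reads at each (rounded) percent GC value."""
    counts = Counter()
    for read in reads:
        pct = gc_percent(read)
        if pct is not None:
            counts[round(pct)] += 1
    return dict(sorted(counts.items()))


# A read dominated by a poly-G tail is pushed towards 100% GC
print(gc_histogram(["ATGC", "ATAT", "GGGGGGGGGG"]))
```

A library with many poly-G reads would therefore show an additional peak at the high-GC end of the distribution.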

Things to watch out for:

* If you see a particularly high percent GC content peak with NextSeq/NovaSeq data, you may have lots of PCR duplicates, or poly-G tails from Illumina NextSeq/NovaSeq 2-colour chemistry data (where no fluorescence can mean either G or 'no-call'). Consider re-running nf-core/eager using the poly-G trimming option from `fastp`. See the 'running' documentation page for details.

#### Per Base N Content

This line graph shows you the average number of Ns found across all reads of a sample. Ns can occur for a variety of reasons, such as a low-confidence base call, or the base having been masked. The lines should be very low (as close to 0 as possible) and generally be flat across the whole read. In HiSeq data, increases in Ns may reflect the last cycles running out of chemistry.

> **NB:** Publicly downloaded data may have extremely high N contents across all reads. These normally come from 'masked' reads that may have originally been, for example, from a human sample for microbial analysis where consent for publishing the host DNA was not given. In these cases you do not need to worry about this plot.

#### Sequence Duplication Levels

This plot is somewhat similar to looking at the duplication rate or 'cluster factor' of mapped reads. In this case however FastQC takes the sequences of the first 100 thousand reads of a library, and looks to see how often a read sequence is repeated in the rest of the library.
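The counting idea described above can be sketched as follows. This is a simplified Python illustration under the assumption that only sequences first seen within the first `tracked` reads are followed through the rest of the library; the function name and parameter are hypothetical, and FastQC's actual implementation has further details not shown here.

```python
from collections import Counter


def duplication_levels(reads, tracked=100_000):
    """Count how many distinct sequences occur once, twice, three times, etc.

    Only sequences first seen within the first `tracked` reads are followed;
    their later copies anywhere in the input still count towards their total.
    """
    copies = {}
    for index, sequence in enumerate(reads):
        if sequence in copies:
            copies[sequence] += 1
        elif index < tracked:
            copies[sequence] = 1
    return dict(sorted(Counter(copies.values()).items()))


reads = ["ACGT", "ACGT", "TTGA", "CCAG", "CCAG", "CCAG"]
# One sequence seen once, one seen twice, one seen three times
print(duplication_levels(reads))
```

In a high-complexity library nearly all tracked sequences would fall into the level of 1.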

A good library should have very low rates of duplication (the vast majority of reads having a duplication rate of 1), suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possibly a very small curve at about a duplication rate of 2 or 3, then remaining at ~0 for higher duplication rates.

Note that good libraries may sometimes have small peaks at high duplication levels. This may be due to free adapters (with no inserts), or mono-nucleotide reads (e.g. GGGGG in NextSeq/NovaSeq data).

Bad libraries which have extremely low input DNA (so during amplification the same molecules have been amplified repeatedly), or a good library that has been erroneously over-amplified, will show very high duplication levels, i.e. a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free adapters (see 'Overrepresented sequences' for more information).

> **NB:** amplicon libraries such as for 16S rRNA analysis may appear here as having high duplication rates and these peaks can be ignored. This can be verified if no contaminants are found in the 'Overrepresented sequences' section.

#### Overrepresented sequences

After identifying duplicates (see the previous section), a table will be displayed in the 'Overrepresented sequences' section of the report. This is an attempt by FastQC to check whether the duplicates identified match common contaminants such as free adapters or mono-nucleotide reads.

You can then use this table to help identify where the problem occurred in the construction and sequencing of this library. E.g. if you have high duplication rates but no identified contaminants, this suggests over-amplification of reads rather than left-over adapters.
#### Adapter Content

This plot shows the percentage of reads (y-axis) which have an adapter starting at a particular position along a read (x-axis). There can be multiple lines per sample as each line represents a particular adapter.

It is common in aDNA libraries to see very rapid increases in the proportion of reads with an adapter 'early on' in the read, as by nature aDNA molecules are fragmented and very short. Palaeolithic samples can have reads as short as 25bp, so sequences can already start having adapters 25bp into a read.

This can already give you an indication of the authenticity of your library: if you see very low proportions of reads with adapters, this suggests long insert molecules that are less likely to derive from a 'true' aDNA library. On the flip-side, if you are working with modern DNA, it can give an indication of over-sonication if you have artificially fragmented your reads to lower than your target molecule length.

If you have downloaded public data, this is often uploaded with adapters already removed, so you can expect a flat distribution straight away.

When comparing pre- and post-AdapterRemoval FastQC plots of fresh sequencing data (assuming your sequencing center doesn't already remove adapters), you expect to see something similar to the left panel of the example above _pre-_ adapter removal and the right hand panel _post-_ adapter removal.

### FastP

#### Background

FastP is a general pre-processing toolkit for Illumina sequencing data. In nf-core/eager we currently only use the 'poly-G' trimming function. Poly-G tails occur at the ends of reads when using two-colour chemistry kits (i.e. in NextSeq and NovaSeq). This occurs because 'no fluorescence' is interpreted by the machine as a G; if the chemistry runs out or the read is shorter than the number of cycles in the kit, you will get lots of cycles at the ends of reads with no nucleotides, and these are then recorded as Gs. While the machine should detect a reduction in base-calling quality, this is not always the case and you will retain these tails in your FASTQ files. This can cause skews in GC content and false positive SNP calls when the reference genome has long mono-nucleotide stretches (typically in larger eukaryotic genomes).

In the case of dual-indexed paired-end sequencing, it is likely poly-G tails are less of an issue as during your AdapterRemoval step, anything past the adapter will be clipped off anyway. However you can check this under the 'Per Sequence GC Content' plot in FastQC.

> **NB:** As you are more likely to see this at the end of the run, in paired-end data you should see all 'Read 2' data having a higher GC content distribution than the 'Read 1'.

While the MultiQC report has multiple plots for FastP, we will only look at GC content as that's the functionality we use currently.

The pipeline will generate the respective output for each supplied FASTQ file.
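The idea of poly-G tail trimming can be illustrated with a deliberately simplified Python sketch. It assumes an exact run of Gs at the 3' end and a minimum tail length of 10; the function name is hypothetical, and fastp's real detection is more involved than this.

```python
def trim_poly_g_tail(sequence, quality, min_len=10):
    """Remove a trailing run of Gs if it is at least `min_len` bases long.

    Returns the (possibly shortened) sequence and its matching quality string.
    """
    stripped = sequence.rstrip("G")
    tail_length = len(sequence) - len(stripped)
    if tail_length >= min_len:
        return stripped, quality[: len(stripped)]
    return sequence, quality


read = "ACGTACGT" + "G" * 12
qual = "I" * len(read)
print(trim_poly_g_tail(read, qual))          # the 12 bp tail is removed
print(trim_poly_g_tail("ACGTAGGG", "I" * 8))  # the 3 bp tail is too short and is kept
```

Lowering `min_len` in this sketch mirrors the effect of requiring a shorter run of Gs before a tail is removed.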
#### GC Content

This line plot shows the average GC content (Y axis) across each nucleotide of the reads (X-axis). There are two buttons per read (i.e. 2 for single-end, and 4 for paired-end) representing before and after the poly-G tail trimming.

Before filtering, if you have poly-G tails, you should see the lines going up at the right-hand end of the plot.

After filtering, you should see that the average GC content along the reads is now reduced to around the general trend of the entire read.

Things to look out for:

* If you see a distinct GC content increase at the end of the reads that is not removed after filtering, check where along the read the increase seems to start. If it is less than 10 base pairs from the end, consider reducing the overlap parameter `--complexity_filter_poly_g_min`, which tells FastP how far into the read the Gs need to go before removing them.

### AdapterRemoval

#### Background

AdapterRemoval is a tool that does the post-sequencing clean up of your sequencing reads. It performs the following functions:

* 'Merges' (or 'collapses') forward and reverse reads of Paired End data
* Removes remaining library indexing adapters
* Trims low quality base tails from ends of reads
* Removes too-short reads

In more detail, merging is where the same read from the forward and reverse files of a single library (based on the flowcell coordinates) are compared to find a stretch of sequence that is the same. If this overlap reaches certain quality thresholds, the two reads are 'collapsed' into a single read, with the base quality scores updated accordingly to account for the increased base-call precision.

Adapter removal involves finding overlaps at the 5' and 3' end of reads for the artificial NGS library adapters (which connect the DNA molecule insert, and the index), and stretches that match each other are then removed from the read itself.
Note, by default AdapterRemoval does _not_ remove 'internal barcodes' (between insert and the adapter), so these statistics are not considered.

Quality trimming (or 'truncating') involves looking at ends of reads for low-confidence bases (i.e. where the FASTQ Phred score is below a certain threshold). These are then removed from the read.

Length filtering involves removing any read that does not reach the number of bases specified by a particular value.

You will receive output for each FASTQ file supplied for single end data, or for each pair of merged FASTQ files for paired end data.

#### Retained and Discarded Reads Plot

These stacked bar plots are unfortunately a little confusing when displayed in MultiQC. However, they are relatively straightforward once you understand each category. They can be displayed as counts of reads per AdapterRemoval read-category, or as percentages of the same values. Each forward(/reverse) file combination is displayed once.

The most important value is the **Retained Read Pairs** which gives you the final number of reads output into the file that goes into mapping. Note, however, this section of the stack bar _includes_ the other categories displayed (see below) in the calculation.

Other Categories:

* If paired-end, the **Singleton [mate] R1(/R2)** categories represent reads which were unable to be collapsed, possibly due to the reads being too long to overlap.
* If paired-end, **Full-length collapsed pairs** are reads which were collapsed and did not require low-quality bases at the end of reads to be removed.
* If paired-end, **Truncated collapsed pairs** are paired-end reads that were collapsed but did require the removal of low quality bases at the end of reads.
* **Discarded [mate] R1/R2** represent reads which were a part of a pair, but one member of the pair did not reach other quality criteria and was discarded. However the other member of the pair is still retained in the output file as it still reached other quality criteria.

For ancient DNA, assuming a good quality run, you expect to see the vast majority of your reads overlapping because we have such fragmented molecules. Large numbers of singletons suggest your molecules are too long and may not represent true ancient DNA.

If you see high numbers of discarded or truncated reads, you should check your FastQC results for low sequencing quality of that particular run.

#### Length Distribution Plot

The length distribution plots show the number of reads at each read-length. You can change the plot to display different categories.

* All represents the overall distribution of reads. In the case of paired-end sequencing you may see a peak at the turn-around from forward to reverse cycles.
* **Mate 1** and **Mate 2** represent the length of the forward and reverse read respectively, prior to collapsing.
* **Singleton** represents those reads that had one member of a pair discarded.
* **Collapsed** and **Collapsed Truncated** represent reads that overlapped and were able to merge into a single read, with the latter including base-quality trimming off ends of reads. These plots will start with a vertical rise representing where you are above the minimum-read threshold you set.
* **Discarded** here represents the number of reads that did not reach the read length filter. You will likely see a vertical drop at what your threshold was set to.

With paired-end ancient DNA sequencing runs you expect to see a slight increase in shorter fragments in the reverse (R2) read, as our fragments are so short we often don't reach the maximum number of cycles of that particular sequencing run.

### Bowtie2

#### Background

This module provides information on mapping when running the Bowtie2 aligner. Bowtie2, like bwa, takes raw FASTQ reads and finds the most likely place on the reference genome each read derived from. While this module is somewhat redundant with the [Samtools](#samtools) module (which reports mapping statistics for bwa) and the endorS.py endogenous DNA value in the general statistics table, it does provide some details that could be useful in certain contexts.

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes in one value.

#### Single/Paired-end alignments

This bar plot shows the number of reads in the different categories that Bowtie2 was able to align to the reference genome.

You will get slightly different plots for Paired-End (PE) and Single-End (SE) data, but they are basically the same. Ancient DNA samples typically have low endogenous DNA values, as most of the DNA from the sample is from taphonomic sources (burial environment, modern handling etc), so it is normal to get low numbers of mapping reads.

The main additional useful information compared to [Samtools](#samtools) is that these plots can inform you how many reads had multiple places on the reference the read could align to. This can occur with low complexity reads or reads derived from e.g. repetitive regions on the genome. If you have large amounts of multi-mapping reads, this can be a warning flag that there is an issue either with the reference genome or library itself (e.g. library construction artefacts). You should investigate cases like this more closely before using the data downstream.

### MALT

#### Background

MALT is a metagenomic aligner (equivalent to BLAST, but much faster). It produces direct alignments of sequencing reads to reference sequences. It is often used for metagenomic profiling or pathogen screening, and specifically in nf-core/eager, of off-target reads from genome mapping.

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes and sequencing configurations in one value.

#### Metagenomic Mappability

This bar plot gives an approximation of how many reads in your off-target FASTQ file were able to align to your metagenomic database.

Due to the low 'endogenous' content of aDNA, and the high biodiversity of modern or environmental microbes that normally exists in archaeological and museum samples, you will often get relatively low mappability percentages.

This can also be influenced by the type of database you supplied: many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.

#### Taxonomic assignment success

In addition to actually being able to align to a given reference sequence, MALT can also allow sequences without a 'taxonomic' ID to be included in a database. Furthermore, it utilises a 'lowest common ancestor' algorithm (LCA), that can result in a read getting no taxonomic identification (because it can align to multiple reference sequences with equal probability). Because of this, MultiQC also produces a bar plot indicating, of the successfully aligned reads (see Metagenomic Mappability above), how many could be assigned a taxon ID.
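To show how a 'lowest common ancestor' assignment can push a read higher up the tree, here is a naive Python sketch. The taxonomy is a simplified toy example with ranks skipped, the function names are hypothetical, and MALT's actual LCA implementation has additional parameters that are not modelled here.

```python
def lineage(taxon, parent_of):
    """Path from a taxon up to the root of the (toy) taxonomy."""
    path = [taxon]
    while taxon in parent_of:
        taxon = parent_of[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent_of):
    """Deepest node shared by the lineages of all taxa a read aligned to."""
    paths = [lineage(taxon, parent_of) for taxon in taxa]
    shared = set(paths[0]).intersection(*paths[1:])
    for node in paths[0]:  # walk from the tip towards the root
        if node in shared:
            return node
    return None


# Simplified toy taxonomy (child -> parent), for illustration only
parent_of = {
    "Yersinia pestis": "Yersinia",
    "Yersinia pseudotuberculosis": "Yersinia",
    "Yersinia": "Yersiniaceae",
    "Yersiniaceae": "Enterobacterales",
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Enterobacteriaceae": "Enterobacterales",
    "Enterobacterales": "Bacteria",
}

print(lowest_common_ancestor(["Yersinia pestis", "Yersinia pseudotuberculosis"], parent_of))
print(lowest_common_ancestor(["Yersinia pestis", "Escherichia coli"], parent_of))
```

A read that aligns equally well to more distantly related references ends up at a less specific node, which is why reads from widely shared regions often receive uninformative assignments.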

For the same reasons as above, you will often find that not very many reads are taxonomically assigned when working with aDNA. This can also occur when many of your reads are from conserved regions of genomes and can map onto multiple references. At this point LCA pushes the possible taxon identification so high up the tree that it cannot give a taxonomic assignment.

If you have multiple samples of a similar level of preservation, but one with unusually low numbers of taxonomically assigned reads, it may be worth investigating what the alignments look like in case there is some sequencing artefact (although it could just be badly preserved with little DNA).

### Kraken

#### Background

Kraken is another metagenomic classifier, but takes a different approach from the alignment used by [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not, however, do alignment, meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths. It is useful when you do not have large computing power or you want a very rapid but rough approximation of the metagenomic profile of your sample.

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes and sequencing configurations in one value.

#### Top Taxa

This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, the top taxa will make up quite a small fraction, as archaeological and museum samples have a large biodiversity from environmental microbes; therefore a large fraction of 'unclassified' can be quite normal.

However for screening for specific metagenomic profiles, such as ancient microbiomes, if the top taxa are from your specific microbiome of interest (e.g. looking at calculus for oral microbiomes, or paleofaeces for gut microbiome), this can be a good indicator that you have a well preserved sample. But of course, you must do further downstream (manual!) authentication of these taxa to ensure they are not from modern contamination.

### Samtools

#### Background

This module provides numbers in raw counts of the mapping of your DNA reads to your reference genome.

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes in one value.

#### Flagstat Plot

This dot plot shows different statistics, and the number of reads (typically as a multiple e.g. millions, or thousands) is represented by dots on the X axis.

In most cases the first two rows, 'Total Reads' and 'Total Passed QC' will be the same as EAGER (v1) does not do quality control of reads with samtools. This number should normally be the same as the number of (clipped, and if paired-end, merged) retained reads coming out of AdapterRemoval.

The third row 'Mapped' represents the number of reads that found a place that could be aligned on your reference genome. This is the raw number of mapped reads, prior to PCR duplicate removal.

The remaining rows will be 0 when running `bwa aln` as these characteristics of the data are not considered by the algorithm by default.

> **NB:** The Samtools (pre-samtools filter) plots displayed in the MultiQC report show mapped reads without mapping quality filtering. These will contain reads that can map to multiple places on your reference genome with equal or slightly lower mapping quality scores. To see how your reads look after mapping quality filtering, look at the plots in the Samtools (post-samtools filter) section. You should expect to have fewer reads after mapping quality filtering.

### DeDup

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes of the library in one value.

#### Background

DeDup is a duplicate removal tool which searches for PCR duplicates and removes them from your BAM file. We remove these duplicates because otherwise you would be artificially increasing your coverage and subsequently confidence in genotyping, by considering these lab artefacts which are not biologically meaningful. DeDup looks for reads with the same start and end coordinates, and whether they have exactly the same sequence. The main difference of DeDup versus e.g. `samtools markduplicates` is that DeDup considers _both_ ends of a read, not just the start position, so it is more precise in removing actual duplicates without penalising often already low aDNA data.

#### DeDup Plot

This stacked bar plot shows as a whole the total number of reads in the BAM file going into DeDup. The different sections of a given bar represent the following:

* **Not Removed**: the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
* **Reverse Removed**: the number of reads that were found to be a duplicate of another and removed, and that were un-collapsed reverse reads (from the earlier read merging step).
* **Forward Removed**: the number of reads that were found to be a duplicate of another and removed, and that were un-collapsed forward reads (from the earlier read merging step).
* **Merged Removed**: the number of reads that were found to be a duplicate and removed, and that were collapsed reads (from the earlier read merging step).

Exceptions to the above:

* If you do not have paired end data, you will not have sections for 'Merged removed' or 'Reverse removed'.
* If you use the `--dedup_all_merged` flag, you will not have the 'Forward removed' or 'Reverse removed' sections.
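The difference between comparing both ends of a read and comparing only the start position, as described in the DeDup background, can be sketched in Python as follows. The read records, field names and the choice of keeping the read with the highest mean quality are assumptions made for illustration, not DeDup's actual data model or tie-breaking rule.

```python
def deduplicate(reads, use_end=True):
    """Keep one read per duplicate group.

    With use_end=True a group is defined by reference, start AND end coordinate;
    with use_end=False only by reference and start coordinate.
    """
    best = {}
    for read in reads:
        key = (read["ref"], read["start"], read["end"]) if use_end else (read["ref"], read["start"])
        kept = best.get(key)
        # Assumption for this sketch: keep the read with the highest mean quality
        if kept is None or read["mean_quality"] > kept["mean_quality"]:
            best[key] = read
    return list(best.values())


reads = [
    {"name": "r1", "ref": "chr1", "start": 100, "end": 150, "mean_quality": 30},
    {"name": "r2", "ref": "chr1", "start": 100, "end": 150, "mean_quality": 35},
    {"name": "r3", "ref": "chr1", "start": 100, "end": 140, "mean_quality": 20},
]

print([read["name"] for read in deduplicate(reads)])                 # both ends compared
print([read["name"] for read in deduplicate(reads, use_end=False)])  # start only
```

In this toy example the shorter fragment `r3` survives when both ends are compared, but would be discarded as a 'duplicate' if only the start position were used.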

Things to look out for:

* The smaller the number of duplicates removed the better. If you have a small number of duplicates, and wish to sequence deeper, you can use the preseq module (see below) to make an estimate of how much deeper to sequence.
* If you have a very large number of duplicates that were removed this may suggest you have an over-amplified library, or a lot of left-over adapters that were able to map to your genome.

### Picard

#### Background

Picard is a toolkit for general BAM file manipulation with many different functions. nf-core/eager most visibly uses the 'markduplicates' tool, for the removal of exact PCR duplicates that can occur during library amplification and result in falsely inflated coverages (and overly-confident genotyping calls).

#### Mark Duplicates

The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well-preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.

The amount of unmapped reads will depend on whether you have filtered out unmapped reads or not (see the [usage/running the pipeline](usage.md) documentation for more information).

Things to look out for:

* The smaller the number of duplicates removed the better. If you have a smaller number of duplicates, and wish to sequence deeper, you can use the preseq module (see below) to make an estimate of how much deeper to sequence.
* If you have a very large number of duplicates that were removed this may suggest you have an over-amplified library, a badly preserved sample with a very low yield, or a lot of left-over adapters that were able to map to your genome.

### Preseq

#### Background

Preseq is a collection of tools that allow assessment of the complexity of the library, where complexity means the number of unique molecules in your library (i.e. not molecules with the exact same length and sequence). There are two algorithms from the tools we use: `c_curve` and `lc_extrap`. The former gives you the expected number of unique reads if you were to repeat sequencing but with fewer reads than your first sequencing run. The latter tries to extrapolate the decay in the number of unique reads you would get with re-sequencing but with more reads than your initial sequencing run.

Due to endogenous DNA being so low when doing initial screening, the maths behind `lc_extrap` often fails as there is not enough data. Therefore nf-core/eager sticks with `c_curve` which gives a similar approximation of the library complexity, but is more robust to smaller datasets.

You will receive output for each deduplicated _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes of the library in one value.
#### Complexity Curve

Using the de-duplication information from DeDup, the calculated curve (a solid line) allows you to estimate: at this sequencing depth (on the X axis), how many unique molecules would you have sequenced (along the Y axis). When you start getting DNA sequences that are mostly the same as ones you've sequenced before, it is often not cost effective to continue sequencing and is a good point to stop.

The dashed line represents a 'perfect' library containing only unique molecules and no duplicates. You are looking for your library to stay as close to this line as possible. Plateauing of your curve shows that at that point you would not be getting any more unique molecules and you shouldn't sequence further than this.
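The shape of such a curve can be illustrated with a small Python simulation. This is not preseq's estimator: it simply shuffles a list of reads (each labelled by a hypothetical molecule identifier) and counts how many distinct molecules have been seen at increasing depths.

```python
import random


def complexity_curve(molecule_ids, step, seed=0):
    """Return (depth, unique molecules seen) pairs for one random ordering of the reads."""
    rng = random.Random(seed)
    shuffled = list(molecule_ids)
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for depth, molecule in enumerate(shuffled, start=1):
        seen.add(molecule)
        if depth % step == 0 or depth == len(shuffled):
            curve.append((depth, len(seen)))
    return curve


# A 'perfect' library: every read is a new molecule, so the curve follows the diagonal
print(complexity_curve(range(100), step=25))

# An over-amplified library: 200 reads derived from only 50 molecules, so the curve plateaus at 50
print(complexity_curve([i % 50 for i in range(200)], step=50))
```

The second curve flattens once all 50 molecules have been observed, which is the plateau behaviour described above.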

Plateauing can be caused by a number of reasons:

* You have simply sequenced your library to exhaustion
* You have an over-amplified library with many PCR duplicates. You should consider rebuilding the library to maximise data to cost ratio
* You have a low quality library made up of mappable sequencing artefacts that were able to pass filtering (e.g. adapters)

### Damage Calculation

#### Background

DamageProfiler and mapDamage are tools that calculate a variety of standard 'aDNA' metrics from a BAM file. The primary plots here are the misincorporation and length distribution plots. Ancient DNA undergoes depurination and hydrolysis, causing fragmentation of molecules into gradually shorter fragments, and cytosine to thymine deamination damage, which occurs on the subsequent single-stranded overhangs at the ends of molecules.

Therefore, three main characteristics of ancient DNA are:

* Short DNA fragments
* Elevated G and As (purines) just before strand breaks
* Increased C and Ts at ends of fragments

You will receive output for each deduplicated _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes of the library in one value.

#### Misincorporation Plots

The MultiQC DamageProfiler and mapDamage module misincorporation plots show the percent frequency (Y axis) of C to T mismatches at 5' read ends and complementary G to A mismatches at the 3' ends. The X axis represents base pairs from the end of the molecule from the given prime end, going into the middle of the molecule i.e. 1st base of molecule, 2nd base of molecule etc until the 14th base pair.

The mismatches are when compared to the base of the reference genome at that position.
When looking at the misincorporation plots, keep the following in mind:

* As few-base single-stranded overhangs are more likely to occur than long overhangs, we expect to see a gradual decrease in the frequency of the modifications from position 1 to the inside of the reads.
* If your library has been **partially-UDG treated**, only the first one or two bases will display the misincorporation frequency.
* If your library has been **UDG treated** you will expect to see extremely-low to no misincorporations at read ends.
* If your library is **single-stranded**, you will expect to see only C to T misincorporations at both 5' and 3' ends of the fragments.
* We generally expect that the older the sample, or the less-ideal preservational environment (hot/wet) the greater the frequency of C to T/G to A.
* The curve will not be smooth when you have few reads informing the frequency calculation. Read counts of less than 500 are likely not reliable.
* If the `mapdamage_downsample` parameter was specified and mapDamage was used for damage calculation, the damage frequency for each base is based only on the specified number of reads.
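The per-position frequency shown in these plots can be sketched as below. This Python example assumes gap-free read/reference pairs of equal length, already oriented 5' to 3', and uses a hypothetical function name; DamageProfiler and mapDamage handle strand, gaps and clipping, which are ignored here.

```python
def c_to_t_frequency(alignments, positions=14):
    """Fraction of reference Cs read as T at each of the first `positions` bases from the 5' end.

    `alignments` is an iterable of (read_sequence, reference_sequence) pairs.
    Positions with no reference C are reported as None.
    """
    reference_c = [0] * positions
    c_to_t = [0] * positions
    for read, reference in alignments:
        for i in range(min(positions, len(read), len(reference))):
            if reference[i] == "C":
                reference_c[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [c_to_t[i] / reference_c[i] if reference_c[i] else None for i in range(positions)]


alignments = [
    ("TACG", "CACG"),  # C to T mismatch at the first base
    ("CACG", "CACG"),
    ("TTCG", "CTCG"),  # C to T mismatch at the first base
    ("CACT", "CACC"),  # C to T mismatch at the fourth base
]
print(c_to_t_frequency(alignments, positions=4))
```

With only a handful of reads per position, single mismatches swing the frequency strongly, which is why curves built from few reads look jagged.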

> **NB:** An important difference to note compared to the mapDamage tool, which DamageProfiler is otherwise an exact re-implementation of, is that the percent frequency on the Y axis is not fixed between 0 and 0.3, and will 'zoom' into small values the less damage there is.

#### Length Distribution

The MultiQC DamageProfiler and mapDamage module length distribution plots show the frequency of read lengths across forward and reverse reads respectively.

When looking at the length distribution plots, keep in mind the following:

* Your curves will likely not start at 0, and will start wherever your minimum read-length setting was when removing adapters.
* You should typically see the bulk of the distribution falling between 40-120bp, which is normal for aDNA.
* You may see large peaks at paired-end turn-arounds, due to very long reads that could not overlap for merging being present; however, these reads are normally from modern contamination.
* If the `mapdamage_downsample` parameter was specified and mapDamage was used for damage calculation, the length distribution is based only on the specified number of reads.

### QualiMap

#### Background

Qualimap is a tool which provides statistics on the quality of the mapping of your reads to your reference genome. It allows you to assess how well covered your reference genome is by your data, both in 'fold' depth (average number of times a given base on the reference is covered by a read) and 'percentage' (the percentage of all bases on the reference genome that is covered at a given fold depth). These outputs allow you to decide whether you have enough quality data for downstream applications like genotyping, and how to adjust the parameters for those tools accordingly.

> NB: Neither fold coverage nor percent coverage on their own is sufficient to assess whether you have a high quality mapping.
> Abnormally high fold coverages of a smaller region such as highly conserved genes or un-removed-adapter-containing reference genomes can artificially inflate the mean coverage, yet a high percent coverage is not useful if all bases of the genome are covered at just 1x coverage.

Note that many of the statistics from this module are displayed in the General Stats table (see above), as they represent single values that are not plottable.

You will receive output for each _sample_. This means you will receive statistics of deduplicated values of all types of libraries combined in a single value (i.e. non-UDG treated, full-UDG, paired-end, single-end all together).

:warning: If your library has no reads mapping to the reference, this will result in an empty BAM file. Qualimap will therefore not produce any output even if a BAM exists!

#### Coverage Histogram

This plot shows on the X axis the range of fold coverages that the bases of the reference genome are possibly covered by. The Y axis shows the number of bases that were covered at the given fold coverage depth as indicated on the X axis. The greater the number of bases covered at as high as possible fold coverage, the better.

Things to watch out for:

* You will typically see a direct decay from the lowest coverage to higher. A large range of coverages along the X axis is potentially suspicious.
* If you have stacking of reads i.e. a small region with an abnormally large amount of reads despite the rest of the reference being quite shallowly covered, this will artificially increase your coverage. This would be represented by a small peak that is much further along the X axis, away from the main distribution of reads.

#### Cumulative Genome Coverage

This plot shows how much of the genome in percentage (X axis) is covered by a given fold depth coverage (Y axis).

An ideal plot for this is to see an increasing curve, representing greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage, something particularly true for large genomes such as for humans.
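
The difference between fold depth and percent coverage used in the Qualimap plots can be shown with a small Python sketch. This is not how Qualimap computes its statistics; the function names and the toy per-base depth lists are hypothetical and only illustrate why mean depth alone can hide read stacking.

```python
def coverage_summary(depths):
    """Return (mean fold depth, fraction of positions covered at >= 1x)."""
    n = len(depths)
    mean_depth = sum(depths) / n
    breadth = sum(1 for d in depths if d >= 1) / n
    return mean_depth, breadth


def cumulative_coverage(depths, max_fold):
    """Fraction of positions covered at >= k fold, for k = 1..max_fold."""
    n = len(depths)
    return [sum(1 for d in depths if d >= k) / n for k in range(1, max_fold + 1)]


# Hypothetical per-base depths over a 100 bp reference
stacked = [0] * 90 + [50] * 10  # all reads piled onto 10% of the reference
even = [5] * 100                # the same number of bases spread evenly

stacked_summary = coverage_summary(stacked)
even_summary = coverage_summary(even)
```

Both toy references have the same mean depth, but only one tenth of the `stacked` reference is covered at all, which is why both values should be read together.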

#### GC Content Distribution

This plot shows the distribution of the frequency of reads at different GC contents. The X axis represents the GC content (i.e. the percentage of G and C nucleotides in a given read), and the Y axis represents the frequency.

Things to watch out for:

* This plot should normally show a normal distribution around the average GC content of your reference genome.
* Bimodal peaks may represent lab-based artefacts that should be further investigated.
* Skews of the peak to a higher GC content than the reference in Illumina dual-colour chemistry data (e.g. NextSeq or NovaSeq) may suggest long poly-G tails that are mapping to poly-G stretches of your genome. The nf-core/eager trimming option `--complexity_filter_poly_g` can be used to remove these tails by utilising the tool FastP for detection and trimming.

### Sex.DetERRmine

#### Background

Sex.DetERRmine calculates the coverage of your mapped reads on the X and Y chromosomes relative to the coverage on the autosomes (X-/Y-rate). This metric can be thought of as the number of copies of chromosomes X and Y that is found within each cell, relative to the autosomal copies. The number of autosomal copies is assumed to be two, meaning that an X-rate of `1.0` means there are two X chromosomes in each cell, while `0.5` means there is a single copy of the X chromosome per cell. Human females have two copies of the X chromosome and no Y chromosome (XX), while human males have one copy of each of the X and Y chromosomes (XY).

When a bedfile of specific sites is provided, Sex.DetERRmine additionally calculates error bars around each relative coverage estimate. For this estimate to be trustworthy, the sites included in the bedfile should be spaced far enough apart that a single sequencing read cannot overlap multiple sites. Hence, when a bedfile has not been provided, this error should be ignored. When a suitable bedfile is provided, each observation of a covered site is independent, and the error around the coverage is equal to the binomial error estimate. This error is then propagated during the calculation of relative coverage for the X and Y chromosomes.
> Note that in nf-core/eager this will be run on single- and double-stranded variants of the same library _separately_. This can also help assess for differential contamination between libraries.

#### Relative Coverage

Theoretically, males are expected to cluster around (0.5, 0.5) in the produced scatter plot, while females are expected to cluster around (1.0, 0.0). In practice, when analysing ancient DNA, the relative coverage on both axes is slightly lower than expected, and individuals can cluster around (0.45, 0.45) and (0.85, 0.05). As the number of covered sites for an individual gets smaller, the confidence in the estimate becomes lower, because it is increasingly likely to be affected by randomness in the preservation and sequencing of ancient DNA.

Placement of individuals between the male and female clusters can be indicative of low coverage and in some cases contamination, when the contaminant and sampled individuals are of opposite biological sex.

Aneuploidy of the sex chromosomes can also be identified with this approach when the placement of an individual in the scatter plot is unexpected. For example, placement of an individual around (1.0, 0.5) despite good genomic coverage is indicative of XXY karyotype (Klinefelter syndrome), while placement around (0.5, 0) could be indicative of karyotype X (Turner's syndrome).
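
To make the scatter plot coordinates concrete, here is a minimal Python sketch that computes a relative coverage and finds the nearest theoretical cluster. It is illustrative only and is not what Sex.DetERRmine does: the function names are hypothetical, the nearest-point rule is an assumption made for this example, and it ignores error bars, contamination and the aneuploidies discussed above.

```python
import math

# Theoretical (X-rate, Y-rate) cluster centres quoted in the text above
EXPECTED = {"XX": (1.0, 0.0), "XY": (0.5, 0.5)}


def relative_coverage(chrom_coverage, autosome_coverage):
    """Coverage on a sex chromosome relative to the autosomal coverage."""
    return chrom_coverage / autosome_coverage


def nearest_karyotype(x_rate, y_rate):
    """Label of the theoretical cluster closest to the observed point."""
    return min(EXPECTED, key=lambda k: math.dist((x_rate, y_rate), EXPECTED[k]))


# Hypothetical mean coverages for one individual
x_rate = relative_coverage(0.9, 2.0)
y_rate = relative_coverage(0.9, 2.0)
label = nearest_karyotype(x_rate, y_rate)
```

A point such as (1.0, 0.5) sits equally far from both centres under this rule, which is one reason unexpected placements need manual inspection rather than automatic labelling.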

#### Read Counts This plot gives you the number of reads mapped onto the autosomes, X or Y chromosomes. When the total number of mapped reads is low, the estimates are more likely to be dominated by random effects, and hence untrustworthy. For well-covered data without any skews, you should see long bars that are comprised mostly of autosomal reads. The edge of the bars in female individuals should be mostly X (some small amounts of Y reads are expected and are usually caused by random mapping on the Y chromosome). In males, the number of X-reads will still be higher (since the X chromosome is longer), but the Y reads should be clearly visible on the rightmost end of the bars. The ratio between the number of sites in each bin should roughly correlate with the difference in length in base pairs of each chromosome type. If this correlation is not observed, your data is skewed towards higher coverage on some chromosomes. This can be expected if you have enriched for a specific set of markers (e.g. Y-chromosome capture), or if the number of reads is too low.

### Bcftools

#### Background

Bcftools is a toolkit for processing and summarising VCF files, i.e. variant call format files. nf-core/eager currently uses bcftools for the `stats` functionality. This summarises in a text file a range of statistics about VCF files produced by the GATK and FreeBayes variant callers.

#### Variant Substitution Types

This stacked bar plot shows you the distribution of all types of point-mutation variants away from the reference nucleotide at each position (e.g. A>C, A>G etc.).

For low-coverage aDNA data that is neither UDG-treated, trimmed nor re-scaled, you expect to see C>T substitutions as the largest category, due to the most common ancient DNA damage being C to T deamination.

#### Variant Quality

This gives you the distribution of variant-call _qualities_ in your VCF files. Each variant is given a 'Phred-scale' like value that represents the confidence of the variant caller that it has made the right call. The scale is very similar to that of base-call values in FASTQ files (as assessed by FastQC). Distributions that have peaks at higher variant quality scores (>= 30) suggest more confident variant calls. However, in cases of low-coverage aDNA data, these distributions may not be so good.

A more detailed explanation of variant quality scores can be found in the Broad Institute's [GATK documentation](https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores).

#### Indel Distribution

This plot shows you the distribution of the sizes of insertions and deletions (InDels) in the variant calling (assuming you configured your variant caller parameters to do so).

Low-coverage aDNA data often will not have high enough coverage to accurately assess InDels. In cases of high-coverage data of small genomes such as microbes, large numbers of InDels, however, may indicate your reads are actually from a _relative_ of the reference mapped to, and should be verified downstream.
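
For the Phred-like scale mentioned under Variant Quality, the standard Phred relationship between a quality score Q and an error probability P is Q = -10 log10(P). The short Python sketch below shows the conversion in both directions; the function names are hypothetical, and how closely a given variant caller's score follows this definition depends on the caller.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong for Phred-scaled quality q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A quality of 30 corresponds to roughly a 1 in 1000 chance of a wrong call
p_q30 = phred_to_error_prob(30)
```

On this scale the `>= 30` guideline above corresponds to an error probability of about 0.001 or lower.
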
#### Variant depths This plot shows the distribution of depth coverages of each variant called. Typically higher coverage will result in higher quality variant calls (see Variant Quality, above), however in many cases in aDNA these may be low and unequally distributed (due to uneven mapping coverage from contamination). ### MultiVCFAnalyzer #### Background MultiVCFanalyzer is a SNP alignment generation tool, that allows further evaluation and filtering of SNP calls made by the GATK UnifiedGenotyper. More specifically it takes one or more VCF files as well as a reference genome, and will allow filtering of SNPs via a variety of metrics and produces a FASTA file with each sample as an entry containing 'consensus calls' at each position. #### Summary metrics This table shows the contents of the `snpStatistics.tsv` file produced by MultiVCFAnalyzer. Descriptions of each column can be seen at the MultiVCFAnalyzer page [here](https://github.com/alexherbig/MultiVCFAnalyzer#snpstatisticstsv). #### Call statistics barplot You can get different variants of the call statistics bar plot, depending on how you configured the MultiVCFAnalyzer options. If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains — particularly when dealing with low coverage data. 
Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).

If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to the same value, you will have three sections to your bars: SNP Calls (hom), Reference Calls and Discard SNP Call. The overall size of the bars will generally depend on the percentage of the genome covered, meaning less well preserved samples will likely have smaller bars than well-preserved samples. Typically you wish to have a low number of discarded SNP calls, but this can be quite high when you have low coverage data (as many positions may not reach that threshold). The number of SNP calls to reference calls can vary depending on the mutation rate of your target organism.

## Output Files

This section gives a brief summary of where to look for what files for downstream analysis. This covers _all_ modules.

Each module has its own output directory which sits alongside the `MultiQC/` directory from which you opened the report.

* `reference_genome/`: this directory contains the indexing files of your input reference genome (i.e. the various `bwa` indices, a `samtools` `.fai` file, and a picard `.dict`), if you used the `--saveReference` flag.
  * When masking of the reference is requested prior to running pmdtools, an additional directory `reference_genome/masked_genome` will be found here, containing the masked reference.
* `fastqc/`: this contains the original per-FASTQ FastQC reports that are summarised with MultiQC. These occur in both `html` (the report) and `.zip` format (raw data). The `after_clipping` folder contains the same but for after AdapterRemoval.
* `adapterremoval/`: this contains the log files (ending with `.settings`) with raw trimming (and merging) statistics after AdapterRemoval. The `output` sub-directory contains the output trimmed (and merged) `fastq` files. You can use these for downstream applications such as taxonomic binning for metagenomic studies.
* `lanemerging/`: this contains adapter-trimmed and merged (i.e.
collapsed) FASTQ files that were merged across lanes, where applicable. These files are the reads that go into mapping (when multiple lanes were specified for a library), and can be used for downstream applications such as taxonomic binning for metagenomic studies. * `post_ar_fastq_trimmed`: this contains `fastq` files that have been additionally trimmed after AdapterRemoval (if turned on). These reads are usually that had internal barcodes, or damage that needed to be removed before mapping. * `mapping/`: this contains a sub-directory corresponding to the mapping tool you used, inside of which will be the initial BAM files containing the reads that mapped to your reference genome with no modification (see below). You will also find a corresponding BAM index file (ending in `.csi` or `.bai`), and if running the `bowtie2` mapper: a log ending in `_bt2.log`. You can use these for downstream applications e.g. if you wish to use a different de-duplication tool not included in nf-core/eager (although please feel free to add a new module request on the Github repository's [issue page](https://github.com/nf-core/eager/issues)!). * `samtools/`: this contains two sub-directories. `stats/` contain the raw mapping statistics files (ending in `.stats`) from directly after mapping. `filter/` contains BAM files that have had a mapping quality filter applied (set by the `--bam_mapping_quality_threshold` flag) and a corresponding index file. Furthermore, if you selected `--bam_discard_unmapped`, you will find your separate file with only unmapped reads in the format you selected. Note unmapped read BAM files will _not_ have an index file. * `deduplication/`: this contains a sub-directory called `dedup/`, inside here are sample specific directories. Each directory contains a BAM file containing mapped reads but with PCR duplicates removed, a corresponding index file and two stats file. 
`.hist.` contains raw data for a deduplication histogram used for tools like preseq (see below), and the `.log` contains overall summary deduplication statistics.
* `endorSpy/`: this contains all JSON files exported from the endorSpy endogenous DNA calculation tool. The JSON files are generated specifically for display in the MultiQC general statistics table and are otherwise very likely not useful for you.
* `preseq/`: this contains a `.preseq` file for every BAM file that had enough deduplication statistics to generate a complexity curve for estimating the number of unique reads that will be yielded if the library is re-sequenced. You can use this file for plotting e.g. in `R` to find your sequencing target depth.
* `qualimap/`: this contains a sub-directory for every sample, which includes a qualimap report and associated raw statistic files. You can open the `.html` file in your internet browser to see the in-depth report (this will be more detailed than in MultiQC). This includes things like percent coverage, depth coverage, GC content and so on of your mapped reads.
* `damageprofiler/`: this contains sample specific directories containing raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files.
* `mapdamage/`: this contains sample specific directories containing raw statistics and damage plots from mapDamage. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files. The `Runtime_log.txt` file contains runtime information.
* `pmdtools/`: this contains raw output statistics of pmdtools (estimates of frequencies of substitutions), and BAM files which have been filtered to remove reads that do not have a Post-mortem damage (PMD) score of `--pmdtools_threshold`. * `trimmed_bam/`: this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read). * `damage_rescaling/`: this contains rescaled BAM files from mapDamage. These BAM files have damage probabilistically removed via a bayesian model, and can be used for downstream genotyping. * `genotyping/`: this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). If `--gatk_ug_keep_realign_bam` supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyping for variant calling. When pileupcaller is used to create eigenstrat genotypes, this directory also contains eigenstrat SNP coverage statistics. * `multivcfanalyzer/`: this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files. * `sex_determination/`: this contains the output for the sex determination run. 
This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per file. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer. * `nuclear_contamination/`: this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individual. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2. * `bedtools/`: this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position). * `metagenomic_complexity_filter`: this contains the output from filtering of input reads to metagenomic classification of low-sequence complexity reads as performed by `bbduk`. This will include the filtered FASTQ files (`*_lowcomplexityremoved.fq.gz`) and also the run-time log (`_bbduk.stats`) for each sample. 
**Note:** there are no sections in the MultiQC report for this module, therefore you must check the `._bbduk.stats` files to get summary statistics of the filtering. * `metagenomic_classification/`: this contains the output for a given metagenomic classifier. * Running MALT will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzip SAM files if requested. * Running kraken will contain the Kraken output and report files, as well as a merged Taxon count table. You will also get a Kraken kmer duplication table, in a [KrakenUniq](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0) fashion. This is very useful to check for breadth of coverage and detect read stacking. A small number of aligned reads (low coverage) and a kmer duplication >1 is usually a sign of read stacking, usually indicative of a false positive hit (e.g. from over-amplified libraries). _Kmer duplication is defined as: number of kmers / number of unique kmers_. You will find two kraken reports formats available: * the `*.kreport` which is the old report format, without distinct minimizer count information, used by some tools such as [Pavian](https://github.com/fbreitwieser/pavian) * the `*.kraken2_report` which is the new kraken report format, with the distinct minimizer count information. * finally, the `*.kraken.out` file are the direct output of Kraken2 * ⚠️ If your sample has no hits, no kraken output files will be created for that sample! * `maltextract/`: this contains a `results` directory in which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. 
damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA).
* `consensus_sequence/`: this contains three FASTA files from VCF2Genome of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAs (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) use IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.
* `merged_bams/initial`: these contain the BAM files that would go into UDG-treatment specific BAM trimming. All libraries of the same sample **and** same UDG-treatment type will be in these BAM files.
* `merged_bams/additional`: these contain the final BAM files that would go into genotyping (if genotyping is turned on). This means the files will contain all libraries of a given sample (including trimmed non-UDG or half-UDG treated libraries, if BAM trimming is turned on).
* `bcftools`: this currently contains a single directory called `stats/` that includes general statistics on variant callers producing VCF files as output by `bcftools stats`. These include things such as the number of positions, number of transitions/transversions and depth coverage of SNPs etc. These are only produced if `--run_bcftools_stats` is supplied.
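
The kmer duplication value described under `metagenomic_classification/` above (number of kmers / number of unique kmers) can be sketched in Python as follows. This is not code from the pipeline's Kraken parsing scripts: the function names are hypothetical, and the `max_reads` cut-off for 'a small number of aligned reads' is an arbitrary assumption chosen only for the example.

```python
def kmer_duplication(n_kmers, n_unique_kmers):
    """Kmer duplication as defined above: number of kmers / number of unique kmers."""
    if n_unique_kmers <= 0:
        raise ValueError("n_unique_kmers must be positive")
    return n_kmers / n_unique_kmers


def possible_read_stacking(n_reads, duplication, max_reads=100):
    """Flag taxa with few reads and duplicated kmers.

    max_reads is a hypothetical threshold, not a pipeline default.
    """
    return n_reads < max_reads and duplication > 1


# Hypothetical taxon: 20 reads, 1000 kmers of which only 250 are unique
dup = kmer_duplication(1000, 250)
flagged = possible_read_stacking(20, dup)
```

A flagged taxon is only a candidate for a false positive hit and should still be checked by hand.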
================================================
FILE: docs/usage.md
================================================
# nf-core/eager: Usage

## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/eager/usage](https://nf-co.re/eager/usage)

> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._

## Introduction

## Running the pipeline

### Quick Start

> Before you start you should change into the output directory you wish your
> results to go in. This will guarantee that when you start the Nextflow job,
> it will place all the log files and 'working' folders in the corresponding
> output directory (and not wherever else you may have executed the run from).

The typical command for running the pipeline is as follows:

```bash
nextflow run nf-core/eager --input '*_R{1,2}.fastq.gz' --fasta 'some.fasta' -profile standard,docker
```

where the reads are from FASTQ files of the same pairing.

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.

Note that the pipeline will create the following files in your working directory:

```bash
work            # Directory containing the Nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other Nextflow hidden files, eg. history of pipeline runs and old logs.
```

To see the nf-core/eager pipeline help message run: `nextflow run nf-core/eager --help`

If you want to configure your pipeline interactively using a graphical user interface, please visit [nf-co.re launch](https://nf-co.re/launch?pipeline=eager). Select the `eager` pipeline and the version you intend to run, and follow the on-screen instructions to create a config for your pipeline run.

### Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version.
When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: ```bash nextflow pull nf-core/eager ``` ### Reproducibility It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the [nf-core/eager releases page](https://github.com/nf-core/eager/releases) and find the latest version number - numeric only (eg. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.3.1`. This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. Additionally, nf-core/eager pipeline releases are named after Swabian German Cities. The first release V2.0 is named "Kaufbeuren". Future releases are named after cities named in the [Swabian league of Cities](https://en.wikipedia.org/wiki/Swabian_League_of_Cities). ### Automatic Resubmission By default, if a pipeline step fails, nf-core/eager will resubmit the job with twice the amount of CPU and memory. This will occur two times before failing. ## Core Nextflow arguments > **NB:** These options are part of Nextflow and use a _single_ hyphen (pipeline > parameters use a double-hyphen). ### `-profile` Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. 
> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported. The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation). Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended. * `docker` * A generic configuration profile to be used with [Docker](https://docker.com/) * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) * `singularity` * A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) * `podman` * A generic configuration profile to be used with [Podman](https://podman.io/) * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) * `shifter` * A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/) * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) * `charliecloud` * A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/) * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) * `conda` * Please only use Conda as a last resort i.e. 
when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
  * A generic configuration profile to be used with [Conda](https://conda.io/docs/)
  * Pulls most software from [Bioconda](https://bioconda.github.io/)
* `test`
  * A profile with a complete configuration for automated testing
  * Includes links to test data so needs no other parameters

> _Important_: If running nf-core/eager on a cluster - ask your system
> administrator what profile to use.

**Institution Specific Profiles**

These are profiles specific to certain **HPC clusters**, and are centrally maintained at [nf-core/configs](https://github.com/nf-core/configs). Those listed below are regular users of nf-core/eager; if you don't see your own institution here, check the [nf-core/configs](https://github.com/nf-core/configs) repository.

* `uzh`
  * A profile for the University of Zurich Research Cloud
  * Loads Singularity and defines appropriate resources for running the pipeline.
* `binac`
  * A profile for the BinAC cluster at the University of Tuebingen
  * Loads Singularity and defines appropriate resources for running the pipeline
* `shh`
  * A profile for the S/CDAG cluster at the Department of Archaeogenetics of the Max Planck Institute for the Science of Human History
  * Loads Singularity and defines appropriate resources for running the pipeline

**Pipeline Specific Institution Profiles**

There are also pipeline-specific institution profiles, i.e. we can also offer a profile which applies special resource settings to specific steps of the pipeline, which may not apply to all pipelines. This can be seen at [nf-core/configs](https://github.com/nf-core/configs) under [conf/pipelines/eager/](https://github.com/nf-core/configs/tree/master/conf/pipeline/eager).
We currently offer a nf-core/eager specific profile for

* `shh`
  * A profile for the S/CDAG cluster at the Department of Archaeogenetics of the Max Planck Institute for the Science of Human History
  * In addition to the nf-core wide profile, this also sets the MALT resources to match our commonly used databases

Further institutions can be added at [nf-core/configs](https://github.com/nf-core/configs). Please ask the eager developers to add your institution to the list above, if you add one!

If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this, please test that the config file works with your pipeline of choice using the `-c` parameter (see definition below). You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile.

If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack) on the [`#configs` channel](https://nfcore.slack.com/channels/configs).

### `-resume`

Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.

You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.

### `-c`

Specify the path to a specific config file (this is a core Nextflow command). See the [nf-core website documentation](https://nf-co.re/usage/configuration) for more information.
#### Custom resource requests

Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of `143` (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped.

Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process `bwa` 32GB of memory, you could use the following config:

```nextflow
process {
  withName: bwa {
    memory = 32.GB
  }
}
```

To find the exact name of a process whose compute resources you wish to modify, check the live status of a Nextflow run displayed on your terminal, or check the Nextflow error for a line like so: `Error executing process > 'bwa'`. In this case the name to specify in the custom config file is `bwa`.

See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information.

If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this, please test that the config file works with your pipeline of choice using the `-c` parameter (see definition above). You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile.
If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack) on the [`#configs` channel](https://nfcore.slack.com/channels/configs).

#### `-name`

Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.

This is used in the MultiQC report (if not default) and in the summary HTML / e-mail (always).

**NB:** Single hyphen (core Nextflow option)

### Running in the background

Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.

The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.

Alternatively, you can use `screen` / `tmux` or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).

To create a screen session:

```bash
screen -R nf-core/eager
```

To disconnect, press `ctrl+a` then `d`.

To reconnect, type:

```bash
screen -r nf-core/eager
```

To end the screen session while in it, type `exit`.

#### Nextflow memory requirements

In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~/.bash_profile`):

```bash
export NXF_OPTS='-Xms1g -Xmx4g'
```

## Input Specifications

There are two possible ways of supplying input sequencing data to nf-core/eager. The most efficient, but more simplistic, way is to supply direct paths (with wildcards) to your FASTQ or BAM files, with each file or pair being considered a single library and each one run independently. TSV input requires creation of an extra file by the user and extra metadata, but allows more powerful lane and library merging.
### Direct Input Method

With this method you use `--input` to specify the path(s) of FASTQ (optionally gzipped) or BAM file(s). This option is mutually exclusive with the [TSV input method](#tsv-input-method), which is used for more complex input configurations such as lane and library merging.

When using the direct method of `--input` you can specify one or multiple samples in one or more directories. File names **must be unique**, even if in different directories.

By default, the pipeline _assumes_ you have paired-end data. If you want to run single-end data you must specify [`--single_end`](#single_end).

For example, for a single set of FASTQs, or multiple paired-end FASTQ files in one directory, you can specify:

```bash
--input 'path/to/data/sample_*_{1,2}.fastq.gz'
```

If you have multiple files in different directories, you can use additional wildcards (`*`) e.g.:

```bash
--input 'path/to/data/*/sample_*_{1,2}.fastq.gz'
```

> :warning: It is not possible to run a mixture of single-end and paired-end files in one run with the paths `--input` method! Please see the [TSV input method](#tsv-input-method) for possibilities.

**Please note** the following requirements:

1. Valid file extensions: `.fastq.gz`, `.fastq`, `.fq.gz`, `.fq`, `.bam`.
2. The path **must** be enclosed in quotes
3. The path must have at least one `*` wildcard character
4. When using the pipeline with **paired end data**, the path must use `{1,2}` notation to specify read pairs.
5. File names must be unique; having files with the same name but in different directories is _not_ sufficient
    * This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior to input.
6. Due to limitations of downstream tools (e.g. FastQC), sample IDs may be truncated after the first `.` in the name. Ensure file names are unique prior to this!
7. For input BAM files you should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses.

### TSV Input Method

As an alternative to the [direct input method](#direct-input-method), you can supply to `--input` a path to a TSV file that contains paths to FASTQ/BAM files and additional metadata. This allows for more complex procedures such as merging of sequencing data across lanes, sequencing runs, sequencing configuration types, and samples.

Schematic diagram indicating merging points of different types of libraries, given a TSV input. Dashed boxes are optional library-specific processes

> Only different libraries from a single sample that have been BAM trimmed will be merged together. Rescaled or PMD filtered libraries will not be merged prior genotyping as each library _may_ have a different model applied to it and have their own biases (i.e. users may need to play around with settings to get the damage-removal optimal). The use of the TSV `--input` method is recommended when performing more complex procedures such as lane or library merging. You do not need to specify `--single_end`, `--bam`, `--colour_chemistry`, `-udg_type` etc. when using TSV input - this is defined within the TSV file itself. You can only supply a single TSV per run (i.e. `--input '*.tsv'` will not work). This TSV should look like the following: | Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM | |-------------|------------|------|------------------|--------|----------|--------------|---------------|----|----|-----| | JK2782 | JK2782 | 1 | 4 | PE | Mammoth | double | full | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz) | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz) | NA | | JK2802 | JK2802 | 2 | 2 | SE | Mammoth | double | full | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz) | NA | NA | A template can be taken from 
[here](https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/TSV_template.tsv). > :warning: Cells **must not** contain spaces before or after strings, as this will make the TSV unreadable by nextflow. Strings containing spaces should be wrapped in quotes. When using TSV_input, nf-core/eager will merge FASTQ files of libraries with the same `Library_ID` but different `Lanes` values after adapter clipping (and merging), assuming all other metadata columns are the same. If you have the same `Library_ID` but with different `SeqType`, this will be merged directly after mapping prior BAM filtering. Finally, it will also merge BAM files with the same `Sample_ID` but different `Library_ID` after duplicate removal, but prior to genotyping. Please see caveats to this below. Column descriptions are as follows: * **Sample_Name:** A text string containing the name of a given sample of which there can be multiple libraries. All libraries with the same sample name and same SeqType will be merged after deduplication. * **Library_ID:** A text string containing a given library, which there can be multiple sequencing lanes (with the same SeqType). * **Lane:** A number indicating which lane the library was sequenced on. Files from the libraries sequenced on different lanes (and different SeqType) will be concatenated after read clipping and merging. * **Colour Chemistry** A number indicating whether the Illumina sequencer the library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour chemistry machine. This informs whether poly-G trimming (if turned on) should be performed. * **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with both an R1 [or forward] and R2 [or reverse]) and single end data (only R1 [forward], or BAM). This will affect lane merging if different per library. * **Organism:** A text string of the organism name of the sample or 'NA'. 
This currently has no functionality and can be set to 'NA', but will affect lane/library merging if different per library * **Strandedness:** A text string indicating whether the library type is'single' or 'double'. This will affect lane/library merging if different per library. * **UDG_Treatment:** A text string indicating whether the library was generated with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library merging if different per library. * **R1:** A text string of a file path pointing to a forward or R1 FASTQ file. This can be used with the R2 column. File names **must be unique**, even if they are in different directories. * **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file, or 'NA' when single end data. This can be used with the R1 column. File names **must be unique**, even if they are in different directories. * **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot be specified at the same time as R1 or R2, both of which should be set to 'NA' For example, the following TSV table: | Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM | |-------------|------------|------|------------------|---------|----------|--------------|---------------|----------------------------------------------------------------|----------------------------------------------------------------|-----| | JK2782 | JK2782 | 7 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L007_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA | | JK2782 | JK2782 | 8 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz | NA | | JK2802 | JK2802 | 7 | 4 | PE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L007_R1_001.fastq.gz.tengrand.fq.gz | 
data/JK2802_AGAATAACCTACCA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA | | JK2802 | JK2802 | 8 | 4 | SE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz | NA | NA | will have the following effects: * After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 _with the same `SeqType`_ (and all other _metadata_ columns) will be concatenated together for each **Library**. * After mapping, and prior BAM filtering, BAM files with different `SeqType` (but with all other metadata columns the same) will be merged together for each **Library**. * After duplicate removal, BAM files with different `Library_ID`s but with the same `Sample_Name` and the same `UDG_Treatment` will be merged together. * If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the same `Sample_Name`. Note the following important points and limitations for setting up: * The TSV must use actual tabs (not spaces) between cells. * The input FASTQ filenames are discarded after FastQC, all other downstream results files are based on `Sample_Name`, `Library_ID` and `Lane` columns for filenames. * _File_ names must be unique regardless of file path, due to risk of over-writing (see: [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)). * At different stages of the merging process, (as above) nf-core/eager will use as output filenames the information from the `Sample_Name`, `Library_ID` and/or `Lane` columns for filenames. * Library_IDs must be unique (other than if they are spread across multiple lanes). For example, your .tsv file must not have rows with both the strings in the Library_ID column as `Library1` and `Library1`, for **both** `SampleA` and `SampleB` in the Sample_ID column, otherwise the two `Library1.fq.gz` files may result in a filename collision. 
* If it is 'too late' and you already have duplicated FASTQ file names before starting a run, a workaround is to concatenate the FASTQ files together and supply this to a nf-core/eager run. The only downside is that you will not get independent FASTQC results for each file. * Lane IDs must be unique for each sequencing of each library. * If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly. * This also applies to the SeqType column, i.e. with the example above, if one run is PE and one run is SE, you need to give fake lane IDs to one of the runs as well. * All _BAM_ files must be specified as `SE` under `SeqType`. * You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses. * nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration * Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files (unless you use `--run_convertbam`), as only FASTQ files are lane-merged together. * nf-core/eager is able to correctly handle libraries that are sequenced multiple times on different sequencing configurations (i.e mixtures of single- and paired-end data). These will be merged after mapping and considered 'paired-end' during downstream processes. * **Important** we do not recommend choosing to use DeDup (i.e. `--dedupper 'dedup'`) when mixing PE and SE data, as SE data will not necessarily have the correct end position of the read, and DeDup requires both ends of the molecule to remove a duplicate read. Therefore you may end up with inflated (false-positive) coverages due to suboptimal deduplication. 
  * When you wish to run PE/SE data together, the default `--dedupper 'markduplicates'` is therefore preferred, as it only looks at the first position. While more conservative (i.e. it'll remove more reads even if not technically duplicates, because it assumes it can't see the true ends of molecules), it is more consistent.
  * An error will be thrown if you try to merge both PE and SE and also supply `--skip_merging`.
  * If you truly want to mix SE data and PE data but use mate-pair info for PE mapping, please run FASTQ preprocessing and mapping manually and supply BAM files for downstream processing by nf-core/eager
    * If you _regularly_ want to run the situation above, please leave a feature request on github.
* DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior to same-treated library merging).
* nf-core/eager functionality such as `--run_trim_bam` will be applied to only non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.
* Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.).
* Genotyping will typically be performed on each `sample` independently, as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype separately. In this case you **must** give the SS and DS libraries _distinct_ `Sample_Name`s; otherwise you will receive a `file collision` error in steps such as `sexdeterrmine`, and then you will need to merge these yourself. We will consider changing this behaviour in the future if there is enough interest.

## Clean up

Once a run has completed, you will have _lots_ of (some very large) intermediate files in your output directory. These are stored within the directory named `work`.
After you have verified your run completed correctly and everything in the module output directories is present as you expect and need, you can perform a clean-up.

> **Important**: Once clean-up is completed, you will _not_ be able to re-run
> the pipeline from an earlier step and you'll have to re-run from scratch.

While in your output directory, firstly verify you're only deleting files stored in `work/` with the dry run command:

```bash
nextflow clean -n
```

> :warning: some institutional profiles already have clean-up on successful run
> completion turned on by default.

If you're ready, you can then remove the files with

```bash
nextflow clean -f -k
```

This will make your system administrator very happy as you will _halve_ the hard drive footprint of the run, so be sure to do this!

## Troubleshooting and FAQs

### I get a file name collision error during merging

When using TSV input, nf-core/eager will attempt to merge all `Lanes` of a `Library_ID`, or all files with the same `Library_ID` or `Sample_Name`. However, if you have specified the same `Lane` or `Library_ID` for two sets of FASTQ files you will likely receive an error such as

```bash
Error executing process > 'library_merge (JK2782)'

Caused by:
  Process `library_merge` input file name collision -- There are multiple input files for each of the following file names: JK2782.mapped_rmdup.bam.csi, JK2782.mapped_rmdup.bam

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Execution cancelled -- Finishing pending tasks before exit
```

In this case: for lane merging errors, you can give 'fake' lane IDs to ensure they are unique (e.g. if one library was sequenced on Lane 8 of two HiSeq runs, specify lanes as 8 and 16 for each FASTQ file respectively). For library merging errors, you must modify your `Library_ID`s accordingly, to make them unique.
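
Both kinds of collision can be caught before launching a run. The following is a minimal sketch, not part of nf-core/eager, which assumes the column order of the TSV template shown earlier (`Sample_Name`, `Library_ID` and `Lane` as the first three tab-separated columns, with one header row). The function names `check_lane_collisions` and `check_library_collisions` are hypothetical helpers introduced here for illustration.

```bash
# Hypothetical pre-flight checks for an nf-core/eager input TSV.
# Assumption: columns 1-3 are Sample_Name, Library_ID, Lane (tab-separated),
# and the first row is a header.

# Print every Library_ID/Lane pair that occurs on more than one row.
check_lane_collisions() {
  awk -F '\t' 'NR > 1 { n[$2 "\t" $3]++ }
       END { for (k in n) if (n[k] > 1) print k }' "$1" | sort
}

# Print every Library_ID that is used under more than one Sample_Name.
check_library_collisions() {
  awk -F '\t' 'NR > 1 && !seen[$1 "\t" $2]++ { n[$2]++ }
       END { for (k in n) if (n[k] > 1) print k }' "$1" | sort
}

# Usage (my_input.tsv is a placeholder file name):
#   check_lane_collisions my_input.tsv
#   check_library_collisions my_input.tsv
```

Both functions print nothing when the TSV is free of the corresponding collision, so any output points at the rows whose `Lane` or `Library_ID` values need changing.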
### A library or sample is missing in my MultiQC report

In some cases a particular tool may produce no output log for MultiQC, and the sample will therefore not be displayed.

Known cases include:

* Qualimap: there will be no MultiQC output if the BAM file is empty. An empty BAM file is produced when no reads map to the reference and causes Qualimap to crash - this crash is ignored by nf-core/eager (to allow the rest of the pipeline to continue) and there will therefore be no log file for that particular sample/library

## Tutorials

### Tutorial - How to investigate a failed run

As with most pipelines, nf-core/eager can sometimes fail, either through a problem with the pipeline itself or through an issue with the program being run at the given step.

To help try and identify what has caused the error, you can perform the following steps before reporting the issue:

#### 1a Nextflow reports an 'error executing process' with command error

Firstly, take a moment to read the terminal output that is printed by an nf-core/eager command.

When reading the following, you can see that the actual _command_ failed. When you get this error, this would suggest that an actual program used by the pipeline has failed. This is identifiable when you get an `exit status` and a `Command error:`, the latter of which is what is reported by the failed program itself.
```bash ERROR ~ Error executing process > 'circulargenerator (hg19_complete_500.fasta)' Caused by: Process `circulargenerator (hg19_complete_500.fasta)` terminated with an error exit status (1) Command executed: circulargenerator -e 500 -i hg19_complete.fasta -s MT bwa index hg19_complete_500.fasta Command exit status: 1 Command output: (empty) Command error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3332) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) at java.lang.StringBuffer.append(StringBuffer.java:270) at CircularGenerator.extendFastA(CircularGenerator.java:155) at CircularGenerator.main(CircularGenerator.java:119) Work dir: /projects1/microbiome_calculus/RIII/03-preprocessing/mtCap_preprocessing/work/7f/52f33fdd50ed2593d3d62e7c74e408 Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run` -- Check '.nextflow.log' file for details ``` If you find it is a common error try and fix it yourself by changing your options in your nf-core/eager run - it could be a configuration error on your part. However in some cases it could be an error in the way we've set up the process in nf-core/eager. To further investigate, go to step 2. #### 1b Nextflow reports an 'error executing process' with no command error Alternatively, you may get an error with Nextflow itself. The most common one would be a 'process fails' and it looks like the following. 
```bash Error executing process > 'library_merge (JK2782)' Caused by: Process `library_merge` input file name collision -- There are multiple input files for each of the following file names: JK2782.mapped_rmdup.bam.csi, JK2782.mapped_rmdup.bam Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh` Execution cancelled -- Finishing pending tasks before exit ``` However in this case, there is no `exit status` or `Command error:` message. In this case this is a Nextflow issue. The example above is because a user has specified multiple sequencing runs of different libraries but with the same library name. In this case Nextflow could not identify which is the correct file to merge because they have the same name. This again can also be a user or Nextflow error, but the errors are often more abstract and less clear how to solve (unless you are familiar with Nextflow). Try to investigate a bit further and see if you can understand what the error refers to, but if you cannot - please ask on the #eager channel on the [nf-core slack](https://nf-co.re/join/slack) or leave a [github issue](https://github.com/nf-core/eager/issues). #### 2 Investigating an failed process's `work/` directory If you haven't found a clear solution to the failed process from the reported errors, you can next go into the directory where the process was working in, and investigate the log and error messages that are produced by each command of the process. For example, in the error in [1a](#1a-nextflow-reports-an-error-executing-process-with-command-error) you can see the following line ```bash Work dir: /projects1/microbiome_calculus/RIII/03-preprocessing/mtCap_preprocessing/work/7f/52f33fdd50ed2593d3d62e7c74e408 ``` > A shortened version of the 'hash' directory ID can also be seen in your > terminal while the pipeline is running in the square brackets at the beginning > of each line. 
If you change into this with `cd` and run `ls -la` you should see a collection of normal files, symbolic links (symlinks) and hidden files (indicated with `.` at the beginning of the file name).

* Symbolic links: are typically input files from previous processes.
* Normal files: are typically successfully completed output files from some of the commands in the process
* Hidden files are Nextflow generated files and include the submission commands as well as log files

When you have an error run, you can firstly check the contents of the output files to see if they are empty or not (e.g. with `cat` or `zcat`). How to interpret them depends on the program, and therefore on your knowledge of it.

Next, you can investigate `.command.err` and `.command.out`, or `.command.log`. These represent the standard out or error (in the case of `.log`, both combined) of all the commands/programs in the process - i.e. what would be printed to screen if you were running the command/program yourself. Again, view these with e.g. `cat` and see if you can identify the error of the program itself.

Finally, you can also try running the commands _yourself_. You can firstly try to do this by loading your given nf-core/eager environment (e.g. `singularity shell /<path>/<to>/nf-core-eager-X-X-X.img` or `conda activate nf-core-eager-X.X.X`), then running `bash .command.sh`.

If this doesn't work, this suggests either there is something wrong with the nf-core/eager environment configuration, _or_ there is still a problem with the program itself. To confirm the former, try running the command within the `.command.sh` file (viewable with `cat`) but with locally installed versions of programs you may already have on your system. If the command still doesn't work, it is a problem with the program or your specified configuration; in that case, please ask the developer of the tool (although we will endeavour to help as much as we can via the [nf-core slack](https://nf-co.re/join/slack) in the #eager channel). If it does work locally, please report it as a [github issue](https://github.com/nf-core/eager/issues).

### Tutorial - What are profiles and how to use them

#### Tutorial Profiles - Background

A useful feature of Nextflow is the ability to use configuration _profiles_ that can specify many default parameters and other settings on how to run your pipeline.

For example, you can use it to set your preferred mapping parameters, or specify where to keep Docker, Singularity or Conda environments, and which cluster scheduling system (and queues) your pipeline runs should normally use.

These are defined in `.config` files, and these in turn can contain different profiles that define parameters for different contexts.

For example, a `.config` file could contain two profiles, one for shallow-sequenced samples that uses only a small number of CPUs and memory e.g. `small`, and another for deep sequencing data, `deep`, that allows larger numbers of CPUs and memory. As another example you could define one profile called `loose` that contains mapping parameters to allow reads with aDNA damage to map, and then another called `strict` that reduces the likelihood of damaged DNA mapping and causing false positive SNP calls.
Within nf-core, there are two main levels of configs * Institutional-level profiles: these normally define things like paths to common storage, resource maximums, scheduling system * Pipeline-level profiles: these normally define parameters specifically for a pipeline (such as mapping parameters, turning specific modules on or off) As well as allowing more efficiency and control at cluster or Institutional levels in terms of memory usage, pipeline-level profiles can also assist in facilitating reproducible science by giving a way for researchers to 'publish' their exact pipeline parameters in way other users can automatically re-run the pipeline with the pipeline parameters used in the original publication but on their _own_ cluster. To illustrate this, lets say we analysed our data on a HPC called 'blue' for which an institutional profile already exists, and for our analysis we defined a profile called 'old_dna'. We will have run our pipeline with the following command ```bash nextflow run nf-core/eager -c old_dna_profile.config -profile hpc_blue,old_dna <...> ``` Then our colleague wished to recreate your results. As long as the `old_dna_profile.config` was published alongside your results, they can run the same pipeline settings but on their own cluster HPC 'purple'. ```bash nextflow run nf-core/eager -c old_dna_profile.config -profile hpc_purple,old_dna <...> ``` (where the `old_dna` profile is defined in `old_dna_profile.config`, and `hpc_purple` is defined on nf-core/configs) This tutorial will describe how to create and use profiles that can be used by or from other researchers. #### Tutorial Profiles - Inheritance Rules ##### Tutorial Profiles - Profiles An important thing to understand before you start writing your own profile is understanding 'inheritance' of profiles when specifying multiple profiles, when using `nextflow run`. 
When specifying multiple profiles, parameters defined in the profile in the first position will be overwritten by those in the second, and everything defined in the first and second will be overwritten by everything in a third.

This can be illustrated as follows.

```bash
            overwrites overwrites
              ┌──────┐  ┌──────┐
              ▼      │  ▼      │
-profile institution,cluster,my_paper
```

This would be translated as follows.

If your parameters looked like the following

| Parameter       | Resolved Parameters    | institution | cluster  | my_paper |
| ----------------|------------------------|-------------|----------|----------|
| --executor      | singularity            | singularity | \        | \        |
| --max_memory    | 256GB                  | 756GB       | 256GB    | \        |
| --bwa_aln       | 0.1                    | \           | 0.01     | 0.1      |

(where '\' is a parameter not defined in a given profile.)

You can see that `my_paper` inherited the `0.1` parameter over the `0.01` defined in the `cluster` profile.

> :warning: You must always check if parameters are defined in any 'upstream'
> profiles that have been set by profile administrators that you may be unaware
> of. This is to make sure there are no unwanted or unreported 'defaults' away from
> original nf-core/eager defaults.

##### Tutorial Profiles - Configuration Files

> :warning: This section is only needed for users that want to set up
> institutional-level profiles. Otherwise please skip to [Writing your own profile](#tutorial-profiles---writing-your-own-profile)

In actuality, a nf-core/eager run already contains many configs and profiles, and will normally use _multiple_ configs and profiles in a single run. Multiple configuration and profile files can be used, and each new one selected will inherit all of the previous one's parameters, and the parameters in the new one will then overwrite any that have been changed from the original.

This can be visualised here

Using the example given in the [background](#tutorial-profiles---background), if the `hpc_blue` profile has the following pipeline parameters set ```txt <...> mapper = 'bwamem' dedupper = 'markduplicates' <...> ``` However, the profile `old_dna` has only the following parameter ```txt <...> mapper = 'bwaaln' <...> ``` Then running the pipeline with the profiles in the order of the following run command: ```bash nextflow run nf-core/eager -c old_dna_profile.config -profile hpc_blue,old_dna <...> ``` In the background, any parameters in the pipeline's `nextflow.config` (containing default parameters) will be overwritten by the `old_dna_profile.config`. In addition, the `old_dna` _profile_ will overwrite any parameters set in the config but outside the profile definition of `old_dna_profile.config`. Therefore, the final profile used by your given run would look like: ```txt <...> mapper = 'bwaaln' dedupper = 'markduplicates' <...> ``` You can see here that `markduplicates` has not changed as originally defined in the `hpc_blue` profile, but the `mapper` parameter has been changed from `bwamem` to `bwaaln`, as specified in the `old_dna` profile. 
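The 'later overwrites earlier' behaviour described above can be mimicked with a toy shell model. This is only a sketch of the resolution order as this tutorial describes it, using the `hpc_blue` and `old_dna` values from the example; it is not how Nextflow itself is implemented, it assumes bash 4 or newer (for associative arrays), and the `resolve_profiles` function is hypothetical.

```shell
#!/usr/bin/env bash
# Toy model of "later profiles overwrite earlier ones" (requires bash >= 4).
# This only mimics the behaviour described in the tutorial text.

# Each profile is written as space-separated key=value pairs.
declare -A PROFILE_PARAMS=(
  [hpc_blue]="mapper=bwamem dedupper=markduplicates"
  [old_dna]="mapper=bwaaln"
)

# Resolve parameters for a comma-separated profile list, left to right.
resolve_profiles() {
  local -A resolved=()
  local -a profiles pairs
  local profile pair key
  IFS=',' read -ra profiles <<< "$1"
  for profile in "${profiles[@]}"; do
    read -ra pairs <<< "${PROFILE_PARAMS[$profile]}"
    for pair in "${pairs[@]}"; do
      # A later profile simply replaces any value set by an earlier one.
      resolved["${pair%%=*}"]="${pair#*=}"
    done
  done
  # Print in sorted key order so the output is stable.
  for key in $(printf '%s\n' "${!resolved[@]}" | sort); do
    printf '%s = %s\n' "$key" "${resolved[$key]}"
  done
}

resolve_profiles "hpc_blue,old_dna"
# dedupper = markduplicates
# mapper = bwaaln
```

To inspect what a real run would resolve, Nextflow's own `nextflow config` command with `-profile` is the more reliable check; the model above is only meant to make the ordering rule concrete.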
The order of loading of different configuration files can be seen here:

| Loading Order | Configuration File                                                                                              |
| -------------:|:----------------------------------------------------------------------------------------------------------------|
| 1             | `nextflow.config` in your current directory                                                                     |
| 2             | (if using a script for `nextflow run`) a `nextflow.config` in the directory the script is located               |
| 3             | `config` stored in your home directory under `~/.nextflow/`                                                     |
| 4             | `.config` if you specify in the `nextflow run` command with `-c`                                                |
| 5             | general nf-core institutional configurations stored at [nf-core/configs](https://github.com/nf-core/configs)    |
| 6             | pipeline-specific nf-core institutional configurations at [nf-core/configs](https://github.com/nf-core/configs) |

The loading order of these `.config` files will not normally affect the settings you use for the pipeline run itself; the profiles given to `-profile` are normally more important. However this is good to keep in mind when you need to debug profiles if your run does not use the parameters you expect.

> :warning: It is also possible to ignore every other configuration file when
> specifying a custom `.config` file by using `-C` (capital C) instead of `-c`
> (which inherits previously specified parameters)

Another thing that is important to note is that if a specific _profile_ is specified in `nextflow run`, this replaces any 'global' parameter that is specified within the config file (but outside a profile) itself - **regardless** of profile order (see above).

For example, see the example adapted from the SHH nf-core/eager pipeline-specific [configuration](https://github.com/nf-core/configs/blob/master/conf/pipeline/eager/shh.config).

This pipeline-specific profile is automatically loaded if nf-core/eager detects we are running eager, and that we specified the profile as `shh`.
```txt
// global 'fallback' parameters
params {
  // Specific nf-core/configs params
  config_profile_contact = 'James Fellows Yates (@jfy133)'
  config_profile_description = 'nf-core/eager SHH profile provided by nf-core/configs'

  // default BWA
  bwaalnn = 0.04
  bwaalnl = 32
}

// profile specific parameters
profiles {
  pathogen_loose {
    params {
      config_profile_description = 'Pathogen (loose) MPI-SHH profile, provided by nf-core/configs.'
      bwaalnn = 0.01
      bwaalnl = 16
    }
  }
}
```

If you run with `nextflow run -profile shh` to use an institutional-level nf-core config, the parameters will be read as `--bwaalnn 0.04` and `--bwaalnl 32`, as these are the default 'fallback' params indicated in the example above.

If you specify `nextflow run -profile shh,pathogen_loose`, Nextflow will, as expected, resolve the two parameters as `0.01` and `16`.

Importantly however, if you specify `-profile pathogen_loose,shh` the `pathogen_loose` **profile** will **still** take precedence over just the 'global' params.

Equally, a **process**-level defined parameter (within the nf-core/eager code itself) will take precedence over the fallback parameters in the `config` file. This is also described in the Nextflow documentation [here](https://www.nextflow.io/docs/latest/config.html#config-profiles).

This is because selecting a `profile` will always take precedence over the values specified in a config file, but outside of a profile.

#### Tutorial Profiles - Writing your own profile

We will now provide an example of how to write, use and share a project specific profile. We will use the example of [Andrades Valtueña et al. 2016](https://doi.org/10.1016/j.cub.2017.10.025).

In it they used the original EAGER (v1) to map ancient DNA to the genome of the bacterium **Yersinia pestis**.

Now, we will generate a profile that, if they were using nf-core/eager, they could share with other researchers.

In the methods they described the following:

> ... reads mapped to **Y.
pestis** CO92 reference with BWA aln (-l 16, -n 0.01,
> hereby referred to as non-UDG parameters). Reads with mapping quality scores
> lower than 37 were filtered out. PCR duplicates were removed with
> MarkDuplicates."

Furthermore, in their 'Table 1' they say they used the NCBI **Y. pestis** genome 'NC_003143.1', which can be found on the NCBI FTP server at: [https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.fna.gz](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.fna.gz)

To make a profile with these parameters for use with nf-core/eager we first need to open a text editor, and define a Nextflow 'profile' block.

```txt
profiles {

}
```

Next we need to define the name of the profile. This is what we would write in `-profile`. Let's call this AndradesValtuena2018.

```txt
profiles {
  AndradesValtuena2018 {

  }
}
```

Now we need to make a `params` 'scope'. This means these are the parameters you specifically pass to nf-core/eager itself (rather than Nextflow configuration parameters).

You should generally not add [non-`params` scopes](https://www.nextflow.io/docs/latest/config.html?highlight=profile#config-scopes) in profiles for a specific project. This is because these will normally modify the way the pipeline will run on the computer (rather than just nf-core/eager itself, e.g. the scheduler/executor or maximum memory available), and thus not allow other researchers to reproduce your analysis on their own computer/clusters.

```txt
profiles {
  AndradesValtuena2018 {
    params {

    }
  }
}
```

Now, as a cool little trick, we can use a couple of nf-core specific parameters that can help you keep track of which profile you are using when running the pipeline. The `config_profile_description` and `config_profile_contact` parameters are displayed in the console log when running the pipeline. So you can use these to check if your profile loaded as expected.
These are free text boxes so you can put what you like.

```txt
profiles {
  AndradesValtuena2018 {
    params {
        config_profile_description = 'non-UDG parameters used in Andrades Valtuena et al. 2018 Curr. Bio.'
        config_profile_contact = 'Aida Andrades Valtueña (@aidaanva)'
    }
  }
}
```

Now we can add the specific nf-core/eager parameters that will modify the mapping and deduplication parameters in nf-core/eager.

```txt
profiles {
  AndradesValtuena2018 {
    params {
        config_profile_description = 'non-UDG parameters used in Andrades Valtuena et al. 2018 Curr. Bio.'
        config_profile_contact = 'Aida Andrades Valtueña (@aidaanva)'
        fasta = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.fna.gz'
        bwaalnn = 0.01
        bwaalnl = 16
        run_bam_filtering = true
        bam_mapping_quality_threshold = 37
        dedupper = 'markduplicates'
    }
  }
}
```

Once filled in, we can save the file as `AndradesValtuena2018.config`. This you can use yourself, or upload alongside your publication for others to use.

To use the profile you just need to specify the file containing the profile you wish to use, and then the profile itself.

For example, Aida (Andrades Valtueña) at the MPI-SHH (`shh`) in Jena could run the following:

```bash
nextflow run nf-core/eager -c ///AndradesValtuena2018.config -profile shh,AndradesValtuena2018 --input '////' <...>
```

Then a colleague at a different institution, such as the SciLifeLab, could run the same profile on the UPPMAX cluster in Uppsala with:

```bash
nextflow run nf-core/eager -c ///AndradesValtuena2018.config -profile uppmax,AndradesValtuena2018 --input '////' <...>
```

And that's all there is to it. Of course you should always check that no other 'default' parameters for your given pipeline are defined in any pipeline-specific or institutional profiles.
This ensures that someone re-running the pipeline with your settings is as close to the nf-core/eager defaults as possible, and only settings specific to your given project are used. If there are 'upstream' defaults, you should explicitly specify these in your project profile.

### Tutorial - How to set up nf-core/eager for human population genetics

#### Tutorial Human Pop-Gen - Introduction

This tutorial will give a basic example on how to set up nf-core/eager to perform initial screening of samples in the context of ancient human population genetics research.

> :warning: this tutorial does not describe how to install and set up
> nf-core/eager. For this please see other documentation on the
> [nf-co.re](https://nf-co.re/usage/installation) website.

We will describe how to set up mapping of ancient sequences against the human reference genome to allow sequencing and library quality-control, estimation of nuclear contamination, genetic sex determination, and production of random draw genotypes in eigenstrat format for a specific set of sites, to be used in further analysis. For this example, I will be using the 1240k SNP set. This SNP set was first described in [Mathieson et al. 2015](https://www.nature.com/articles/nature16152) and contains various positions along the genome that have been extensively genotyped in present-day and ancient populations, and are therefore useful for ancient population genetic analysis. Capture techniques are often used to enrich DNA libraries for fragments that overlap these SNPs, and we assume this has been performed in this example.

> :warning: Please be aware that the settings used in this tutorial may not use
> settings nor produce files you would actually use in 'real' analysis. The
> settings are only specified for demonstration purposes. Please consult
> your colleagues, communities and the literature for optimal parameters.

#### Tutorial Human Pop-Gen - Preparation

Prior to setting up the nf-core/eager run, we will need:

1. Raw sequencing data in FASTQ format
2. Reference genome in FASTA format, with associated pre-made `bwa`, `samtools` and `picard SequenceDictionary` indices (however note these can be made for you with nf-core/eager, but this can make a pipeline run take much longer!)
3. A BED file with the positions of the sites of interest.
4. An eigenstrat formatted `.snp` file for the positions of interest.

We should also ensure we have the very latest version of the nf-core/eager pipeline so we have all latest bugfixes etc. In this case we will be using nf-core/eager version 2.2.0. You should always check on the [nf-core](https://nf-co.re/eager) website whether a newer release has been made (particularly point releases e.g. 2.2.1).

```bash
nextflow pull nf-core/eager -r 2.2.0
```

It is important to note that if you are planning on running multiple runs of nf-core/eager for a given project, the version should be **kept the same** for all runs to ensure consistency in settings for all of your libraries.

#### Tutorial Human Pop-Gen - Inputs and Outputs

To start, let's make a directory where all your nf-core/eager related files for this run will go, and change into it.

```bash
mkdir projectX_preprocessing20200727
cd projectX_preprocessing20200727
```

The first part of constructing any nf-core/eager run is specifying a few generic parameters that will often be common across all runs. This will be which pipeline, version and _profile_ we will use. We will also specify a unique name of the run to help us keep track of all the nf-core/eager runs you may be running.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
<...>
```

For the `-profile` parameter, I have indicated that I wish to use Singularity as my software container environment, and I will use the MPI-SHH institutional config as listed on [nf-core/configs](https://github.com/nf-core/configs/blob/master/conf/shh.config).
These profiles specify settings optimised for the specific cluster/institution, such as maximum memory available or which scheduler queues to submit to. More explanations about configs and profiles can be seen in the [nf-core website](https://nf-co.re/usage/configuration) and the [profile tutorial](#tutorial---what-are-profiles-and-how-to-use-them).

Next we need to specify our input data. nf-core/eager can accept input FASTQ files in two main ways, either with direct paths to files (with wildcards), or with a Tab-Separated Values (TSV) file which contains the paths and extra metadata. In this example, we will use the TSV method, in order to simulate a realistic use-case, such as receiving paired-end data from an Illumina NextSeq of double-stranded libraries. Illumina NextSeqs sequence a given library across four different 'lanes', so for each library you will receive four FASTQ files. The TSV input method is more useful for this context, as it allows 'merging' of these lanes after preprocessing and prior to mapping (whereas direct paths will consider each pair of FASTQ files as independent libraries/samples).
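Before launching a run it can be worth sanity-checking the TSV by hand. The following shell sketch assumes the 11-column, tab-separated layout with a header row that this tutorial uses (`Sample_Name` through `BAM`); it reports any row with the wrong number of fields and prints how many rows (lanes) each `Library_ID` has. The `check_tsv` function is a hypothetical helper and does not replicate the checks nf-core/eager itself performs.

```shell
#!/usr/bin/env bash
# Sketch only: assumes an 11-column, tab-separated file with one header row.

# Print "<Library_ID><TAB><number of rows>" for each library, sorted by ID.
# Returns a non-zero exit status if any line does not have exactly 11 fields.
check_tsv() {
  awk -F'\t' '
    NF != 11 {
      printf("line %d: expected 11 fields, found %d\n", NR, NF) > "/dev/stderr"
      bad = 1
      next
    }
    NR > 1 { lanes[$2]++ }   # column 2 is Library_ID; skip the header row
    END {
      for (lib in lanes) printf "%s\t%d\n", lib, lanes[lib] | "sort"
      close("sort")
      exit bad
    }
  ' "$1"
}

# Example usage (file name taken from this tutorial):
#   check_tsv preprocessing20200727.tsv
```

For the example file in this tutorial you would expect each library to be listed with eight rows.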
Our TSV file will look something like the following:

```bash
Sample_Name	Library_ID	Lane	Colour_Chemistry	SeqType	Organism	Strandedness	UDG_Treatment	R1	R2	BAM
EGR001	EGR001.B0101.SG1	1	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L001_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L001_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	2	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L002_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L002_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	3	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L003_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L003_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	4	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L004_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L004_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	5	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L001_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L001_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	6	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L002_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L002_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	7	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L003_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L003_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	8	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L004_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L004_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	1	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L001_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L001_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	2	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L002_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L002_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	3	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L003_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L003_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	4	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L004_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L004_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	5	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L001_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L001_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	6	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L002_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L002_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	7	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L003_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L003_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	8	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L004_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L004_R2_001.fastq.gz	NA
```

You can see that we have a single line for each pair of FASTQ files representing each `Lane`, but the `Sample_Name` and `Library_ID` columns identify and group them together accordingly.
Secondly, as we have NextSeq data, we have specified we have `2` for `Colour_Chemistry`, which is important for downstream processing (see below). See the nf-core/eager parameter documentation above for more specifications on how to set up a TSV file (e.g. why despite NextSeqs only having 4 lanes, we go up to 8 in the example above).

Alongside our input TSV file, we will also specify the paths to our reference FASTA file and the corresponding indices.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
<...>
```

We specify the paths to the reference genome and each of its corresponding tool-specific indices. Paths should always be encapsulated in quotes to ensure Nextflow evaluates them, rather than your shell! Also note that as `bwa` generates multiple index files, nf-core/eager takes a _directory_ that must contain these indices instead.

> Note the difference between single and double `-` parameters. The former
> represent Nextflow flags, while the latter are nf-core/eager specific flags.

Finally, we can also specify the output directory and the Nextflow `work/` directory (which contains 'intermediate' working files and directories).
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
<...>
```

#### Tutorial Human Pop-Gen - Pipeline Configuration

Now that we have specified the input data, we can start moving onto specifying settings for each different module we will be running. As mentioned above, we are pretending to run with NextSeq data, which is generated with a two-colour imaging technique. What this means is when you have shorter molecules than the number of cycles of the sequencing chemistry, the sequencer will repeatedly see 'G' calls (no colour) at the last few cycles, and you get long poly-G 'tails' on your reads. We therefore will turn on the poly-G clipping functionality offered by [`fastp`](https://github.com/OpenGene/fastp), and any pairs of files indicated in the TSV file as having `2` in the `Colour_Chemistry` column will be passed to `fastp`. We will not change the default minimum length of a poly-G string to be clipped.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
<...>
```

Since our input data is paired-end, we will be using `DeDup` for duplicate removal, which takes into account both the start and end of a merged read before flagging it as a duplicate. To ensure this works properly we first need to disable base quality trimming of collapsed reads within Adapter Removal.
To do this, we will provide the option `--preserve5p`. Additionally, Dedup should only be provided with merged reads, so we will need to provide the option `--mergedonly` here as well. We can then specify which dedupper we want to use with `--dedupper`. ```bash nextflow run nf-core/eager \ -r 2.2.0 \ -profile singularity,shh \ -name 'projectX_preprocessing20200727' \ --input 'preprocessing20200727.tsv' \ --fasta '../Reference/genome/hs37d5.fa' \ --bwa_index '../Reference/genome/hs37d5/' \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ --dedupper 'dedup' \ <...> ``` We then need to specify the mapping parameters for this run. The default mapping parameters of nf-core/eager are fine for the purposes of our run. Personally, I like to set `--bwaalnn` to `0.01`, (down from the default `0.04`) which reduces the stringency in the number of allowed mismatches between the aligned sequences and the reference. ```bash nextflow run nf-core/eager \ -r 2.2.0 \ -profile singularity,shh \ -name 'projectX_preprocessing20200727' \ --input 'preprocessing20200727.tsv' \ --fasta '../Reference/genome/hs37d5.fa' \ --bwa_index '../Reference/genome/hs37d5/' \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ --dedupper 'dedup' \ --bwaalnn 0.01 \ <...> ``` We may also want to remove ambiguous sequences from our alignments, and also remove off-target reads to speed up downstream processing (and reduce your hard-disk footprint). We can do this with the samtools filter module to set a mapping-quality filter (e.g. with a value of `25` to retain only slightly ambiguous alignments that might occur from damage), and to indicate to discard unmapped reads. 
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
<...>
```

Next, we will set up trimming of the mapped reads to alleviate the effects of DNA damage during genotyping. To do this we will activate trimming with `--run_trim_bam`. The libraries in this example underwent 'half' UDG treatment. This will generally restrict all remaining DNA damage to the first 2 base pairs of a fragment. We will therefore use `--bamutils_clip_double_stranded_half_udg_left` and `--bamutils_clip_double_stranded_half_udg_right` to trim 2bp on either side of each fragment.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
<...>
```

To activate human sex determination (using [Sex.DetERRmine.py](https://github.com/TCLamnidis/Sex.DetERRmine)) we will provide the option `--run_sexdeterrmine`.
Additionally, we will provide sexdeterrmine with the BED file of our SNPs of interest using the `--sexdeterrmine_bedfile` flag. Here I will use the 1240k SNP set as an example. This will cut down on computational time and also provide us with an error bar around the relative coverage on the X and Y chromosomes. If you wish to use the same bedfile to follow along with this tutorial, you can download the file from [here](https://github.com/nf-core/test-datasets/blob/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz).

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--run_sexdeterrmine \
--sexdeterrmine_bedfile '../Reference/genome/1240k.sites.bed' \
<...>
```

Similarly, we will activate nuclear contamination estimation with `--run_nuclear_contamination`. This process requires us to also specify the contig name of the X chromosome in the reference genome we are using with `--contamination_chrom_name`. Here, we are using hs37d5, where the X chromosome is simply named 'X'.
```bash nextflow run nf-core/eager \ -r 2.2.0 \ -profile singularity,shh \ -name 'projectX_preprocessing20200727' \ --input 'preprocessing20200727.tsv' \ --fasta '../Reference/genome/hs37d5.fa' \ --bwa_index '../Reference/genome/hs37d5/' \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ --dedupper 'dedup' \ --bwaalnn 0.01 \ --run_bam_filtering \ --bam_mapping_quality_threshold 25 \ --bam_unmapped_type 'discard' \ --run_trim_bam \ --bamutils_clip_double_stranded_half_udg_left 2 \ --bamutils_clip_double_stranded_half_udg_right 2 \ --run_sexdeterrmine \ --sexdeterrmine_bedfile '../Reference/genome/1240k.sites.bed' \ --run_nuclear_contamination \ --contamination_chrom_name 'X' \ <...> ``` Because nuclear contamination estimates can only be provided for males, it is possible that we will need to get mitochondrial DNA contamination estimates for any females in our dataset. This cannot be done within nf-core/eager (v2.2.0) and we will need to do this manually at a later time. However, mtDNA contamination estimates have been shown to only be reliable for nuclear contamination when the ratio of mitochondrial to nuclear reads is low ([Furtwängler et al. 2018](https://doi.org/10.1038/s41598-018-32083-0)). We can have nf-core/eager calculate that ratio for us with `--run_mtnucratio`, and providing the name of the mitochondrial DNA contig in our reference genome with `--mtnucratio_header`. Within hs37d5, the mitochondrial contig is named 'MT'. 
```bash nextflow run nf-core/eager \ -r 2.2.0 \ -profile singularity,shh \ -name 'projectX_preprocessing20200727' \ --input 'preprocessing20200727.tsv' \ --fasta '../Reference/genome/hs37d5.fa' \ --bwa_index '../Reference/genome/hs37d5/' \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ --dedupper 'dedup' \ --bwaalnn 0.01 \ --run_bam_filtering \ --bam_mapping_quality_threshold 25 \ --bam_unmapped_type 'discard' \ --run_trim_bam \ --bamutils_clip_double_stranded_half_udg_left 2 \ --bamutils_clip_double_stranded_half_udg_right 2 \ --run_sexdeterrmine \ --sexdeterrmine_bedfile '../Reference/genome/1240k.sites.bed' \ --run_nuclear_contamination \ --contamination_chrom_name 'X' \ --run_mtnucratio \ --mtnucratio_header 'MT' \ <...> ``` Finally, we need to specify genotyping parameters. First, we need to activate genotyping with `--run_genotyping`. It is also important to specify we wish to use the **trimmed** data for genotyping, to avoid the effects of DNA damage. To do this, we will specify the `--genotyping_source` as `'trimmed'`. Then we can specify the genotyping tool to use with `--genotyping_tool`. We will be using `'pileupCaller'` to produce random draw genotypes in eigenstrat format. For this process we will need to specify a BED file of the sites of interest (the same as before) with `--pileupcaller_bedfile`, as well as an eigenstrat formatted `.snp` file of these sites that is specified with `--pileupcaller_snpfile`. 
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--run_sexdeterrmine \
--sexdeterrmine_bedfile '../Reference/genome/1240k.sites.bed' \
--run_nuclear_contamination \
--contamination_chrom_name 'X' \
--run_mtnucratio \
--mtnucratio_header 'MT' \
--run_genotyping \
--genotyping_source 'trimmed' \
--genotyping_tool 'pileupcaller' \
--pileupcaller_bedfile '../Reference/genome/1240k.sites.bed' \
--pileupcaller_snpfile '../Datasets/1240k/1240k.snp'
```

With this, we are ready to submit! If running on a remote cluster/server, make sure to run this in a `screen` session or similar, so that if you get a `ssh` signal drop or want to log off, Nextflow will not crash.

#### Tutorial Human Pop-Gen - Results

Assuming the run completed without any crashes (if problems do occur, check against [parameters](https://nf-co.re/eager/parameters) that all parameters are as expected, or check the [FAQ](#troubleshooting-and-faqs)), we can now check our results in `results/`.

##### Tutorial Human Pop-Gen - MultiQC Report

In here there are many different directories containing different output files. The first directory to check is the `MultiQC/` directory. You should find a `multiqc_report.html` file. You will need to view this in a web browser, so I recommend either mounting your server to your file browser, or downloading it to your own local machine (PC/Laptop etc.).
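If you work on a remote server, a small helper can confirm the report exists before you try to copy it. This is a minimal sketch that assumes the directory and file names given above (`MultiQC/multiqc_report.html` inside the results directory); the function name `find_multiqc_report` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the report sits at <results>/MultiQC/multiqc_report.html.

# Print the path to the MultiQC report, or fail with a message on stderr.
find_multiqc_report() {
  local results_dir="${1%/}"   # strip a trailing slash if present
  local report="$results_dir/MultiQC/multiqc_report.html"
  if [ -f "$report" ]; then
    printf '%s\n' "$report"
  else
    printf 'No MultiQC report found under %s\n' "$results_dir" >&2
    return 1
  fi
}

# Example usage:
#   find_multiqc_report ./results/
```

The printed path can then be handed to whichever copy tool you normally use to bring files to your local machine.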

Once you've opened this you can go through each section and evaluate all the
results. You will likely want to check these for artefacts (e.g. weird damage
patterns on the human DNA, or weirdly skewed coverage distributions).

For example, I normally look for things like:

General Stats Table:

* Do I see the expected number of raw sequencing reads (summed across each set
  of FASTQ files per library) that was requested for sequencing?
* Does the percentage of trimmed reads look normal for aDNA, and do lengths
  after trimming look short as expected of aDNA?
* Does ClusterFactor or 'Dups' look high (e.g. >2 or >10% respectively)
  suggesting over-amplified or badly preserved samples?
* Do the mapped reads show increased frequency of C>Ts on the 5' end of
  molecules?
* Is the number of SNPs used for nuclear contamination really low for any
  individuals (e.g. < 100)? If so, then the estimates might not be very
  accurate.

FastQC (pre-AdapterRemoval):

* Do I see any very early drop off of sequence quality scores suggesting a
  problematic sequencing run?
* Do I see outlier GC content distributions?
* Do I see high sequence duplication levels?

AdapterRemoval:

* Do I see high numbers of singletons or discarded read pairs?

FastQC (post-AdapterRemoval):

* Do I see improved sequence quality scores along the length of reads?
* Do I see reduced adapter content levels?

Samtools Flagstat (pre/post Filter):

* Do I see outliers, e.g. with unusually high levels of human DNA (indicative
  of contamination), that require closer downstream assessment? Are your
  samples exceptionally preserved? If not, a value higher than e.g. 50% might
  require your attention.

DeDup/Picard MarkDuplicates:

* Do I see large numbers of duplicates being removed, possibly indicating
  over-amplified or badly preserved samples?

DamageProfiler:

* Do I see evidence of damage on human DNA?
  * High numbers of mapped reads but no damage may indicate significant modern
    contamination.
* Was the read trimming I specified enough to overcome damage effects?

SexDetERRmine:

* Do the relative coverages on the X and Y chromosome fall within the expected
  areas of the plot?
* Do all individuals have enough data for accurate sex determination?
* Do the proportions of autosomal/X/Y reads make sense? If there is an
  overrepresentation of reads within one bin, is the data enriched for that
  bin?

> Detailed documentation and descriptions for all MultiQC modules can be seen in
> the 'Documentation' folder of the results directory or here in the [output
> documentation](output.md)

If you're happy everything looks good in terms of sequencing, we then look at
specific directories to find any files you might want to use for downstream
processing.

Note that when you get back to writing up your publication, all the versions of
the tools can be found under the 'nf-core/eager Software Versions' section of
the MultiQC report. But be careful! All tools in the container are listed, so
you may have to remove some of them that you didn't actually use in the set up.

In this example, we have used: Nextflow, nf-core/eager, FastQC, AdapterRemoval,
fastP, BWA, Samtools, endorS.py, DeDup, Qualimap, PreSeq, DamageProfiler,
bamUtil, sexdeterrmine, angsd, MTNucRatioCalculator, sequenceTools, and MultiQC.

Citations to all used tools can be seen
[here](https://nf-co.re/eager#tool-references).

##### Tutorial Human Pop-Gen - Files for Downstream Analysis

You will find the eigenstrat dataset containing the random draw genotypes of
your run in the `genotyping/` directory. Genotypes from double stranded
libraries, like the ones in this example, are found in the dataset
`pileupcaller.double.{geno,snp,ind}.txt`, while genotypes for any single
stranded libraries will instead be in `pileupcaller.single.{geno,snp,ind}.txt`.
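Before passing these files to downstream tools, a quick consistency check can catch truncated output. The sketch below relies on the general EIGENSTRAT layout (one `.geno` row per SNP, one genotype character per individual in each row) and on the `.geno.txt`/`.snp.txt`/`.ind.txt` naming given above; `check_eigenstrat` is a hypothetical helper name, not something shipped with nf-core/eager.

```bash
# Check that an EIGENSTRAT trio is internally consistent.
#   $1  dataset prefix, e.g. 'results/genotyping/pileupcaller.double'
# Returns 0 if the row/column counts agree, 1 otherwise.
check_eigenstrat() {
    local prefix="$1"
    local n_snp n_ind n_geno_rows geno_width

    # Arithmetic expansion strips any padding that wc adds on some systems.
    n_snp=$(( $(wc -l < "${prefix}.snp.txt") ))
    n_ind=$(( $(wc -l < "${prefix}.ind.txt") ))
    n_geno_rows=$(( $(wc -l < "${prefix}.geno.txt") ))
    # Width of the first genotype row = number of individuals.
    geno_width=$(( $(head -n 1 "${prefix}.geno.txt" | tr -d '\r\n' | wc -c) ))

    echo "SNPs: ${n_snp}, individuals: ${n_ind}, geno rows: ${n_geno_rows}, geno width: ${geno_width}"

    [ "${n_snp}" -eq "${n_geno_rows}" ] && [ "${n_ind}" -eq "${geno_width}" ]
}
```

Run from the directory containing `results/`, `check_eigenstrat 'results/genotyping/pileupcaller.double'` prints the counts and exits non-zero if they disagree.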
#### Tutorial Human Pop-Gen - Clean up

Finally, I would recommend cleaning up your `work/` directory of any
intermediate files (if your `-profile` does not already do so). You can do this
by changing to the directory that contains your `results/` and `work/`
directories, e.g.

```bash
cd ///projectX_preprocessing20200727
```

and running

```bash
nextflow clean -f -k
```

#### Tutorial Human Pop-Gen - Summary

In this tutorial we have described an example of how to set up an nf-core/eager
run to preprocess human aDNA for population genetic studies, perform some
simple quality control checks, and generate random draw genotypes for
downstream analysis of the data. Additionally, we described what to look for in
the run summary report generated by MultiQC and where to find output files that
can be used for downstream analysis.

### Tutorial - How to set up nf-core/eager for metagenomic screening

#### Tutorial Metagenomics - Introduction

The field of archaeogenetics is now expanding from analysing the genomes of
single organisms to whole communities of taxa. One particular example is that
of human associated microbiomes, as preserved in ancient palaeofaeces (gut) or
dental calculus (oral). This tutorial will give a basic example of how to set
up nf-core/eager to perform initial screening of samples in the context of
ancient microbiome research.

> :warning: this tutorial does not describe how to install and set up
> nf-core/eager. For this please see other documentation on the
> [nf-co.re](https://nf-co.re/usage/installation) website.

We will describe how to set up mapping of ancient dental calculus samples
against the human reference genome to allow sequencing and library
quality-control, but additionally perform taxonomic profiling of the off-target
reads from this mapping using MALT, and perform aDNA authentication with HOPS.

> :warning: Please be aware that the settings used in this tutorial may not use
> settings nor produce files you would actually use in 'real' analysis. The
> settings are only specified for demonstration purposes. Please consult your
> colleagues, communities and the literature for optimal parameters.

#### Tutorial Metagenomics - Preparation

Prior to setting up an nf-core/eager run for metagenomic screening, we will
need:

1. Raw sequencing data in FASTQ format
2. Reference genome in FASTA format, with associated pre-made `bwa`, `samtools`
   and `picard SequenceDictionary` indices
3. A MALT database of your choice (see [MALT
   manual](https://software-ab.informatik.uni-tuebingen.de/download/malt/manual.pdf)
   for set-up)
4. A list of (NCBI) taxa containing well-known taxa of your microbiome (see
   below)
5. HOPS resources `.map` and `.tre` files (available
   [here](https://github.com/rhuebler/HOPS/tree/external/Resources))

We should also ensure we have the very latest version of the nf-core/eager
pipeline so we have all latest bugfixes etc. In this case we will be using
nf-core/eager version 2.2.0. You should always check on the
[nf-core](https://nf-co.re/eager) website whether a newer release has been made
(particularly point releases e.g. 2.2.1).

```bash
nextflow pull nf-core/eager -r 2.2.0
```

It is important to note that if you are planning on running multiple runs of
nf-core/eager for a given project, the version should be **kept the same** for
all runs to ensure consistency in settings for all of your libraries.

#### Tutorial Metagenomics - Inputs and Outputs

To start, let's make a directory where all your nf-core/eager related files for
this run will go, and change into it.

```bash
mkdir projectX_screening20200720
cd projectX_screening20200720
```

The first part of constructing any nf-core/eager run is specifying a few generic
parameters that will often be common across all runs. This will be which
pipeline, version and _profile_ we will use. We will also specify a unique name
of the run to help us keep track of all the nf-core/eager runs you may be
running.
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
<...>
```

For the `-profile` parameter, I have indicated that I wish to use Singularity as
my software container environment, and I will use the MPI-SHH institutional
config as listed on
[nf-core/configs](https://github.com/nf-core/configs/blob/master/conf/shh.config).
These profiles specify settings optimised for the specific cluster/institution,
such as maximum memory available or which scheduler queues to submit to. More
explanations about configs and profiles can be seen in the [nf-core
website](https://nf-co.re/usage/configuration) and the [profile
tutorial](#tutorial---what-are-profiles-and-how-to-use-them).

Next we need to specify our input data. nf-core/eager can accept input FASTQ
files in two main ways, either with direct paths to files (with wildcards), or
with a Tab-Separated-Value (TSV) file which contains the paths and extra
metadata. In this example, we will use the TSV method to simulate a realistic
use-case, such as receiving paired-end data from an Illumina NextSeq of
double-stranded libraries. Illumina NextSeqs sequence a given library across
four different 'lanes', so for each library you will receive four FASTQ files.
The TSV input method is more useful for this context, as it allows 'merging' of
these lanes after preprocessing and prior to mapping (whereas direct paths will
consider each pair of FASTQ files as independent libraries/samples).
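Because a malformed TSV is an easy mistake to make, a quick sanity check before submitting a run may save some time. The sketch below assumes the eleven-column layout used by the TSV files in these tutorials (`Sample_Name` through `BAM`) and the `2`/`4` values of `Colour_Chemistry`; `check_tsv_columns` is a hypothetical helper name and is not part of nf-core/eager.

```bash
# Report rows of an input TSV that do not have 11 tab-separated columns,
# or whose Colour_Chemistry (column 4) is neither 2 nor 4.
#   $1  path to the TSV file
# Returns 0 if no problems were found, 1 otherwise.
check_tsv_columns() {
    local tsv="$1"
    awk -F '\t' '
        NF != 11 {
            printf "line %d: %d columns, expected 11\n", NR, NF
            bad = 1
        }
        # Skip the header (line 1) when checking the colour chemistry value.
        NR > 1 && NF == 11 && $4 != 2 && $4 != 4 {
            printf "line %d: Colour_Chemistry is \"%s\", expected 2 or 4\n", NR, $4
            bad = 1
        }
        END { exit bad + 0 }
    ' "$tsv"
}
```

For example, `check_tsv_columns screening20200720.tsv` prints one message per problematic row, which is usually enough to spot a space that should have been a tab.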

Our TSV file will look something like the following:

```bash
Sample_Name	Library_ID	Lane	Colour_Chemistry	SeqType	Organism	Strandedness	UDG_Treatment	R1	R2	BAM
EGR001	EGR001.B0101.SG1	1	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L001_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L001_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	2	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L002_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L002_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	3	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L003_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L003_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	4	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L004_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L004_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	5	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L001_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L001_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	6	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L002_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L002_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	7	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L003_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L003_R2_001.fastq.gz	NA
EGR001	EGR001.B0101.SG1	8	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L004_R1_001.fastq.gz	../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L004_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	1	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L001_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L001_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	2	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L002_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L002_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	3	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L003_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L003_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	4	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L004_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L004_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	5	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L001_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L001_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	6	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L002_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L002_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	7	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L003_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L003_R2_001.fastq.gz	NA
EGR002	EGR002.B0201.SG1	8	2	PE	homo_sapiens	double	half	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L004_R1_001.fastq.gz	../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L004_R2_001.fastq.gz	NA
```

You can see that we have a single line for each pair of FASTQ files representing
each `Lane`, but the `Sample_Name` and `Library_ID` columns identify and group
them together accordingly.
Secondly, as we have NextSeq data, we have specified `2` for
`Colour_Chemistry`, which is important for downstream processing (see below).
The other columns are less important for this particular context of metagenomic
screening. See the nf-core/eager
[parameters](https://nf-co.re/eager/parameters) documentation for more
specifications on how to set up a TSV file (e.g. why despite NextSeqs only
having 4 lanes, we go up to 8 in the example above).

Alongside our input TSV file, we will also specify the paths to our reference
FASTA file and the corresponding indices.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
<...>
```

We specify the paths to each reference genome and its corresponding tool
specific index. Paths should always be encapsulated in quotes to ensure Nextflow
evaluates them, rather than your shell! Also note that as `bwa` generates
multiple index files, nf-core/eager takes a _directory_ that must contain these
indices instead.

> Note the difference between single and double `-` parameters. The former
> represent Nextflow flags, while the latter are nf-core/eager specific flags.

Finally, we can also specify the output directory and the Nextflow `work/`
directory (which contains 'intermediate' working files and directories).
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
<...>
```

#### Tutorial Metagenomics - Pipeline Configuration

Now that we have specified the input data, we can start moving onto specifying
settings for each different module we will be running. As mentioned above, we
are pretending to run with NextSeq data, which is generated with a two-colour
imaging technique. What this means is when you have shorter molecules than the
number of cycles of the sequencing chemistry, the sequencer will repeatedly see
'G' calls (no colour) at the last few cycles, and you get long poly-G 'tails' on
your reads. We therefore will turn on the poly-G clipping functionality offered
by [`fastp`](https://github.com/OpenGene/fastp), and any pairs of files
indicated in the TSV file as having `2` in the `Colour_Chemistry` column will be
passed to `fastp`. We will not change the default minimum length of a poly-G
string to be clipped.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
<...>
```

We will keep the default settings for mapping etc. against the reference genome
as we will only use this for sequencing quality control, however we now need to
specify that we want to run metagenomic screening. To do this we firstly need to
tell nf-core/eager what to do with the off target reads from the mapping.
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--run_bam_filtering \
--bam_unmapped_type 'fastq' \
<...>
```

nf-core/eager will now take all unmapped reads after mapping and convert the BAM
file back to FASTQ, which can be accepted by MALT. But of course, we also then
need to tell nf-core/eager we actually want to run MALT. We will also specify
the location of the [pre-built database](#tutorial-metagenomics---preparation)
and which 'min support' method we want to use (this specifies the minimum number
of alignments needed for a particular taxonomic node to be 'kept' in the MALT
output files). Otherwise we will keep all other parameters as default, for
example using BlastN mode, requiring a minimum of 85% identity, and requiring at
least 0.01% of alignments for a taxon to be saved (as specified with
`--malt_min_support_mode`).
More documentation describing each parameter can be seen in the usage
[documentation](usage.md).

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--run_bam_filtering \
--bam_unmapped_type 'fastq' \
--run_metagenomic_screening \
--metagenomic_tool 'malt' \
--database '../Reference/database/refseq-bac-arch-homo-2018_11' \
--malt_min_support_mode 'percent' \
<...>
```

Finally, to help quickly assess whether our sample has taxa that are known to
exist in (modern samples of) our expected microbiome, and that these alignments
have indicators of true aDNA, we will run 'maltExtract' of the
[HOPS](https://github.com/rhuebler/HOPS) pipeline.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--run_bam_filtering \
--bam_unmapped_type 'fastq' \
--run_metagenomic_screening \
--metagenomic_tool 'malt' \
--database '../Reference/database/refseq-bac-arch-homo-2018_11' \
--malt_min_support_mode 'percent' \
--run_maltextract \
--maltextract_taxon_list '../Reference/taxa_list/core_genera-anthropoids_hominids_panhomo-20180131.txt' \
--maltextract_ncbifiles '../Reference/hops' \
--maltextract_destackingoff
```

In the last parameters above we've specified the path to our list of taxa.
This contains something like (for oral microbiomes):

```text
Actinomyces
Streptococcus
Tannerella
Porphyromonas
```

We have also specified the path to the HOPS resources [downloaded
earlier](#tutorial-metagenomics---preparation), and that I want to turn off
'destacking' (removal of any read that overlaps the positions of another -
something only recommended to keep on when you have high coverage data).

With this, we are ready to submit! If running on a remote cluster/server, make
sure to run this in a `screen` session or similar, so that if you get an `ssh`
signal drop or want to log off, Nextflow will not crash.

#### Tutorial Metagenomics - Results

Assuming the run completed without any crashes (if problems do occur, check
against [parameters](https://nf-co.re/eager/parameters) that all parameters are
as expected, or check the [FAQ](#troubleshooting-and-faqs)), we can now check
our results in `results/`.

##### Tutorial Metagenomics - MultiQC Report

In here there are many different directories containing different output files.
The first directory to check is the `MultiQC/` directory. You should find a
`multiqc_report.html` file. You will need to view this in a web browser, so I
recommend either mounting your server to your file browser, or downloading it to
your own local machine (PC/Laptop etc.).

Once you've opened this you can go through each section and evaluate all the
results. You will likely not want to concern yourself too much with anything
after MALT - however you should check these for other artefacts (e.g. weird
damage patterns on the human DNA, or weirdly skewed coverage distributions).

For example, I normally look for things like:

General Stats Table:

* Do I see the expected number of raw sequencing reads (summed across each set
  of FASTQ files per library) that was requested for sequencing?
* Does the percentage of trimmed reads look normal for aDNA, and do lengths
  after trimming look short as expected of aDNA?
* Does ClusterFactor or 'Dups' look high, suggesting over-amplified or badly
  preserved samples (e.g. >2 or >10% respectively - however given this is on the
  human reads this is just a rule of thumb and may not reflect the quality of
  the metagenomic profile)?
* Does the human DNA show increased frequency of C>Ts on the 5' end of
  molecules?

FastQC (pre-AdapterRemoval):

* Do I see any very early drop off of sequence quality scores suggesting a
  problematic sequencing run?
* Do I see outlier GC content distributions?
* Do I see high sequence duplication levels?

AdapterRemoval:

* Do I see high numbers of singletons or discarded read pairs?

FastQC (post-AdapterRemoval):

* Do I see improved sequence quality scores along the length of reads?
* Do I see reduced adapter content levels?

MALT:

* Do I have a reasonable level of mappability?
  * Somewhere between 10-30% can be pretty normal for aDNA, whereas e.g. <1%
    requires careful manual assessment
* Do I have a reasonable taxonomic assignment success?
  * You hope to have a large number of the mapped reads (from the mappability
    plot) that also have taxonomic assignment.

Samtools Flagstat (pre/post Filter):

* Do I see outliers, e.g. with unusually high levels of human DNA (indicative
  of contamination), that require closer downstream assessment?

DeDup/Picard MarkDuplicates:

* Do I see large numbers of duplicates being removed, possibly indicating
  over-amplified or badly preserved samples?

DamageProfiler:

* Do I see evidence of damage on human DNA? Note this is just a
  rule-of-thumb/corroboration of any signals you might find in the metagenomic
  screening and not essential.
  * High numbers of human DNA reads but no damage may indicate significant
    modern contamination.
> Detailed documentation and descriptions for all MultiQC modules can be seen in
> the 'Documentation' folder of the results directory or here in the [output
> documentation](output.md)

If you're happy everything looks good in terms of sequencing, we then look at
specific directories to find any files you might want to use for downstream
processing.

Note that when you get back to writing up your publication, all the versions of
the tools can be found under the 'nf-core/eager Software Versions' section of
the MultiQC report. Note that all tools in the container are listed, so you may
have to remove some of them that you didn't actually use in the set up.

In the example above, we have used: Nextflow, nf-core/eager, FastQC,
AdapterRemoval, fastP, BWA, Samtools, endorS.py, Picard Markduplicates,
Qualimap, PreSeq, DamageProfiler, MALT, MaltExtract and MultiQC.

Citations to all used tools can be seen
[here](https://nf-co.re/eager#tool-references).

##### Tutorial Metagenomics - Files for Downstream Analysis

If you wanted to look at the output of MALT more closely, such as in the GUI
based tool
[MEGAN6](https://software-ab.informatik.uni-tuebingen.de/download/megan6/welcome.html),
you can find the `.rma6` files that are accepted by MEGAN under
`metagenomic_classification/malt/`. The log file containing the information
printed to screen while MALT is running can also be found in this directory.

As we ran the HOPS pipeline (primarily the MaltExtract tool), we can look in
`MaltExtract/results/` to find all the corresponding output files for the
authentication validation of the metagenomic screening (against the taxa you
specified in your `--maltextract_taxon_list`). First you can check the
`heatmap_overview_Wevid.pdf` summary PDF from HOPS (again you will need to
either mount the server or download), but to get the actual per-sample/taxon
damage patterns etc., you can look in `pdf_candidate_profiles`.
In some cases there may be valid results that the HOPS 'postprocessing' script
doesn't pick up. In these cases you can go into the `default` directory to find
all the raw text files which you can use to visualise and assess the
authentication results yourself.

Finally, if you want to re-run the taxonomic classification with a new database
or tool, you can find the FASTQ files containing only the unmapped reads that
went into MALT in `samtools/filter`. In here you will find files ending in
`unmapped.fastq.gz` for each library.

#### Tutorial Metagenomics - Clean up

Finally, I would recommend cleaning up your `work/` directory of any
intermediate files (if your `-profile` does not already do so). You can do this
by changing to the directory that contains your `results/` and `work/`
directories, e.g.

```bash
cd ///projectX_screening20200720
```

and running

```bash
nextflow clean -f -k
```

#### Tutorial Metagenomics - Summary

In this tutorial we have described an example of how to set up a metagenomic
screening run of ancient microbiome samples. We have covered how to set up
nf-core/eager to extract off-target reads in a form that can be used for MALT,
and how to additionally run HOPS to authenticate expected taxa to be found in
the human oral microbiome. Finally we have also described what to look for in
the MultiQC run summary report and where to find output files that can be used
for downstream analysis.

### Tutorial - How to set up nf-core/eager for pathogen genomics

#### Tutorial Pathogen Genomics - Introduction

This tutorial will give a basic example of how to set up nf-core/eager to
perform bacterial genome reconstruction from samples in the context of ancient
pathogenomic research.

> :warning: this tutorial does not describe how to install and set up
> nf-core/eager. For this please see other documentation on the
> [nf-co.re](https://nf-co.re/usage/installation) website.

We will describe how to set up mapping of ancient pathogen samples against the
reference genome of a targeted organism, in order to check sequencing and
library quality, calculate depth and breadth of coverage, check for damage
profiles, generate feature-annotation statistics (e.g. for gene presence and
absence), call SNPs, and produce a SNP alignment for use in downstream
phylogenetic analysis.

I will use as an example data from [Andrades Valtueña et al
2017](https://doi.org/10.1016/j.cub.2017.10.025), who retrieved Late
Neolithic/Bronze Age _Yersinia pestis_ genomes. This data is **very large
shotgun data** and is not ideal for testing, as running it will require a lot of
computing resources and time, so running on your own data is recommended.
However, note the same procedure can equally be applied to shallow-shotgun and
also whole-genome enrichment data, so other than the TSV file you can apply the
command explained below.

> :warning: Please be aware that the settings used in this tutorial may not use
> settings nor produce files you would actually use in 'real' analysis. The
> settings are only specified for demonstration purposes. Please consult your
> colleagues, communities and the literature for optimal parameters.

#### Tutorial Pathogen Genomics - Preparation

Prior to setting up the nf-core/eager run, we will need:

1. Raw sequencing data in FASTQ format
2. Reference genome in FASTA format, with associated pre-made `bwa`, `samtools`
   and `picard SequenceDictionary` indices (however note these can be made for
   you with nf-core/eager, but this can make a pipeline run take much longer!)
3. A GFF file of gene sequence annotations (normally supplied with reference
   genomes downloaded from NCBI Genomes, in this context from
   [here](https://www.ncbi.nlm.nih.gov/genome/?term=Yersinia+pestis))
4. [Optional] Previously made VCF GATK 3.5 files (see below for settings), of
   previously published _Y. pestis_ genomes.
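For item 2, the three indices could be pre-built along the following lines. This is only a sketch: it assumes `bwa`, `samtools` and a `picard` wrapper command (as provided by e.g. a conda installation) are available on your `PATH`, and `index_reference` is a hypothetical helper name. The `R=`/`O=` argument style is the older Picard syntax, so check what your installed version expects.

```bash
# Build the bwa, samtools and Picard indices for a reference FASTA.
#   $1  path to the reference FASTA
#   $2  path for the sequence dictionary (optional; defaults to the FASTA
#       path with its last extension replaced by .dict)
index_reference() {
    local fasta="$1"
    local dict="${2:-${fasta%.*}.dict}"

    # bwa writes its index files next to the FASTA; the directory that holds
    # them is what gets passed to --bwa_index.
    bwa index "${fasta}" &&
        # Produces <fasta>.fai, used for --fasta_index.
        samtools faidx "${fasta}" &&
        # Produces the sequence dictionary, used for --seq_dict.
        picard CreateSequenceDictionary R="${fasta}" O="${dict}"
}
```

For the reference used later in this tutorial, which keeps the `.fa` in the dictionary name, you would pass the second argument explicitly, e.g. `index_reference ref.fa ref.fa.dict` (with `ref.fa` standing in for your own file).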

We should also ensure we have the very latest version of the nf-core/eager
pipeline so we have all latest bugfixes etc. In this case we will be using
nf-core/eager version 2.2.0. You should always check on the
[nf-core](https://nf-co.re/eager) website whether a newer release has been made
(particularly point releases e.g. 2.2.1).

```bash
nextflow pull nf-core/eager -r 2.2.0
```

It is important to note that if you are planning on running multiple runs of
nf-core/eager for a given project, the version should be **kept the same** for
all runs to ensure consistency in settings for all of your libraries.

#### Tutorial Pathogen Genomics - Inputs and Outputs

To start, let's make a directory where all your nf-core/eager related files for
this run will go, and change into it.

```bash
mkdir projectX_preprocessing20200727
cd projectX_preprocessing20200727
```

The first part of constructing any nf-core/eager run is specifying a few generic
parameters that will often be common across all runs. This will be which
pipeline, version and _profile_ we will use. We will also specify a unique name
of the run to help us keep track of all the nf-core/eager runs you may be
running.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
<...>
```

For the `-profile` parameter, I have indicated that I wish to use Singularity as
my software container environment, and I will use the MPI-SHH institutional
config as listed on
[nf-core/configs](https://github.com/nf-core/configs/blob/master/conf/shh.config).
These profiles specify settings optimised for the specific cluster/institution,
such as maximum memory available or which scheduler queues to submit to. More
explanations about configs and profiles can be seen in the [nf-core
website](https://nf-co.re/usage/configuration) and the [profile
tutorial](#tutorial---what-are-profiles-and-how-to-use-them).

Next we need to specify our input data.
nf-core/eager can accept input FASTQs files in two main ways, either with direct paths to files (with wildcards), or with a Tab-Separate-Value (TSV) file which contains the paths and extra metadata. In this example, we will use the TSV method, as to simulate a realistic use-case, such as both receiving single-end and paired-end data from Illumina NextSeq _and_ Illumina HiSeqs of double-stranded libraries. Illumina NextSeqs sequence a given library across four different 'lanes', so for each library you will receive four FASTQ files. Sometimes samples will be sequenced across multiple HiSeq lanes to maintain complexity to improve imaging by of base calls. The TSV input method is more useful for this context, as it allows 'merging' of these lanes after preprocessing prior mapping (whereas direct paths will consider each pair of FASTQ files as independent libraries/samples). ```bash Sample_Name Library_ID Lane Colour_Chemistry SeqType Organism Strandedness UDG_Treatment R1 R2 BAM KunilaII KunilaII_nonUDG 4 4 PE Yersinia pestis double none ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2112547/ERR2112547_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2112547/ERR2112547_2.fastq.gz NA KunilaII KunilaII_UDG 4 4 PE Yersinia pestis double full ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2112548/ERR2112548_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2112548/ERR2112548_2.fastq.gz NA 6Post 6Post_PE 1 2 PE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2112549/ERR2112549_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2112549/ERR2112549_2.fastq.gz NA 6Post 6Post_PE 2 2 PE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2112550/ERR2112550_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2112550/ERR2112550_2.fastq.gz NA 6Post 6Post_PE 3 2 PE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2112551/ERR2112551_1.fastq.gz 
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2112551/ERR2112551_2.fastq.gz NA
6Post 6Post_PE 4 2 PE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2112552/ERR2112552_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2112552/ERR2112552_2.fastq.gz NA
6Post 6Post_SE 1 4 SE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2112569/ERR2112569.fastq.gz NA NA
6Post 6Post_SE 2 4 SE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2112570/ERR2112570.fastq.gz NA NA
6Post 6Post_SE 3 4 SE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2112571/ERR2112571.fastq.gz NA NA
6Post 6Post_SE 4 4 SE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2112572/ERR2112572.fastq.gz NA NA
6Post 6Post_SE 8 4 SE Yersinia pestis double half ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/003/ERR2112573/ERR2112573.fastq.gz NA NA
```

> Note we also have a mixture of non-UDG and half-UDG treated libraries.

You can see that we have a single line for each set of FASTQ files representing each `Lane`, but the `Sample_Name` and `Library_ID` columns identify and group them together accordingly. Secondly, as we have NextSeq data, we have specified we have `2` for `Colour_Chemistry` vs `4` for HiSeq; something that is important for downstream processing (see below). See the nf-core/eager parameter documentation above for more specifications on how to set up a TSV file (e.g. why despite NextSeqs only having 4 lanes, we can also go up to 8 or more when having a sample sequenced on two NextSeq runs).

Alongside our input TSV file, we will also specify the paths to our reference FASTA file and the corresponding indices.
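
As an aside, the way `Sample_Name`, `Library_ID` and `Lane` group rows together can be sketched in a few lines of Python. This only illustrates the grouping idea and is not how nf-core/eager parses the file internally. The helper name `group_lanes`, the placeholder file names in `ROWS`, and the rule that a lane number may appear only once per library are all assumptions made for this sketch.

```python
import csv
import io
from collections import defaultdict

# Column names as they appear in the header of the example TSV above.
COLUMNS = [
    "Sample_Name", "Library_ID", "Lane", "Colour_Chemistry", "SeqType",
    "Organism", "Strandedness", "UDG_Treatment", "R1", "R2", "BAM",
]


def group_lanes(tsv_text):
    """Return {(Sample_Name, Library_ID): [lanes]} for a TSV given as text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lanes = defaultdict(list)
    for row in reader:
        key = (row["Sample_Name"], row["Library_ID"])
        # Assumption of this sketch: a lane number is unique within a library.
        if row["Lane"] in lanes[key]:
            raise ValueError(f"lane {row['Lane']} listed twice for {key}")
        lanes[key].append(row["Lane"])
    return dict(lanes)


# Hypothetical three-row TSV; the FASTQ names are placeholders.
ROWS = [
    ["6Post", "6Post_PE", "1", "2", "PE", "Yersinia pestis", "double", "half",
     "pe_L1_R1.fq.gz", "pe_L1_R2.fq.gz", "NA"],
    ["6Post", "6Post_PE", "2", "2", "PE", "Yersinia pestis", "double", "half",
     "pe_L2_R1.fq.gz", "pe_L2_R2.fq.gz", "NA"],
    ["6Post", "6Post_SE", "1", "4", "SE", "Yersinia pestis", "double", "half",
     "se_L1.fq.gz", "NA", "NA"],
]
EXAMPLE = "\n".join("\t".join(r) for r in [COLUMNS] + ROWS) + "\n"

for (sample, library), lane_list in group_lanes(EXAMPLE).items():
    print(sample, library, "lanes:", ",".join(lane_list))
```

Running something like this on your own TSV before launching is a cheap way to spot a mistyped `Library_ID` that would otherwise split one library into two groups. With the TSV in place, the reference paths are added to the command:
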
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
<...>
```

We specify the paths to each reference genome and its corresponding tool-specific index. Paths should always be encapsulated in quotes to ensure Nextflow evaluates them, rather than your shell! Also note that as `bwa` generates multiple index files, nf-core/eager takes a _directory_ that must contain these indices instead.

> Note the difference between single and double `-` parameters. The former
> represent Nextflow flags, while the latter are nf-core/eager specific flags.

Finally, we can also specify the output directory and the Nextflow `work/` directory (which contains 'intermediate' working files and directories).

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
<...>
```

#### Tutorial Pathogen Genomics - Pipeline Configuration

Now that we have specified the input data, we can start moving onto specifying settings for each different module we will be running. As mentioned above, some of our samples were generated as NextSeq data, which is generated with a two-colour imaging technique.
What this means is when you have shorter molecules than the number of cycles of the sequencing chemistry, the sequencer will repeatedly see 'G' calls (no colour) at the last few cycles, and you get long poly-G 'tails' on your reads. We therefore will turn on the poly-G clipping functionality offered by [`fastp`](https://github.com/OpenGene/fastp), and any pairs of files indicated in the TSV file as having `2` in the `Colour_Chemistry` column will be passed to `fastp` (the HiSeq data will not). We will not change the default minimum length of a poly-G string to be clipped.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
<...>
```

We then need to specify the mapping parameters for this run. Typically, to account for damage of very old aDNA libraries and also sometimes for evolutionary divergence of the ancient genome to the modern reference, we should relax the mapping thresholds that specify how many mismatches a read can have from the reference to be considered 'mapped'. We will also speed up the seeding step of the seed-and-extend approach by specifying the length of the seed. We will do this with `--bwaalnn` and `--bwaalnl` respectively.
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
<...>
```

As we are also interested in checking for gene presence/absence (see below), we will ensure no mapping quality filter is applied (to account for gene duplication that may cause a read to map equally to two places) by setting the threshold to 0. In addition, we will discard unmapped reads to reduce our hard-drive footprint.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
<...>
```

While some of our input data is paired-end, we will keep with the default of Picard's MarkDuplicates for duplicate removal, as DeDup takes into account both the start and end of a _merged_ read before flagging it as a duplicate - something that isn't valid for a single-end read (where the true end of the molecule might not have been sequenced). We can then specify which dedupper we want to use with `--dedupper`.
While we are using the default (which does not need to be directly specified), we will put it explicitly in our command for clarity.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
<...>
```

Alongside making a SNP table for downstream phylogenetic analysis (we will get to this in a bit), you may be interested in generating some summary statistics of annotated parts of your reference genome, e.g. to see whether certain virulence factors are present or absent. nf-core/eager offers some basic statistics (percent and depth coverage) of these via Bedtools. We will therefore turn on this module and specify the GFF file we downloaded alongside our reference fasta. Note that this GFF file has a _lot_ of redundant data, so often a custom BED file with just genes of interest is recommended.
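
One way to build such a custom BED file is to filter the GFF down to the genes you care about. The sketch below is a minimal illustration, not part of nf-core/eager: the function name `gff_to_bed`, the contig names, coordinates and IDs in `EXAMPLE_GFF`, and the choice of `pla` and `caf1` as genes of interest are all made up for the example. It also assumes the gene name sits in a `Name=` (or `gene=`) attribute, which you should check against your own GFF file.

```python
def gff_to_bed(gff_lines, wanted, feature_type="gene"):
    """Convert matching GFF3 features to 4-column BED lines."""
    bed_lines = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature_type:
            continue
        # Column 9 holds 'key=value' pairs separated by ';'.
        attrs = dict(p.split("=", 1) for p in fields[8].split(";") if "=" in p)
        name = attrs.get("Name") or attrs.get("gene") or attrs.get("ID", ".")
        if name not in wanted:
            continue
        # GFF3 is 1-based and end-inclusive; BED is 0-based and end-exclusive.
        start, end = int(fields[3]) - 1, int(fields[4])
        bed_lines.append(f"{fields[0]}\t{start}\t{end}\t{name}")
    return bed_lines


# Hypothetical GFF3 records: contig names, coordinates and IDs are made up.
EXAMPLE_GFF = [
    "##gff-version 3",
    "plasmidA\tRefSeq\tgene\t101\t1039\t.\t+\t.\tID=gene-1;Name=pla",
    "plasmidA\tRefSeq\tCDS\t101\t1039\t.\t+\t0\tID=cds-1;Name=pla",
    "plasmidB\tRefSeq\tgene\t500\t1012\t.\t-\t.\tID=gene-2;Name=caf1",
    "chrom\tRefSeq\tgene\t1\t300\t.\t+\t.\tID=gene-3;Name=other",
]

for bed in gff_to_bed(EXAMPLE_GFF, wanted={"pla", "caf1"}):
    print(bed)
```

The resulting lines can be written to a `.bed` file and passed to `--anno_file` in place of the full GFF. Keeping only `gene` features avoids counting the same region several times (once each for gene, CDS, and so on).
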
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
--run_bedtools_coverage \
--anno_file '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.gff' \
<...>
```

Next, we will set up trimming of the mapped reads to alleviate the effects of DNA damage during genotyping. To do this we will activate trimming with `--run_trim_bam`. The libraries in this example underwent either no or 'half'-UDG treatment. The latter will generally restrict all remaining DNA damage to the first 2 base pairs of a fragment. We will therefore use `--bamutils_clip_double_stranded_half_udg_left` and `--bamutils_clip_double_stranded_half_udg_right` to trim 2 bp on either side of each fragment. For the non-UDG treated libraries we can trim a little more to remove most damage with the `--bamutils_clip_double_stranded_none_udg_<*>` variants of the flag. Note that there is a tendency in ancient pathogenomics to trim damage _prior_ to mapping, as it allows mapping with stricter parameters to improve removal of reads deriving from potential evolutionary diverged contaminants (this can be done in nf-core/eager with the Bowtie2 aligner); however, we do BAM trimming instead here as another demonstration of functionality.
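
To make the effect of per-treatment trimming concrete, here is a small Python sketch. It is only an illustration of the idea and not the pipeline's implementation: the names `CLIP_BY_UDG` and `mask_read_ends` are hypothetical, the sketch masks bases with `N` instead of editing a BAM file, and the `full` entry (no trimming) is an assumption. The `half` and `none` values are the ones used in this tutorial's command.

```python
# (left, right) clip lengths in bp per UDG treatment, for double-stranded
# libraries. 'half' and 'none' follow this tutorial's command; treating
# 'full' as "no clipping" is an assumption of this sketch.
CLIP_BY_UDG = {"none": (3, 3), "half": (2, 2), "full": (0, 0)}


def mask_read_ends(seq, udg_treatment):
    """Replace the damaged ends of a read with 'N', keeping its length."""
    left, right = CLIP_BY_UDG[udg_treatment]
    n = len(seq)
    # Never mask more bases than the read actually has.
    left = min(left, n)
    right = min(right, n - left)
    return "N" * left + seq[left:n - right] + "N" * right


read = "TTGACCGTAAGCA"  # made-up 13 bp read
for treatment in ("none", "half", "full"):
    print(treatment, mask_read_ends(read, treatment))
```

Masking keeps the read length unchanged, so the positions of the remaining bases relative to the reference are not affected; only the bases most likely to carry damage are hidden from the genotyper.
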
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
--run_bedtools_coverage \
--anno_file '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.gff' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--bamutils_clip_double_stranded_none_udg_left 3 \
--bamutils_clip_double_stranded_none_udg_right 3 \
<...>
```

Here we will use MultiVCFAnalyzer for the generation of our SNP table. A MultiVCFAnalyzer SNP table allows downstream assessment of the level of multi-allelic positions, something not expected when dealing with a single-ploidy organism, and which may therefore reflect cross-mapping from multiple strains, environmental relatives or other contaminants.

For this we need to run genotyping, but specifically with GATK UnifiedGenotyper 3.5 (as MultiVCFAnalyzer requires this particular format of VCF files). We will therefore turn on genotyping, and check ploidy is set to 2 so 'heterozygous' positions can be reported. We will also need to specify that we want to use the trimmed BAMs from the previous step.
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--bamutils_clip_double_stranded_none_udg_left 3 \
--bamutils_clip_double_stranded_none_udg_right 3 \
--run_bedtools_coverage \
--anno_file '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.gff' \
--run_genotyping \
--genotyping_tool 'ug' \
--genotyping_source 'trimmed' \
--gatk_ploidy 2 \
--gatk_ug_mode 'EMIT_ALL_SITES' \
--gatk_ug_genotype_model 'SNP' \
<...>
```

Finally we can set up MultiVCFAnalyzer itself. First we want to make sure we specify that we want to report the frequency of the given called allele at each position so we can assess cross mapping. Then, often with ancient pathogens, such as _Y. pestis_, we also want to include in the SNP table comparative data from previously published and ancient genomes. For this we specify additional VCF files that have been generated in previous runs with the same settings and reference genome. We can do this as follows.
```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--bamutils_clip_double_stranded_none_udg_left 3 \
--bamutils_clip_double_stranded_none_udg_right 3 \
--run_bedtools_coverage \
--anno_file '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.gff' \
--run_genotyping \
--genotyping_tool 'ug' \
--genotyping_source 'trimmed' \
--gatk_ploidy 2 \
--gatk_ug_mode 'EMIT_ALL_SITES' \
--gatk_ug_genotype_model 'SNP' \
--run_multivcfanalyzer \
--write_allele_frequencies \
--min_base_coverage 5 \
--min_allele_freq_hom 0.9 \
--min_allele_freq_het 0.1 \
--additional_vcf_files '../vcfs/*.vcf.gz'
```

For the two `min_allele_freq` parameters we specify that anything above 90% frequency is considered 'homozygous', and anything above 10% (but below 90%) is considered an ambiguous call and the frequency will be reported. Note that you would not normally use this SNP table with these parameters for downstream phylogenetic analysis, as the table will include ambiguous IUPAC codes, making it only useful for fine-comb checking of multi-allelic positions. Instead, set both parameters to the same value (e.g. 0.8) and use that table for downstream phylogenetic analysis.

With this, we are ready to submit!
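
The interplay of the two `min_allele_freq` thresholds described above can be sketched as a tiny decision function. This is a simplified illustration of the logic as described in this tutorial, not MultiVCFAnalyzer's actual code: the function name `classify_call`, the label strings, the treatment of values exactly on a threshold, and the handling of frequencies below the lower threshold are assumptions.

```python
def classify_call(allele_freq, min_hom=0.9, min_het=0.1):
    """Label a call by the frequency (0-1) of the called allele."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    if min_het > min_hom:
        raise ValueError("min_het must not exceed min_hom")
    if allele_freq >= min_hom:
        return "homozygous"
    if allele_freq >= min_het:
        # Between the two thresholds: reported as an ambiguous call.
        return "ambiguous"
    # Assumption: anything lower is simply not supported as a call.
    return "below_threshold"


# Thresholds from the command above (0.9 / 0.1) ...
for freq in (0.95, 0.50, 0.05):
    print(freq, classify_call(freq))

# ... versus both thresholds set to the same value for phylogenetics.
for freq in (0.95, 0.50, 0.05):
    print(freq, classify_call(freq, min_hom=0.8, min_het=0.8))
```

With both thresholds set to the same value there is no window left between them, so no call can ever be labelled ambiguous. That is why the single-threshold table is the one to carry forward into phylogenetic analysis.
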
If running on a remote cluster/server, make sure to run this in a `screen` session or similar, so that if you get an `ssh` signal drop or want to log off, Nextflow will not crash.

#### Tutorial Pathogen Genomics - Results

Assuming the run completed without any crashes (if problems do occur, check against [parameters](https://nf-co.re/eager/parameters) that all parameters are as expected, or check the [FAQ](#troubleshooting-and-faqs)), we can now check our results in `results/`.

##### Tutorial Pathogen Genomics - MultiQC Report

In here there are many different directories containing different output files. The first directory to check is the `MultiQC/` directory. You should find a `multiqc_report.html` file. You will need to view this in a web browser, so I recommend either mounting your server to your file browser, or downloading it to your own local machine (PC/Laptop etc.).

Once you've opened this you can go through each section and evaluate all the results. For example, I normally look for things like:

General Stats Table:

* Do I see the expected number of raw sequencing reads (summed across each set of FASTQ files per library) that was requested for sequencing?
* Does the percentage of trimmed reads look normal for aDNA, and do lengths after trimming look short as expected of aDNA?
* Do the Endogenous DNA (%) columns look reasonable (high enough to indicate you have received enough coverage for downstream analysis, and/or do you lose an unusually high number of reads after filtering)?
* Does ClusterFactor or '% Dups' look high (e.g. >2 or >10% respectively - high values suggesting over-amplified or badly preserved samples i.e. low complexity; note that genome-enrichment libraries may by their nature look higher)?
* Do you see an increased frequency of C>Ts on the 5' end of molecules in the mapped reads?
* Do median read lengths look relatively low (normally <= 100 bp) indicating typically fragmented aDNA?
* Does the % coverage decrease relatively gradually at each depth coverage, rather than dropping off drastically?
* Do the Median coverage and percent >3x (or whatever you set) show sufficient coverage for reliable SNP calls, and is a good proportion of the genome covered, indicating you have the right reference genome?
* Do you see a high proportion of % Hets, indicating many multi-allelic sites (and possibly presence of cross-mapping from other species, that may lead to false positive or less confident SNP calls)?

FastQC (pre-AdapterRemoval):

* Do I see any very early drop off of sequence quality scores suggesting a problematic sequencing run?
* Do I see outlier GC content distributions?
* Do I see high sequence duplication levels?

AdapterRemoval:

* Do I see high numbers of singletons or discarded read pairs?

FastQC (post-AdapterRemoval):

* Do I see improved sequence quality scores along the length of reads?
* Do I see reduced adapter content levels?

Samtools Flagstat (pre/post Filter):

* Do I see outliers, e.g. with unusually low levels of mapped reads (indicative of badly preserved samples), that require closer downstream assessment?

DeDup/Picard MarkDuplicates:

* Do I see large numbers of duplicates being removed, possibly indicating over-amplified or badly preserved samples?

PreSeq:

* Do I see a large drop off of a sample's curve away from the theoretical complexity? If so, this may indicate it's not worth performing deeper sequencing as you will get few unique reads (vs. duplicates that are not any more informative than the reads you've already sequenced).

DamageProfiler:

* Do I see evidence of damage on the microbial DNA (i.e. a % C>T of more than ~5% in the first few nucleotide positions)? If not, possibly your mapped reads are deriving from modern contamination.

QualiMap:

* Do you see a peak of coverage (X) at a good level, e.g. >= 3x, indicating sufficient coverage for reliable SNP calls?

MultiVCFAnalyzer:

* Do I have a good number of called SNPs that suggest the samples have genomes with sufficient nucleotide diversity to inform phylogenetic analysis?
* Do you have a large number of discarded SNP calls?
* Are the % Hets very high, indicating possible cross-mapping from off-target organisms that may confound variant calling?

> Detailed documentation and descriptions for all MultiQC modules can be seen in
> the 'Documentation' folder of the results directory or here in the [output
> documentation](output.md)

If you're happy everything looks good in terms of sequencing, we then look at specific directories to find any files you might want to use for downstream processing.

Note that when you get back to writing up your publication, all the versions of the tools can be found under the 'nf-core/eager Software Versions' section of the MultiQC report. Note that all tools in the container are listed, so you may have to remove some of them that you didn't actually use in the set up.

For example, in the example above, we have used: Nextflow, nf-core/eager, FastQC, AdapterRemoval, fastp, BWA, Samtools, endorS.py, Picard MarkDuplicates, Bedtools, Qualimap, PreSeq, DamageProfiler, MultiVCFAnalyzer and MultiQC.

Citations to all used tools can be seen [here](https://nf-co.re/eager#tool-references).

##### Tutorial Pathogen Genomics - Files for Downstream Analysis

You will find the most relevant output files in your `results/` directory. Each directory generally corresponds to a specific step or tool of the pipeline. Most importantly you should look in `deduplication` for your de-duplicated BAM files (e.g. for viewing in IGV), `bedtools` for depth (X) and breadth (%) coverages of annotations of your reference (e.g. genes), and `multivcfanalyzer` for final SNP tables etc. that can be used for downstream phylogenetic applications.

#### Tutorial Pathogen Genomics - Clean up

Finally, I would recommend cleaning up your `work/` directory of any intermediate files (if your `-profile` does not already do so). You can do this by going to the directory above your `results/` and `work/` directories, e.g.

```bash
cd ///projectX_preprocessing20200727
```

and running

```bash
nextflow clean -f -k
```

#### Tutorial Pathogen Genomics - Summary

In this tutorial we have described an example of how to set up an nf-core/eager run to process microbial aDNA for a relatively standard pathogen genomics study for phylogenetics and basic functional screening. This includes performing some simple quality control checks, mapping, genotyping, and SNP table generation for downstream analysis of the data. Additionally, we described what to look for in the run summary report generated by MultiQC and where to find output files that can be used for downstream analysis.

================================================
FILE: environment.yml
================================================
# You can use this file to create a conda environment for this pipeline:
#   conda env create -f environment.yml
name: nf-core-eager-2.5.3
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - conda-forge::python=3.9.4
  - conda-forge::markdown=3.3.4
  - conda-forge::pymdown-extensions=8.2
  - conda-forge::pygments=2.14.0
  - bioconda::rename=1.601
  - conda-forge::openjdk=8.0.144 # Don't upgrade - required for GATK
  - bioconda::fastqc=0.11.9
  - bioconda::adapterremoval=2.3.2
  - bioconda::adapterremovalfixprefix=0.0.5
  - bioconda::bwa=0.7.17
  - bioconda::picard=2.26.0
  - bioconda::samtools=1.12
  - bioconda::dedup=0.12.8
  - bioconda::angsd=0.935
  - bioconda::circularmapper=1.93.5
  - bioconda::gatk4=4.2.0.0
  - bioconda::gatk=3.5 ## Don't upgrade - required for MultiVCFAnalyzer
  - bioconda::qualimap=2.2.2d
  - bioconda::vcf2genome=0.91
  - bioconda::damageprofiler=0.4.9 # Don't upgrade - later versions don't allow java 8
  - bioconda::multiqc=1.16
  - bioconda::pmdtools=0.60
  -
bioconda::bedtools=2.30.0 - conda-forge::libiconv=1.16 - conda-forge::pigz=2.6 - bioconda::sequencetools=1.5.2 - bioconda::preseq=3.1.2 - bioconda::fastp=0.20.1 - bioconda::bamutil=1.0.15 - bioconda::mtnucratio=0.7 - bioconda::pysam=0.16.0 - bioconda::kraken2=2.1.2 - conda-forge::pandas=1.2.4 - bioconda::freebayes=1.3.5 - bioconda::sexdeterrmine=1.1.2 - bioconda::multivcfanalyzer=0.85.2 - bioconda::hops=0.35 - bioconda::malt=0.61 - conda-forge::biopython=1.79 - conda-forge::xopen=1.1.0 - bioconda::bowtie2=2.4.4 - bioconda::eigenstratdatabasetools=1.0.2 - bioconda::mapdamage2=2.2.1 - bioconda::bbmap=38.92 - bioconda::bcftools=1.12 ================================================ FILE: lib/Checks.groovy ================================================ import org.yaml.snakeyaml.Yaml /* * This file holds several functions used to perform standard checks for the nf-core pipeline template. */ class Checks { static void check_conda_channels(log) { Yaml parser = new Yaml() def channels = [] try { def config = parser.load("conda config --show channels".execute().text) channels = config.channels } catch(NullPointerException | IOException e) { log.warn "Could not verify conda channel configuration." 
return } // Check that all channels are present def required_channels = ['conda-forge', 'bioconda', 'defaults'] def conda_check_failed = !required_channels.every { ch -> ch in channels } // Check that they are in the right order conda_check_failed |= !(channels.indexOf('conda-forge') < channels.indexOf('bioconda')) conda_check_failed |= !(channels.indexOf('bioconda') < channels.indexOf('defaults')) if (conda_check_failed) { log.warn "=============================================================================\n" + " There is a problem with your Conda configuration!\n\n" + " You will need to set-up the conda-forge and bioconda channels correctly.\n" + " Please refer to https://bioconda.github.io/user/install.html#set-up-channels\n" + " NB: The order of the channels matters!\n" + "===================================================================================" } } static void aws_batch(workflow, params) { if (workflow.profile.contains('awsbatch')) { assert (params.awsqueue && params.awsregion) : "Specify correct --awsqueue and --awsregion parameters on AWSBatch!" // Check outdir paths to be S3 buckets if running on AWSBatch // related: https://github.com/nextflow-io/nextflow/issues/813 assert params.outdir.startsWith('s3:') : "Outdir not on S3 - specify S3 Bucket to run on AWSBatch!" // Prevent trace files to be stored on S3 since S3 does not support rolling files. assert !params.tracedir.startsWith('s3:') : "Specify a local tracedir or run without trace! S3 cannot be used for tracefiles." 
} } static void hostname(workflow, params, log) { Map colors = Headers.log_colours(params.monochrome_logs) if (params.hostnames) { def hostname = "hostname".execute().text.trim() params.hostnames.each { prof, hnames -> hnames.each { hname -> if (hostname.contains(hname) && !workflow.profile.contains(prof)) { log.info "=${colors.yellow}====================================================${colors.reset}=\n" + "${colors.yellow}WARN: You are running with `-profile $workflow.profile`\n" + " but your machine hostname is ${colors.white}'$hostname'${colors.reset}.\n" + " ${colors.yellow_bold}Please use `-profile $prof${colors.reset}`\n" + "=${colors.yellow}====================================================${colors.reset}=" } } } } } // Citation string private static String citation(workflow) { return "If you use ${workflow.manifest.name} for your analysis please cite:\n\n" + "* The pipeline\n" + " https://doi.org/10.1101/2020.06.11.145615\n\n" + "* The nf-core framework\n" + " https://dx.doi.org/10.1038/s41587-020-0439-x\n" + " https://rdcu.be/b1GjZ\n\n" + "* Software dependencies\n" + " https://github.com/${workflow.manifest.name}/blob/master/CITATIONS.md" } } ================================================ FILE: lib/Completion.groovy ================================================ /* * Functions to be run on completion of pipeline */ class Completion { static void email(workflow, params, summary_params, projectDir, log, multiqc_report=[]) { // Set up the e-mail variables def subject = "[$workflow.manifest.name] Successful: $workflow.runName" if (!workflow.success) { subject = "[$workflow.manifest.name] FAILED: $workflow.runName" } def summary = [:] for (group in summary_params.keySet()) { summary << summary_params[group] } def misc_fields = [:] misc_fields['Date Started'] = workflow.start misc_fields['Date Completed'] = workflow.complete misc_fields['Pipeline script file path'] = workflow.scriptFile misc_fields['Pipeline script hash ID'] = workflow.scriptId if 
(workflow.repository) misc_fields['Pipeline repository Git URL'] = workflow.repository if (workflow.commitId) misc_fields['Pipeline repository Git Commit'] = workflow.commitId if (workflow.revision) misc_fields['Pipeline Git branch/tag'] = workflow.revision misc_fields['Nextflow Version'] = workflow.nextflow.version misc_fields['Nextflow Build'] = workflow.nextflow.build misc_fields['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp def email_fields = [:] email_fields['version'] = workflow.manifest.version email_fields['runName'] = workflow.runName email_fields['success'] = workflow.success email_fields['dateComplete'] = workflow.complete email_fields['duration'] = workflow.duration email_fields['exitStatus'] = workflow.exitStatus email_fields['errorMessage'] = (workflow.errorMessage ?: 'None') email_fields['errorReport'] = (workflow.errorReport ?: 'None') email_fields['commandLine'] = workflow.commandLine email_fields['projectDir'] = workflow.projectDir email_fields['summary'] = summary << misc_fields // On success try attach the multiqc report def mqc_report = null try { if (workflow.success) { mqc_report = multiqc_report.getVal() if (mqc_report.getClass() == ArrayList && mqc_report.size() >= 1) { if (mqc_report.size() > 1) { log.warn "[$workflow.manifest.name] Found multiple reports from process 'MULTIQC', will use only one" } mqc_report = mqc_report[0] } } } catch (all) { log.warn "[$workflow.manifest.name] Could not attach MultiQC report to summary email" } // Check if we are only sending emails on failure def email_address = params.email if (!params.email && params.email_on_fail && !workflow.success) { email_address = params.email_on_fail } // Render the TXT template def engine = new groovy.text.GStringTemplateEngine() def tf = new File("$projectDir/assets/email_template.txt") def txt_template = engine.createTemplate(tf).make(email_fields) def email_txt = txt_template.toString() // Render the HTML template def hf = new 
File("$projectDir/assets/email_template.html") def html_template = engine.createTemplate(hf).make(email_fields) def email_html = html_template.toString() // Render the sendmail template def max_multiqc_email_size = params.max_multiqc_email_size as nextflow.util.MemoryUnit def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "$projectDir", mqcFile: mqc_report, mqcMaxSize: max_multiqc_email_size.toBytes()] def sf = new File("$projectDir/assets/sendmail_template.txt") def sendmail_template = engine.createTemplate(sf).make(smail_fields) def sendmail_html = sendmail_template.toString() // Send the HTML e-mail Map colors = Headers.log_colours(params.monochrome_logs) if (email_address) { try { if (params.plaintext_email) { throw GroovyException('Send plaintext e-mail, not HTML') } // Try to send HTML e-mail using sendmail [ 'sendmail', '-t' ].execute() << sendmail_html log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Sent summary e-mail to $email_address (sendmail)-" } catch (all) { // Catch failures and try with plaintext def mail_cmd = [ 'mail', '-s', subject, '--content-type=text/html', email_address ] if ( mqc_report.size() <= max_multiqc_email_size.toBytes() ) { mail_cmd += [ '-A', mqc_report ] } mail_cmd.execute() << email_html log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Sent summary e-mail to $email_address (mail)-" } } // Write summary e-mail HTML to a file def output_d = new File("${params.outdir}/pipeline_info/") if (!output_d.exists()) { output_d.mkdirs() } def output_hf = new File(output_d, "pipeline_report.html") output_hf.withWriter { w -> w << email_html } def output_tf = new File(output_d, "pipeline_report.txt") output_tf.withWriter { w -> w << email_txt } } static void summary(workflow, params, log, fail_percent_mapped=[:], pass_percent_mapped=[:]) { Map colors = Headers.log_colours(params.monochrome_logs) if (workflow.success) { if 
(workflow.stats.ignoredCount == 0) { log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Pipeline completed successfully${colors.reset}-" } else { log.info "-${colors.purple}[$workflow.manifest.name]${colors.red} Pipeline completed successfully, but with errored process(es) ${colors.reset}-" } } else { Checks.hostname(workflow, params, log) log.info "-${colors.purple}[$workflow.manifest.name]${colors.red} Pipeline completed with errors${colors.reset}-" } } } ================================================ FILE: lib/Headers.groovy ================================================ /* * This file holds several functions used to render the nf-core ANSI header. */ class Headers { private static Map log_colours(Boolean monochrome_logs) { Map colorcodes = [:] colorcodes['reset'] = monochrome_logs ? '' : "\033[0m" colorcodes['dim'] = monochrome_logs ? '' : "\033[2m" colorcodes['black'] = monochrome_logs ? '' : "\033[0;30m" colorcodes['green'] = monochrome_logs ? '' : "\033[0;32m" colorcodes['yellow'] = monochrome_logs ? '' : "\033[0;33m" colorcodes['yellow_bold'] = monochrome_logs ? '' : "\033[1;93m" colorcodes['blue'] = monochrome_logs ? '' : "\033[0;34m" colorcodes['purple'] = monochrome_logs ? '' : "\033[0;35m" colorcodes['cyan'] = monochrome_logs ? '' : "\033[0;36m" colorcodes['white'] = monochrome_logs ? '' : "\033[0;37m" colorcodes['red'] = monochrome_logs ? 
'' : "\033[1;91m" return colorcodes } static String dashed_line(monochrome_logs) { Map colors = log_colours(monochrome_logs) return "-${colors.dim}----------------------------------------------------${colors.reset}-" } static String nf_core(workflow, monochrome_logs) { Map colors = log_colours(monochrome_logs) String.format( """\n ${dashed_line(monochrome_logs)} ${colors.green},--.${colors.black}/${colors.green},-.${colors.reset} ${colors.blue} ___ __ __ __ ___ ${colors.green}/,-._.--~\'${colors.reset} ${colors.blue} |\\ | |__ __ / ` / \\ |__) |__ ${colors.yellow}} {${colors.reset} ${colors.blue} | \\| | \\__, \\__/ | \\ |___ ${colors.green}\\`-._,-`-,${colors.reset} ${colors.green}`._,._,\'${colors.reset} ${colors.purple} ${workflow.manifest.name} v${workflow.manifest.version}${colors.reset} ${dashed_line(monochrome_logs)} """.stripIndent() ) } } ================================================ FILE: lib/NfcoreSchema.groovy ================================================ /* * This file holds several functions used to perform JSON parameter validation, help and summary rendering for the nf-core pipeline template. 
*/ import org.everit.json.schema.Schema import org.everit.json.schema.loader.SchemaLoader import org.everit.json.schema.ValidationException import org.json.JSONObject import org.json.JSONTokener import org.json.JSONArray import groovy.json.JsonSlurper import groovy.json.JsonBuilder class NfcoreSchema { /* * Function to loop over all parameters defined in schema and check * whether the given parameters adhere to the specifications */ /* groovylint-disable-next-line UnusedPrivateMethodParameter */ private static void validateParameters(params, jsonSchema, log) { def has_error = false //=====================================================================// // Check for nextflow core params and unexpected params def json = new File(jsonSchema).text def Map schemaParams = (Map) new JsonSlurper().parseText(json).get('definitions') def nf_params = [ // Options for base `nextflow` command 'bg', 'c', 'C', 'config', 'd', 'D', 'dockerize', 'h', 'log', 'q', 'quiet', 'syslog', 'v', 'version', // Options for `nextflow run` command 'ansi', 'ansi-log', 'bg', 'bucket-dir', 'c', 'cache', 'config', 'dsl2', 'dump-channels', 'dump-hashes', 'E', 'entry', 'latest', 'lib', 'main-script', 'N', 'name', 'offline', 'params-file', 'pi', 'plugins', 'poll-interval', 'pool-size', 'profile', 'ps', 'qs', 'queue-size', 'r', 'resume', 'revision', 'stdin', 'stub', 'stub-run', 'test', 'w', 'with-charliecloud', 'with-conda', 'with-dag', 'with-docker', 'with-mpi', 'with-notification', 'with-podman', 'with-report', 'with-singularity', 'with-timeline', 'with-tower', 'with-trace', 'with-weblog', 'without-docker', 'without-podman', 'work-dir' ] def unexpectedParams = [] // Collect expected parameters from the schema def expectedParams = [] for (group in schemaParams) { for (p in group.value['properties']) { expectedParams.push(p.key) } } for (specifiedParam in params.keySet()) { // nextflow params if (nf_params.contains(specifiedParam)) { log.error "ERROR: You used a core Nextflow option with two hyphens:
'--${specifiedParam}'. Please resubmit with '-${specifiedParam}'" has_error = true } // unexpected params def params_ignore = params.schema_ignore_params.split(',') + 'schema_ignore_params' def expectedParamsLowerCase = expectedParams.collect{ it.replace("-", "").toLowerCase() } def specifiedParamLowerCase = specifiedParam.replace("-", "").toLowerCase() if (!expectedParams.contains(specifiedParam) && !params_ignore.contains(specifiedParam) && !expectedParamsLowerCase.contains(specifiedParamLowerCase)) { // Temporarily remove camelCase/camel-case params #1035 def unexpectedParamsLowerCase = unexpectedParams.collect{ it.replace("-", "").toLowerCase()} if (!unexpectedParamsLowerCase.contains(specifiedParamLowerCase)){ unexpectedParams.push(specifiedParam) } } } //=====================================================================// // Validate parameters against the schema InputStream inputStream = new File(jsonSchema).newInputStream() JSONObject rawSchema = new JSONObject(new JSONTokener(inputStream)) // Remove anything that's in params.schema_ignore_params rawSchema = removeIgnoredParams(rawSchema, params) Schema schema = SchemaLoader.load(rawSchema) // Clean the parameters def cleanedParams = cleanParameters(params) // Convert to JSONObject def jsonParams = new JsonBuilder(cleanedParams) JSONObject paramsJSON = new JSONObject(jsonParams.toString()) // Validate try { schema.validate(paramsJSON) } catch (ValidationException e) { println '' log.error 'ERROR: Validation of pipeline parameters failed!' 
JSONObject exceptionJSON = e.toJSON() printExceptions(exceptionJSON, paramsJSON, log) println '' has_error = true } // Check for unexpected parameters if (unexpectedParams.size() > 0) { Map colors = log_colours(params.monochrome_logs) println '' def warn_msg = 'Found unexpected parameters:' for (unexpectedParam in unexpectedParams) { warn_msg = warn_msg + "\n* --${unexpectedParam}: ${params[unexpectedParam].toString()}" } log.warn warn_msg log.info "- ${colors.dim}Ignore this warning: params.schema_ignore_params = \"${unexpectedParams.join(',')}\" ${colors.reset}" println '' } if (has_error) { System.exit(1) } } // Loop over nested exceptions and print the causingException private static void printExceptions(exJSON, paramsJSON, log) { def causingExceptions = exJSON['causingExceptions'] if (causingExceptions.length() == 0) { def m = exJSON['message'] =~ /required key \[([^\]]+)\] not found/ // Missing required param if (m.matches()) { log.error "* Missing required parameter: --${m[0][1]}" } // Other base-level error else if (exJSON['pointerToViolation'] == '#') { log.error "* ${exJSON['message']}" } // Error with specific param else { def param = exJSON['pointerToViolation'] - ~/^#\// def param_val = paramsJSON[param].toString() log.error "* --${param}: ${exJSON['message']} (${param_val})" } } for (ex in causingExceptions) { printExceptions(ex, paramsJSON, log) } } // Remove an element from a JSONArray private static JSONArray removeElement(jsonArray, element){ def list = [] int len = jsonArray.length() for (int i = 0; i < len; i++){ list.add(jsonArray.get(i).toString()) } list.remove(element) JSONArray jsArray = new JSONArray(list) return jsArray } private static JSONObject removeIgnoredParams(rawSchema, params){ // Remove anything that's in params.schema_ignore_params params.schema_ignore_params.split(',').each{ ignore_param -> if(rawSchema.keySet().contains('definitions')){ rawSchema.definitions.each { definition -> for (key in definition.keySet()){ if (definition[key].get("properties").keySet().contains(ignore_param)){ // Remove the param to ignore definition[key].get("properties").remove(ignore_param) // If the param was required, change this if (definition[key].has("required")) { def cleaned_required = removeElement(definition[key].required, ignore_param)
definition[key].put("required", cleaned_required) } } } } } if(rawSchema.keySet().contains('properties') && rawSchema.get('properties').keySet().contains(ignore_param)) { rawSchema.get("properties").remove(ignore_param) } if(rawSchema.keySet().contains('required') && rawSchema.required.contains(ignore_param)) { def cleaned_required = removeElement(rawSchema.required, ignore_param) rawSchema.put("required", cleaned_required) } } return rawSchema } private static Map cleanParameters(params) { def new_params = params.getClass().newInstance(params) for (p in params) { // remove anything evaluating to false if (!p['value']) { new_params.remove(p.key) } // Cast MemoryUnit to String if (p['value'].getClass() == nextflow.util.MemoryUnit) { new_params.replace(p.key, p['value'].toString()) } // Cast Duration to String if (p['value'].getClass() == nextflow.util.Duration) { new_params.replace(p.key, p['value'].toString().replaceFirst(/d(?!\S)/, "day")) } // Cast LinkedHashMap to String if (p['value'].getClass() == LinkedHashMap) { new_params.replace(p.key, p['value'].toString()) } } return new_params } /* * This method tries to read a JSON params file */ private static LinkedHashMap params_load(String json_schema) { def params_map = new LinkedHashMap() try { params_map = params_read(json_schema) } catch (Exception e) { println "Could not read parameters settings from JSON. $e" params_map = new LinkedHashMap() } return params_map } private static Map log_colours(Boolean monochrome_logs) { Map colorcodes = [:] // Reset / Meta colorcodes['reset'] = monochrome_logs ? '' : "\033[0m" colorcodes['bold'] = monochrome_logs ? '' : "\033[1m" colorcodes['dim'] = monochrome_logs ? '' : "\033[2m" colorcodes['underlined'] = monochrome_logs ? '' : "\033[4m" colorcodes['blink'] = monochrome_logs ? '' : "\033[5m" colorcodes['reverse'] = monochrome_logs ? '' : "\033[7m" colorcodes['hidden'] = monochrome_logs ? '' : "\033[8m" // Regular Colors colorcodes['black'] = monochrome_logs ? 
'' : "\033[0;30m" colorcodes['red'] = monochrome_logs ? '' : "\033[0;31m" colorcodes['green'] = monochrome_logs ? '' : "\033[0;32m" colorcodes['yellow'] = monochrome_logs ? '' : "\033[0;33m" colorcodes['blue'] = monochrome_logs ? '' : "\033[0;34m" colorcodes['purple'] = monochrome_logs ? '' : "\033[0;35m" colorcodes['cyan'] = monochrome_logs ? '' : "\033[0;36m" colorcodes['white'] = monochrome_logs ? '' : "\033[0;37m" // Bold colorcodes['bblack'] = monochrome_logs ? '' : "\033[1;30m" colorcodes['bred'] = monochrome_logs ? '' : "\033[1;31m" colorcodes['bgreen'] = monochrome_logs ? '' : "\033[1;32m" colorcodes['byellow'] = monochrome_logs ? '' : "\033[1;33m" colorcodes['bblue'] = monochrome_logs ? '' : "\033[1;34m" colorcodes['bpurple'] = monochrome_logs ? '' : "\033[1;35m" colorcodes['bcyan'] = monochrome_logs ? '' : "\033[1;36m" colorcodes['bwhite'] = monochrome_logs ? '' : "\033[1;37m" // Underline colorcodes['ublack'] = monochrome_logs ? '' : "\033[4;30m" colorcodes['ured'] = monochrome_logs ? '' : "\033[4;31m" colorcodes['ugreen'] = monochrome_logs ? '' : "\033[4;32m" colorcodes['uyellow'] = monochrome_logs ? '' : "\033[4;33m" colorcodes['ublue'] = monochrome_logs ? '' : "\033[4;34m" colorcodes['upurple'] = monochrome_logs ? '' : "\033[4;35m" colorcodes['ucyan'] = monochrome_logs ? '' : "\033[4;36m" colorcodes['uwhite'] = monochrome_logs ? '' : "\033[4;37m" // High Intensity colorcodes['iblack'] = monochrome_logs ? '' : "\033[0;90m" colorcodes['ired'] = monochrome_logs ? '' : "\033[0;91m" colorcodes['igreen'] = monochrome_logs ? '' : "\033[0;92m" colorcodes['iyellow'] = monochrome_logs ? '' : "\033[0;93m" colorcodes['iblue'] = monochrome_logs ? '' : "\033[0;94m" colorcodes['ipurple'] = monochrome_logs ? '' : "\033[0;95m" colorcodes['icyan'] = monochrome_logs ? '' : "\033[0;96m" colorcodes['iwhite'] = monochrome_logs ? '' : "\033[0;97m" // Bold High Intensity colorcodes['biblack'] = monochrome_logs ? '' : "\033[1;90m" colorcodes['bired'] = monochrome_logs ? 
'' : "\033[1;91m" colorcodes['bigreen'] = monochrome_logs ? '' : "\033[1;92m" colorcodes['biyellow'] = monochrome_logs ? '' : "\033[1;93m" colorcodes['biblue'] = monochrome_logs ? '' : "\033[1;94m" colorcodes['bipurple'] = monochrome_logs ? '' : "\033[1;95m" colorcodes['bicyan'] = monochrome_logs ? '' : "\033[1;96m" colorcodes['biwhite'] = monochrome_logs ? '' : "\033[1;97m" return colorcodes } static String dashed_line(monochrome_logs) { Map colors = log_colours(monochrome_logs) return "-${colors.dim}----------------------------------------------------${colors.reset}-" } /* Method to actually read in JSON file using Groovy. Group (as Key), values are all parameters - Parameter1 as Key, Description as Value - Parameter2 as Key, Description as Value .... Group - */ private static LinkedHashMap params_read(String json_schema) throws Exception { def json = new File(json_schema).text def Map schema_definitions = (Map) new JsonSlurper().parseText(json).get('definitions') def Map schema_properties = (Map) new JsonSlurper().parseText(json).get('properties') /* Tree looks like this in nf-core schema * definitions <- this is what the first get('definitions') gets us group 1 title description properties parameter 1 type description parameter 2 type description group 2 title description properties parameter 1 type description * properties <- parameters can also be ungrouped, outside of definitions parameter 1 type description */ // Grouped params def params_map = new LinkedHashMap() schema_definitions.each { key, val -> def Map group = schema_definitions."$key".properties // Gets the property object of the group def title = schema_definitions."$key".title def sub_params = new LinkedHashMap() group.each { innerkey, value -> sub_params.put(innerkey, value) } params_map.put(title, sub_params) } // Ungrouped params def ungrouped_params = new LinkedHashMap() schema_properties.each { innerkey, value -> ungrouped_params.put(innerkey, value) } params_map.put("Other parameters", 
ungrouped_params) return params_map } /* * Get maximum number of characters across all parameter names */ private static Integer params_max_chars(params_map) { Integer max_chars = 0 for (group in params_map.keySet()) { def group_params = params_map.get(group) // This gets the parameters of that particular group for (param in group_params.keySet()) { if (param.size() > max_chars) { max_chars = param.size() } } } return max_chars } /* * Beautify parameters for --help */ private static String params_help(workflow, params, json_schema, command) { Map colors = log_colours(params.monochrome_logs) Integer num_hidden = 0 String output = '' output += 'Typical pipeline command:\n\n' output += " ${colors.cyan}${command}${colors.reset}\n\n" Map params_map = params_load(json_schema) Integer max_chars = params_max_chars(params_map) + 1 Integer desc_indent = max_chars + 14 Integer dec_linewidth = 160 - desc_indent for (group in params_map.keySet()) { Integer num_params = 0 String group_output = colors.underlined + colors.bold + group + colors.reset + '\n' def group_params = params_map.get(group) // This gets the parameters of that particular group for (param in group_params.keySet()) { if (group_params.get(param).hidden && !params.show_hidden_params) { num_hidden += 1 continue; } def type = '[' + group_params.get(param).type + ']' def description = group_params.get(param).description def defaultValue = group_params.get(param).default ? 
" [default: " + group_params.get(param).default.toString() + "]" : '' def description_default = description + colors.dim + defaultValue + colors.reset // Wrap long description texts // Loosely based on https://dzone.com/articles/groovy-plain-text-word-wrap if (description_default.length() > dec_linewidth){ List olines = [] String oline = "" // " " * indent description_default.split(" ").each() { wrd -> if ((oline.size() + wrd.size()) <= dec_linewidth) { oline += wrd + " " } else { olines += oline oline = wrd + " " } } olines += oline description_default = olines.join("\n" + " " * desc_indent) } group_output += " --" + param.padRight(max_chars) + colors.dim + type.padRight(10) + colors.reset + description_default + '\n' num_params += 1 } group_output += '\n' if (num_params > 0){ output += group_output } } output += dashed_line(params.monochrome_logs) if (num_hidden > 0){ output += colors.dim + "\n Hiding $num_hidden params, use --show_hidden_params to show.\n" + colors.reset output += dashed_line(params.monochrome_logs) } return output } /* * Groovy Map summarising parameters/workflow options used by the pipeline */ private static LinkedHashMap params_summary_map(workflow, params, json_schema) { // Get a selection of core Nextflow workflow options def Map workflow_summary = [:] if (workflow.revision) { workflow_summary['revision'] = workflow.revision } workflow_summary['runName'] = workflow.runName if (workflow.containerEngine) { workflow_summary['containerEngine'] = workflow.containerEngine } if (workflow.container) { workflow_summary['container'] = workflow.container } workflow_summary['launchDir'] = workflow.launchDir workflow_summary['workDir'] = workflow.workDir workflow_summary['projectDir'] = workflow.projectDir workflow_summary['userName'] = workflow.userName workflow_summary['profile'] = workflow.profile workflow_summary['configFiles'] = workflow.configFiles.join(', ') // Get pipeline parameters defined in JSON Schema def Map params_summary = [:] def 
blacklist = ['hostnames'] def params_map = params_load(json_schema) for (group in params_map.keySet()) { def sub_params = new LinkedHashMap() def group_params = params_map.get(group) // This gets the parameters of that particular group for (param in group_params.keySet()) { if (params.containsKey(param) && !blacklist.contains(param)) { def params_value = params.get(param) def schema_value = group_params.get(param).default def param_type = group_params.get(param).type if (schema_value != null) { if (param_type == 'string') { if (schema_value.contains('$projectDir') || schema_value.contains('${projectDir}')) { def sub_string = schema_value.replace('\$projectDir', '') sub_string = sub_string.replace('\${projectDir}', '') if (params_value.contains(sub_string)) { schema_value = params_value } } if (schema_value.contains('$params.outdir') || schema_value.contains('${params.outdir}')) { def sub_string = schema_value.replace('\$params.outdir', '') sub_string = sub_string.replace('\${params.outdir}', '') if ("${params.outdir}${sub_string}" == params_value) { schema_value = params_value } } } } // We have a default in the schema, and this isn't it if (schema_value != null && params_value != schema_value) { sub_params.put(param, params_value) } // No default in the schema, and this isn't empty else if (schema_value == null && params_value != "" && params_value != null && params_value != false) { sub_params.put(param, params_value) } } } params_summary.put(group, sub_params) } return [ 'Core Nextflow options' : workflow_summary ] << params_summary } /* * Beautify parameters for summary and return as string */ private static String params_summary_log(workflow, params, json_schema) { Map colors = log_colours(params.monochrome_logs) String output = '' def params_map = params_summary_map(workflow, params, json_schema) def max_chars = params_max_chars(params_map) for (group in params_map.keySet()) { def group_params = params_map.get(group) // This gets the parameters of that 
particular group if (group_params) { output += colors.bold + group + colors.reset + '\n' for (param in group_params.keySet()) { output += " " + colors.blue + param.padRight(max_chars) + ": " + colors.green + group_params.get(param) + colors.reset + '\n' } output += '\n' } } output += dashed_line(params.monochrome_logs) output += colors.dim + "\n Only displaying parameters that differ from defaults.\n" + colors.reset output += dashed_line(params.monochrome_logs) return output } } ================================================ FILE: main.nf ================================================ #!/usr/bin/env nextflow /* ------------------------------------------------------------------------------------------------------------ nf-core/eager ------------------------------------------------------------------------------------------------------------ EAGER Analysis Pipeline. Started 2018-06-05 #### Homepage / Documentation https://github.com/nf-core/eager #### Authors For a list of authors and contributors, see: https://github.com/nf-core/eager/tree/dev#authors-alphabetical ------------------------------------------------------------------------------------------------------------ */ nextflow.enable.dsl=1 log.info Headers.nf_core(workflow, params.monochrome_logs) //////////////////////////////////////////////////// /* -- PRINT HELP -- */ ////////////////////////////////////////////////////+ def json_schema = "$projectDir/nextflow_schema.json" if (params.help) { def command = "nextflow run nf-core/eager --input '*_R{1,2}.fastq.gz' -profile docker" log.info NfcoreSchema.params_help(workflow, params, json_schema, command) exit 0 } //////////////////////////////////////////////////// /* -- VALIDATE PARAMETERS -- */ ////////////////////////////////////////////////////+ if (params.validate_params) { NfcoreSchema.validateParameters(params, json_schema, log) } // Validate BAM input isn't set to paired_end if ( params.bam && !params.single_end ) { exit 1, "[nf-core/eager] error: 
bams can only be specified with --single_end. Please check input command." } // Do not allow input bams to be suffixed with '.unmapped.bam' if (params.bam && params.input.endsWith('.unmapped.bam')) { exit 1, "[nf-core/eager] error: Input BAM file names ending in '.unmapped.bam' are not allowed. Please rename your input BAM(s)." } // Validate that skip_collapse is only set to True for paired_end reads! if (!has_extension(params.input, "tsv") && params.skip_collapse && params.single_end){ exit 1, "[nf-core/eager] error: --skip_collapse can only be set for paired_end samples." } // Validate not trying to both skip collapse and skip trim if ( params.skip_collapse && params.skip_trim ) { exit 1, "[nf-core/eager] error: you have specified to skip both merging and trimming of paired end samples. Use --skip_adapterremoval instead." } // Bedtools validation if( params.run_bedtools_coverage && !params.anno_file ){ exit 1, "[nf-core/eager] error: you have turned on bedtools coverage, but not specified a BED or GFF file with --anno_file. Please validate your parameters." } // Preseq validation if( !params.skip_preseq && !( params.preseq_mode == 'c_curve' || params.preseq_mode == 'lc_extrap' ) ) { exit 1, "[nf-core/eager] error: you are running preseq with an unsupported mode. See documentation for more information. You gave: ${params.preseq_mode}." } // BAM filtering validation if (!params.run_bam_filtering && params.bam_mapping_quality_threshold != 0) { exit 1, "[nf-core/eager] error: please turn on BAM filtering if you want to perform mapping quality filtering! Provide: --run_bam_filtering." } if (params.dedupper == 'dedup' && !params.mergedonly) { log.warn "[nf-core/eager] Warning: you are using DeDup but without specifying --mergedonly for AdapterRemoval, dedup will likely fail! See documentation for more information."
} // Genotyping validation if (params.run_genotyping){ if (params.genotyping_tool == 'pileupcaller' && ( !params.pileupcaller_bedfile || !params.pileupcaller_snpfile ) ) { exit 1, "[nf-core/eager] error: please check your pileupCaller bed file and snp file parameters. You must supply a bed file and a snp file." } if (params.genotyping_tool == 'angsd' && ! ( params.angsd_glformat == 'text' || params.angsd_glformat == 'binary' || params.angsd_glformat == 'binary_three' || params.angsd_glformat == 'beagle' ) ) { exit 1, "[nf-core/eager] error: please check your ANGSD output format! Options: 'text', 'binary', 'binary_three', 'beagle'. Found parameter: --angsd_glformat '${params.angsd_glformat}'." } } // Consensus sequence generation validation if (params.run_vcf2genome) { if (!params.run_genotyping) { exit 1, "[nf-core/eager] error: consensus sequence generation requires genotyping via UnifiedGenotyper to be turned on with the parameters --run_genotyping and --genotyping_tool 'ug'. Please check your genotyping parameters." } if (params.genotyping_tool != 'ug') { exit 1, "[nf-core/eager] error: consensus sequence generation requires genotyping via UnifiedGenotyper to be turned on with the parameters --run_genotyping and --genotyping_tool 'ug'. Found parameter: --genotyping_tool '${params.genotyping_tool}'." } } // MultiVCFAnalyzer validation if (params.run_multivcfanalyzer) { if (!params.run_genotyping) { exit 1, "[nf-core/eager] error: MultiVCFAnalyzer requires genotyping to be turned on with the parameter --run_genotyping. Please check your genotyping parameters." } if (params.genotyping_tool != "ug") { exit 1, "[nf-core/eager] error: MultiVCFAnalyzer only accepts VCF files from GATK UnifiedGenotyper. Found parameter: --genotyping_tool '${params.genotyping_tool}'." } if (params.gatk_ploidy != 2) { exit 1, "[nf-core/eager] error: MultiVCFAnalyzer only accepts VCF files generated with a GATK ploidy set to 2. Found parameter: --gatk_ploidy ${params.gatk_ploidy}."
} if (params.additional_vcf_files) { ch_extravcfs_for_multivcfanalyzer = Channel.fromPath(params.additional_vcf_files, checkIfExists: true) } } if (params.run_metagenomic_screening) { if ( !params.run_bam_filtering ) { exit 1, "[nf-core/eager] error: metagenomic classification can only run on unmapped reads. Please supply --run_bam_filtering --bam_unmapped_type 'fastq'." } if ( params.bam_unmapped_type != "fastq" ) { exit 1, "[nf-core/eager] error: metagenomic classification can only run on unmapped reads. Please supply --bam_unmapped_type 'fastq'. Supplied: --bam_unmapped_type '${params.bam_unmapped_type}'." } if (!params.database) { exit 1, "[nf-core/eager] error: metagenomic classification requires a path to a database directory. Please specify one with --database '/path/to/database/'." } if (params.metagenomic_tool == 'malt' && params.malt_min_support_mode == 'percent' && params.metagenomic_min_support_reads != 1) { exit 1, "[nf-core/eager] error: incompatible MALT min support configuration. Percent can only be used with --malt_min_support_percent. You modified: --metagenomic_min_support_reads." } if (params.metagenomic_tool == 'malt' && params.malt_min_support_mode == 'reads' && params.malt_min_support_percent != 0.01) { exit 1, "[nf-core/eager] error: incompatible MALT min support configuration. Reads can only be used with --metagenomic_min_support_reads. You modified: --malt_min_support_percent." } if (!params.metagenomic_min_support_reads.toString().isInteger()){ exit 1, "[nf-core/eager] error: incompatible min_support_reads configuration. min_support_reads can only be used with integers. Found parameter: --metagenomic_min_support_reads ${params.metagenomic_min_support_reads}." } } // MaltExtract validation if (params.run_maltextract) { if (params.run_metagenomic_screening && params.metagenomic_tool != 'malt') { exit 1, "[nf-core/eager] error: MaltExtract can only accept MALT output. Please supply --metagenomic_tool 'malt'.
Found parameter: --metagenomic_tool '${params.metagenomic_tool}'" } if (!params.maltextract_taxon_list) { exit 1, "[nf-core/eager] error: MaltExtract requires a taxon list specifying the target taxa of interest. Specify the file with --maltextract_taxon_list." } } ///////////////////////////////////////////////////////// /* -- VALIDATE INPUT FILES -- */ ///////////////////////////////////////////////////////// // Set up channels for annotation file if (!params.run_bedtools_coverage){ ch_anno_for_bedtools = Channel.empty() } else { ch_anno_for_bedtools = Channel.fromPath(params.anno_file, checkIfExists: true) .ifEmpty { exit 1, "[nf-core/eager] error: bedtools annotation file not found. Supplied parameter: --anno_file ${params.anno_file}."} } if (params.fasta) { file(params.fasta, checkIfExists: true) lastPath = params.fasta.lastIndexOf(File.separator) lastExt = params.fasta.lastIndexOf(".") fasta_base = params.fasta.substring(lastPath+1) index_base = params.fasta.substring(lastPath+1,lastExt) if (params.fasta.endsWith('.gz')) { fasta_base = params.fasta.substring(lastPath+1,lastExt) index_base = fasta_base.substring(0,fasta_base.lastIndexOf(".")) } } else { exit 1, "[nf-core/eager] error: please specify --fasta with the path to your reference" } // Validate reference inputs if("${params.fasta}".endsWith(".gz")){ process unzip_reference{ tag "${zipped_fasta}" input: path zipped_fasta from file(params.fasta) // path doesn't like it if a string of an object is not prefaced with a root dir (/), so use file() to resolve string before passing to `path` output: path "$unzip" into ch_fasta into
ch_fasta_for_bwaindex,ch_fasta_for_bt2index,ch_fasta_for_faidx,ch_fasta_for_seqdict,ch_fasta_for_circulargenerator,ch_fasta_for_circularmapper,ch_fasta_for_damageprofiler, ch_fasta_for_mapdamage ,ch_fasta_for_qualimap,ch_unmasked_fasta_for_masking,ch_unmasked_fasta_for_pmdtools,ch_fasta_for_genotyping_ug,ch_fasta_for_genotyping_hc,ch_fasta_for_genotyping_freebayes,ch_fasta_for_genotyping_pileupcaller,ch_fasta_for_vcf2genome,ch_fasta_for_multivcfanalyzer,ch_fasta_for_genotyping_angsd,ch_fasta_for_damagerescaling,ch_fasta_for_bcftools_stats script: unzip = zipped_fasta.toString() - '.gz' """ pigz -f -d -p ${task.cpus} $zipped_fasta """ } } else { fasta_for_indexing = Channel .fromPath("${params.fasta}", checkIfExists: true) .into{ ch_fasta_for_bwaindex; ch_fasta_for_bt2index; ch_fasta_for_faidx; ch_fasta_for_seqdict; ch_fasta_for_circulargenerator; ch_fasta_for_circularmapper; ch_fasta_for_damageprofiler; ch_fasta_for_mapdamage; ch_fasta_for_qualimap; ch_unmasked_fasta_for_masking; ch_unmasked_fasta_for_pmdtools; ch_fasta_for_genotyping_ug; ch_fasta_for_genotyping_hc; ch_fasta_for_genotyping_freebayes; ch_fasta_for_genotyping_pileupcaller; ch_fasta_for_vcf2genome; ch_fasta_for_multivcfanalyzer; ch_fasta_for_genotyping_angsd; ch_fasta_for_damagerescaling; ch_fasta_for_bcftools_stats } } // Check that fasta index file path ends in '.fai' if (params.fasta_index && !params.fasta_index.endsWith(".fai")) { exit 1, "The specified fasta index file (${params.fasta_index}) is not valid. Fasta index files should end in '.fai'." } // Check if genome exists in the config file. params.genomes is from igenomes.config, params.genome specified by user if (params.genomes && params.genome && !params.genomes.containsKey(params.genome)) { exit 1, "The provided genome '${params.genome}' is not available in the iGenomes file. Currently the available genomes are ${params.genomes.keySet().join(', ')}" } // Index files provided?
Then check whether they are correct and complete if( params.bwa_index && (params.mapper == 'bwaaln' || params.mapper == 'bwamem' || params.mapper == 'circularmapper')){ Channel .fromPath(params.bwa_index, checkIfExists: true) .ifEmpty { exit 1, "[nf-core/eager] error: bwa indices not found in: ${index_base}." } .into {bwa_index; bwa_index_bwamem} bt2_index = Channel.empty() } if( params.bt2_index && params.mapper == 'bowtie2' ){ lastPath = params.bt2_index.lastIndexOf(File.separator) bt2_dir = params.bt2_index.substring(0,lastPath+1) bt2_base = params.bt2_index.substring(lastPath+1) Channel .fromPath(params.bt2_index, checkIfExists: true) .ifEmpty { exit 1, "[nf-core/eager] error: bowtie2 indices not found in: ${bt2_dir}." } .into {bt2_index; bt2_index_bwamem} bwa_index = Channel.empty() bwa_index_bwamem = Channel.empty() } // Adapter removal adapter-list setup if ( !params.clip_adapters_list ) { Channel .fromPath("$projectDir/assets/nf-core_eager_dummy2.txt", checkIfExists: true) .ifEmpty { exit 1, "[nf-core/eager] error: adapters list file not found. Please check input. Supplied: --clip_adapters_list '${params.clip_adapters_list}'." } .collect() .set {ch_adapterlist} } else { Channel .fromPath("${params.clip_adapters_list}", checkIfExists: true) .ifEmpty { exit 1, "[nf-core/eager] error: adapters list file not found. Please check input. Supplied: --clip_adapters_list '${params.clip_adapters_list}'."
} .collect() .set {ch_adapterlist} } if ( params.snpcapture_bed ) { ch_snpcapture_bed = Channel.fromPath(params.snpcapture_bed, checkIfExists: true).collect() } else { ch_snpcapture_bed = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt").collect() } // Set up channel with pmdtools reference mask bedfile if (!params.pmdtools_reference_mask) { ch_bedfile_for_reference_masking = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt").collect() } else { ch_bedfile_for_reference_masking = Channel.fromPath(params.pmdtools_reference_mask, checkIfExists: true).collect() } // SexDetermination channel set up and bedfile validation if (!params.sexdeterrmine_bedfile) { ch_bed_for_sexdeterrmine = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt").collect() } else { ch_bed_for_sexdeterrmine = Channel.fromPath(params.sexdeterrmine_bedfile, checkIfExists: true).collect() } // pileupCaller channel generation and input checks for 'random sampling' genotyping if (!params.pileupcaller_bedfile) { ch_bed_for_pileupcaller = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt").collect() } else { ch_bed_for_pileupcaller = Channel.fromPath(params.pileupcaller_bedfile, checkIfExists: true).collect() } if (!params.pileupcaller_snpfile) { ch_snp_for_pileupcaller = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy2.txt").collect() } else { ch_snp_for_pileupcaller = Channel.fromPath(params.pileupcaller_snpfile, checkIfExists: true).collect() } // Create input channel for MALT database directory, checking directory exists if ( !params.database ) { ch_db_for_malt = Channel.empty() } else { ch_db_for_malt = Channel.fromPath(params.database, checkIfExists: true) } // Create input channel for MaltExtract taxon list, to allow downloading of taxon list, checking file exists. 
if ( !params.maltextract_taxon_list ) { ch_taxonlist_for_maltextract = Channel.empty() } else { ch_taxonlist_for_maltextract = Channel.fromPath(params.maltextract_taxon_list, checkIfExists: true) } // Create input channel for MaltExtract NCBI files, checking files exists. if ( !params.maltextract_ncbifiles ) { ch_ncbifiles_for_maltextract = Channel.empty() } else { ch_ncbifiles_for_maltextract = Channel.fromPath(params.maltextract_ncbifiles, checkIfExists: true) } //////////////////////////////////////////////////// /* -- Collect configuration parameters -- */ //////////////////////////////////////////////////// // Check if genome exists in the config file if (params.genomes && params.genome && !params.genomes.containsKey(params.genome)) { exit 1, "The provided genome '${params.genome}' is not available in the iGenomes file. Currently the available genomes are ${params.genomes.keySet().join(', ')}" } // Check AWS batch settings if (workflow.profile.contains('awsbatch')) { // AWSBatch sanity checking if (!params.awsqueue || !params.awsregion) exit 1, 'Specify correct --awsqueue and --awsregion parameters on AWSBatch!' // Check outdir paths to be S3 buckets if running on AWSBatch // related: https://github.com/nextflow-io/nextflow/issues/813 if (!params.outdir.startsWith('s3:')) exit 1, 'Outdir not on S3 - specify S3 Bucket to run on AWSBatch!' // Prevent trace files to be stored on S3 since S3 does not support rolling files. if (params.tracedir.startsWith('s3:')) exit 1, 'Specify a local tracedir or run without trace! S3 cannot be used for tracefiles.' } ch_multiqc_config = file("$projectDir/assets/multiqc_config.yaml", checkIfExists: true) ch_multiqc_custom_config = params.multiqc_config ? 
    Channel.fromPath(params.multiqc_config, checkIfExists: true) : Channel.empty()
ch_eager_logo = file("$projectDir/docs/images/nf-core_eager_logo_outline_drop.png")
ch_output_docs = file("$projectDir/docs/output.md", checkIfExists: true)
ch_output_docs_images = file("$projectDir/docs/images/", checkIfExists: true)
where_are_my_files = file("$projectDir/assets/where_are_my_files.txt")

///////////////////////////////////////////////////
/* -- INPUT FILE LOADING AND VALIDATING -- */
///////////////////////////////////////////////////

// Check that --input was supplied
if (!params.input) {
    exit 1, "[nf-core/eager] error: --input was not supplied! Please check '--help' or documentation under 'running the pipeline' for details"
}

// Read in files properly from TSV file
tsv_path = null
if (params.input && (has_extension(params.input, "tsv"))) tsv_path = params.input

ch_input_sample = Channel.empty()
if (tsv_path) {

    tsv_file = file(tsv_path)

    if (tsv_file instanceof List) exit 1, "[nf-core/eager] error: can only accept one TSV file per run."
    if (!tsv_file.exists()) exit 1, "[nf-core/eager] error: input TSV file could not be found. Does the file exist and is it in the right place? You gave the path: ${params.input}"

    ch_input_sample = extract_data(tsv_path)

} else if (params.input && !has_extension(params.input, "tsv")) {

    log.info ""
    log.info "No TSV file provided - creating TSV from supplied directory."
    log.info "Reading path(s): ${params.input}\n"
    inputSample = retrieve_input_paths(params.input, params.colour_chemistry, params.single_end, params.single_stranded, params.udg_type, params.bam)
    ch_input_sample = inputSample

} else exit 1, "[nf-core/eager] error: --input file(s) not correctly supplied or improperly defined, see '--help' flag and documentation under 'running the pipeline' for details."
ch_input_sample
    .into { ch_input_sample_downstream; ch_input_sample_check }

///////////////////////////////////////////////////
/* -- INPUT CHANNEL CREATION -- */
///////////////////////////////////////////////////

// Check we don't have any duplicate file names
ch_input_sample_check
    .map { it ->
        def r1 = file(it[8]).getName()
        def r2 = file(it[9]).getName()
        def bam = file(it[10]).getName()

        // Throw error and exit if the input bam has a name ending in '.unmapped.bam'
        if (bam.endsWith('.unmapped.bam')) {
            exit 1, "[nf-core/eager] error: Input BAM file names ending in '.unmapped.bam' are not allowed. Please rename your input BAM(s)."
        }

        [r1, r2, bam]
    }
    .collect()
    .map{ file ->
        filenames = file
        filenames -= 'NA'

        if( filenames.size() != filenames.unique().size() )
            exit 1, "[nf-core/eager] error: You have duplicate input FASTQ and/or BAM file names! All files must have unique names, different directories are not sufficient. Please check your input."
    }

// Drop samples with R1/R2 to fastQ channel, BAM samples to other channel
ch_branched_input = ch_input_sample_downstream.branch{
    fastq: it[8] != 'NA' //These are all fastqs
    bam: it[10] != 'NA' //These are all BAMs
}

//Removing BAM/BAI in case of a FASTQ input
ch_fastq_channel = ch_branched_input.fastq.map { samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2, bam ->
    [samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2]
}

//Removing R1/R2 in case of BAM input
ch_bam_channel = ch_branched_input.bam.map { samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2, bam ->
    [samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, bam]
}

// Prepare starting channels, here we go
ch_input_for_convertbam = Channel.empty()

ch_bam_channel
    .into { ch_input_for_convertbam; ch_input_for_indexbam; }

// Also need to send raw files for lane merging, if we want host-removed FASTQ
ch_fastq_channel
    .into { ch_input_for_skipconvertbam;
ch_input_for_lanemerge_hostremovalfastq } //////////////////////////////////////////////////// /* -- PRINT PARAMETER SUMMARY -- */ //////////////////////////////////////////////////// log.info NfcoreSchema.params_summary_log(workflow, params, json_schema) // Header log info def summary = [:] if (workflow.revision) summary['Pipeline Release'] = workflow.revision summary['Run Name'] = workflow.runName summary['Input'] = params.input summary['Fasta Ref'] = params.fasta summary['Max Resources'] = "$params.max_memory memory, $params.max_cpus cpus, $params.max_time time per job" if (workflow.containerEngine) summary['Container'] = "$workflow.containerEngine - $workflow.container" summary['Output dir'] = params.outdir summary['Launch dir'] = workflow.launchDir summary['Working dir'] = workflow.workDir summary['Script dir'] = workflow.projectDir summary['User'] = workflow.userName if (workflow.profile.contains('awsbatch')) { summary['AWS Region'] = params.awsregion summary['AWS Queue'] = params.awsqueue summary['AWS CLI'] = params.awscli } summary['Config Profile'] = workflow.profile if (params.config_profile_description) summary['Config Profile Description'] = params.config_profile_description if (params.config_profile_contact) summary['Config Profile Contact'] = params.config_profile_contact if (params.config_profile_url) summary['Config Profile URL'] = params.config_profile_url summary['Config Files'] = workflow.configFiles.join(', ') if (params.email || params.email_on_fail) { summary['E-mail Address'] = params.email summary['E-mail on failure'] = params.email_on_fail summary['MultiQC maxsize'] = params.max_multiqc_email_size } Channel.from(summary.collect{ [it.key, it.value] }) .map { k,v -> "
<dt>$k</dt><dd><samp>${v ?: 'N/A'}</samp></dd>
" } .reduce { a, b -> return [a, b].join("\n ") } .map { x -> """ id: 'nf-core-eager-summary' description: " - this information is collected when the pipeline is started." section_name: 'nf-core/eager Workflow Summary' section_href: 'https://github.com/nf-core/eager' plot_type: 'html' data: |
        <dl class=\"dl-horizontal\">
            $x
        </dl>
""".stripIndent() } .set { ch_workflow_summary } // Check the hostnames against configured profiles checkHostname() log.info "Schaffa, Schaffa, Genome Baua!" /////////////////////////////////////////////////// /* -- REFERENCE FASTA INDEXING -- */ /////////////////////////////////////////////////// // BWA Index if( !params.bwa_index && params.fasta && (params.mapper == 'bwaaln' || params.mapper == 'bwamem' || params.mapper == 'circularmapper')){ process makeBWAIndex { label 'sc_medium' tag "${fasta}" publishDir path: "${params.outdir}/reference_genome/bwa_index", mode: params.publish_dir_mode, saveAs: { filename -> if (params.save_reference) filename else if(!params.save_reference && filename == "where_are_my_files.txt") filename else null } input: path fasta from ch_fasta_for_bwaindex path where_are_my_files output: path "BWAIndex" into (bwa_index, bwa_index_bwamem) path "where_are_my_files.txt" script: """ bwa index $fasta mkdir BWAIndex && mv ${fasta}* BWAIndex """ } bt2_index = Channel.empty() } // bowtie2 Index if( !params.bt2_index && params.fasta && params.mapper == "bowtie2"){ process makeBT2Index { label 'mc_medium' tag "${fasta}" publishDir path: "${params.outdir}/reference_genome/bt2_index", mode: params.publish_dir_mode, saveAs: { filename -> if (params.save_reference) filename else if(!params.save_reference && filename == "where_are_my_files.txt") filename else null } input: path fasta from ch_fasta_for_bt2index path where_are_my_files output: path "BT2Index" into (bt2_index) path "where_are_my_files.txt" script: """ bowtie2-build --threads ${task.cpus} $fasta $fasta mkdir BT2Index && mv ${fasta}* BT2Index """ } bwa_index = Channel.empty() bwa_index_bwamem = Channel.empty() } // FASTA Index (FAI) if (params.fasta_index) { Channel .fromPath( params.fasta_index ) .set { ch_fai_for_skipfastaindexing } } else { Channel .empty() .set { ch_fai_for_skipfastaindexing } } process makeFastaIndex { label 'sc_small' tag "${fasta}" publishDir path: 
"${params.outdir}/reference_genome/fasta_index", mode: params.publish_dir_mode, saveAs: { filename -> if (params.save_reference) filename else if(!params.save_reference && filename == "where_are_my_files.txt") filename else null } when: !params.fasta_index && params.fasta input: path fasta from ch_fasta_for_faidx path where_are_my_files output: path "*.fai" into ch_fasta_faidx_index path "where_are_my_files.txt" script: """ samtools faidx $fasta """ } ch_fai_for_skipfastaindexing.mix(ch_fasta_faidx_index) .into { ch_fai_for_damageprofiler; ch_fai_for_ug; ch_fai_for_hc; ch_fai_for_freebayes; ch_fai_for_pileupcaller; ch_fai_for_angsd } // Stage dict index file if supplied, else load it into the channel if (params.seq_dict) { Channel .fromPath( params.seq_dict ) .set { ch_dict_for_skipdict } } else { Channel .empty() .set { ch_dict_for_skipdict } } process makeSeqDict { label 'sc_medium' tag "${fasta}" publishDir path: "${params.outdir}/reference_genome/seq_dict", mode: params.publish_dir_mode, saveAs: { filename -> if (params.save_reference) filename else if(!params.save_reference && filename == "where_are_my_files.txt") filename else null } when: !params.seq_dict && params.fasta input: path fasta from ch_fasta_for_seqdict path where_are_my_files output: path "*.dict" into ch_seq_dict path "where_are_my_files.txt" script: """ picard -Xmx${task.memory.toMega()}M CreateSequenceDictionary R=$fasta O="${fasta.baseName}.dict" """ } ch_dict_for_skipdict.mix(ch_seq_dict) .into { ch_dict_for_ug; ch_dict_for_hc; ch_dict_for_freebayes; ch_dict_for_pileupcaller; ch_dict_for_angsd } ////////////////////////////////////////////////// /* -- BAM INPUT PREPROCESSING -- */ ////////////////////////////////////////////////// // Convert to FASTQ if re-mapping is requested process convertBam { label 'mc_small' tag "$libraryid" when: params.run_convertinputbam input: tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, path(bam) from ch_input_for_convertbam 
    output:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, path("*fastq.gz"), val('NA') into ch_output_from_convertbam

    script:
    base = "${bam.baseName}"
    """
    samtools fastq -t ${bam} | pigz -p ${task.cpus} > ${base}.converted.fastq.gz
    """
}

// If not converted to FASTQ, generate a pipeline-compatible BAM index file (i.e. with correct samtools version)
process indexinputbam {
    label 'sc_small'
    tag "$libraryid"

    when:
    bam != 'NA' && !params.run_convertinputbam

    input:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, path(bam) from ch_input_for_indexbam

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), file("*.{bai,csi}") into ch_indexbam_for_filtering

    script:
    def size = params.large_ref ? '-c' : ''
    """
    samtools index ${bam} ${size}
    """
}

// convertbam bypass
ch_input_for_skipconvertbam.mix(ch_output_from_convertbam)
    .into { ch_convertbam_for_fastp; ch_convertbam_for_fastqc }

//////////////////////////////////////////////////
/* -- SEQUENCING QC AND FASTQ PREPROCESSING -- */
//////////////////////////////////////////////////

// Raw sequencing QC - allows the user to evaluate whether the sequencing is any good
process fastqc {
    label 'mc_small'
    tag "${libraryid}_L${lane}"
    publishDir "${params.outdir}/fastqc/input_fastq", mode: params.publish_dir_mode,
        saveAs: { filename ->
            filename.indexOf(".zip") > 0 ? "zips/$filename" : "$filename"
        }

    input:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_convertbam_for_fastqc

    output:
    path "*_fastqc.{zip,html}" into ch_prefastqc_for_multiqc

    when:
    !params.skip_fastqc

    script:
    if ( seqtype == 'PE' ) {
    """
    fastqc -t ${task.cpus} -q $r1 $r2
    rename 's/_fastqc\\.zip\$/_raw_fastqc.zip/' *_fastqc.zip
    rename 's/_fastqc\\.html\$/_raw_fastqc.html/' *_fastqc.html
    """
    } else {
    """
    fastqc -t ${task.cpus} -q $r1
    rename 's/_fastqc\\.zip\$/_raw_fastqc.zip/' *_fastqc.zip
    rename 's/_fastqc\\.html\$/_raw_fastqc.html/' *_fastqc.html
    """
    }
}

// Poly-G clipping for 2-colour chemistry sequencers, to reduce erroneous mapping of sequencing artefacts
if (params.complexity_filter_poly_g) {
    ch_input_for_fastp = ch_convertbam_for_fastp.branch{
        twocol: it[3] == '2' // Nextseq/Novaseq data with possible sequencing artefact
        fourcol: it[3] == '4' // HiSeq/MiSeq data where polyGs would be true
    }
} else {
    ch_input_for_fastp = ch_convertbam_for_fastp.branch{
        twocol: it[3] == "dummy" // Never matches: poly-G clipping is off, so nothing is sent to fastp
        fourcol: it[3] == '4' || it[3] == '2' // All data (2- and 4-colour) bypass fastp
    }
}

process fastp {
    label 'mc_small'
    tag "${libraryid}_L${lane}"
    publishDir "${params.outdir}/FastP", mode: params.publish_dir_mode

    when:
    params.complexity_filter_poly_g

    input:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_input_for_fastp.twocol

    output:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, path("*.pG.fq.gz") into ch_output_from_fastp
    path("*.json") into ch_fastp_for_multiqc

    script:
    if( seqtype == 'SE' ){
    """
    fastp --in1 ${r1} --out1 "${r1.baseName}.pG.fq.gz" -A -g --poly_g_min_len "${params.complexity_filter_poly_g_min}" -Q -L -w ${task.cpus} --json "${r1.baseName}"_L${lane}_fastp.json
    """
    } else {
    """
    fastp --in1 ${r1} --in2 ${r2} --out1 "${r1.baseName}.pG.fq.gz" --out2
"${r2.baseName}.pG.fq.gz" -A -g --poly_g_min_len "${params.complexity_filter_poly_g_min}" -Q -L -w ${task.cpus} --json "${libraryid}"_L${lane}_polyg_fastp.json """ } } // Colour column only useful for fastp, so dropping now to reduce complexity downstream ch_input_for_fastp.fourcol .map { def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[4] def organism = it[5] def strandedness = it[6] def udg = it[7] def r1 = it[8] def r2 = seqtype == "PE" ? it[9] : file("$projectDir/assets/nf-core_eager_dummy.txt") [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } .set { ch_skipfastp_for_merge } ch_output_from_fastp .map{ def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[4] def organism = it[5] def strandedness = it[6] def udg = it[7] def r1 = it[8] instanceof ArrayList ? it[8].sort()[0] : it[8] def r2 = seqtype == "PE" ? it[8].sort()[1] : file("$projectDir/assets/nf-core_eager_dummy.txt") [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } .set{ ch_fastp_for_merge } ch_skipfastp_for_merge.mix(ch_fastp_for_merge) .into { ch_fastp_for_adapterremoval; ch_fastp_for_skipadapterremoval } // Sequencing adapter clipping and optional paired-end merging in preparation for mapping process adapter_removal { label 'mc_small' tag "${libraryid}_L${lane}" publishDir "${params.outdir}/adapterremoval", mode: params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_fastp_for_adapterremoval path adapterlist from ch_adapterlist.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("output/*{combined.fq,.se.truncated,pair1.truncated}.gz") into ch_output_from_adapterremoval_r1 tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("output/*pair2.truncated.gz") optional true into ch_output_from_adapterremoval_r2 tuple samplename, libraryid, lane, 
seqtype, organism, strandedness, udg, path("output/*.settings") into ch_adapterremoval_logs when: !params.skip_adapterremoval script: def base = "${r1.baseName}_L${lane}" def adapters_to_remove = !params.clip_adapters_list ? "--adapter1 ${params.clip_forward_adaptor} --adapter2 ${params.clip_reverse_adaptor}" : "--adapter-list ${adapterlist}" //This checks whether we skip trimming and defines a variable respectively def preserve5p = params.preserve5p ? '--preserve5p' : '' // applies to any AR command - doesn't affect output file combination if ( seqtype == 'PE' && !params.skip_collapse && !params.skip_trim && !params.mergedonly && !params.preserve5p ) { """ mkdir -p output AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap} cat *.collapsed.gz *.collapsed.truncated.gz *.singleton.truncated.gz *.pair1.truncated.gz *.pair2.truncated.gz > output/${base}.pe.combined.tmp.fq.gz mv *.settings output/ ## Add R_ and L_ for unmerged reads for DeDup compatibility AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz """ //PE mode, collapse and trim, outputting all reads, preserving 5p } else if (seqtype == 'PE' && !params.skip_collapse && !params.skip_trim && !params.mergedonly && params.preserve5p) { """ mkdir -p output AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap} cat *.collapsed.gz *.singleton.truncated.gz 
*.pair1.truncated.gz *.pair2.truncated.gz > output/${base}.pe.combined.tmp.fq.gz mv *.settings output/ ## Add R_ and L_ for unmerged reads for DeDup compatibility AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz """ // PE mode, collapse and trim but only output collapsed reads } else if ( seqtype == 'PE' && !params.skip_collapse && !params.skip_trim && params.mergedonly && !params.preserve5p ) { """ mkdir -p output AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap} cat *.collapsed.gz *.collapsed.truncated.gz > output/${base}.pe.combined.tmp.fq.gz ## Add R_ and L_ for unmerged reads for DeDup compatibility AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz mv *.settings output/ """ // PE mode, collapse and trim but only output collapsed reads, preserving 5p } else if ( seqtype == 'PE' && !params.skip_collapse && !params.skip_trim && params.mergedonly && params.preserve5p ) { """ mkdir -p output AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap} cat *.collapsed.gz > output/${base}.pe.combined.tmp.fq.gz ## Add R_ and L_ for unmerged reads for DeDup compatibility AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz mv 
*.settings output/ """ // PE mode, collapsing but skip trim, (output all reads). Note: seems to still generate `truncated` files for some reason, so merging for safety. // Will still do default AR length filtering I guess } else if ( seqtype == 'PE' && !params.skip_collapse && params.skip_trim && !params.mergedonly ) { """ mkdir -p output AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --adapter1 "" --adapter2 "" cat *.collapsed.gz *.pair1.truncated.gz *.pair2.truncated.gz > output/${base}.pe.combined.tmp.fq.gz ## Add R_ and L_ for unmerged reads for DeDup compatibility AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz mv *.settings output/ """ // PE mode, collapsing but skip trim, and only output collapsed reads. Note: seems to still generate `truncated` files for some reason, so merging for safety. // Will still do default AR length filtering I guess } else if ( seqtype == 'PE' && !params.skip_collapse && params.skip_trim && params.mergedonly ) { """ mkdir -p output AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --adapter1 "" --adapter2 "" cat *.collapsed.gz > output/${base}.pe.combined.tmp.fq.gz ## Add R_ and L_ for unmerged reads for DeDup compatibility AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz mv *.settings output/ """ // PE mode, skip collapsing but trim (output all reads, as merging not possible) - activates paired-end mapping! 
    } else if ( seqtype == 'PE' && params.skip_collapse && !params.skip_trim ) {
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap}
    mv ${base}.pe.pair*.truncated.gz *.settings output/
    """
    } else if ( seqtype != 'PE' && !params.skip_trim ) {
    // SE, collapse not possible, trim reads only
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --basename ${base}.se --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap}
    mv *.settings *.se.truncated.gz output/
    """
    } else if ( seqtype != 'PE' && params.skip_trim ) {
    // SE, collapse not possible, trimming skipped (empty adapters supplied)
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --basename ${base}.se --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} ${preserve5p} --adapter1 "" --adapter2 ""
    mv *.settings *.se.truncated.gz output/
    """
    }
}

// When not collapsing paired-end data, re-merge the R1 and R2 files into a single map. Otherwise, if SE or collapsed PE, R2 now becomes NA
// Sort to make sure we get consistently ordered R1 and R2 when using `-resume`, even if not needed for FastQC
if ( params.skip_collapse ){
    ch_output_from_adapterremoval_r1
        .mix(ch_output_from_adapterremoval_r2)
        .groupTuple(by: [0,1,2,3,4,5,6])
        .map{ it ->
            def samplename = it[0]
            def libraryid = it[1]
            def lane = it[2]
            def seqtype = it[3]
            def organism = it[4]
            def strandedness = it[5]
            def udg = it[6]
            def r1 = file(it[7].sort()[0])
            def r2 = seqtype == "PE" ?
                file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt")

            [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]
        }
        .into { ch_output_from_adapterremoval; ch_adapterremoval_for_postfastqc }
} else {
    ch_output_from_adapterremoval_r1
        .map{ it ->
            def samplename = it[0]
            def libraryid = it[1]
            def lane = it[2]
            def seqtype = it[3]
            def organism = it[4]
            def strandedness = it[5]
            def udg = it[6]
            def r1 = file(it[7])
            def r2 = file("$projectDir/assets/nf-core_eager_dummy.txt")

            [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]
        }
        .into { ch_output_from_adapterremoval; ch_adapterremoval_for_postfastqc }
}

// AdapterRemoval bypass when not running it
if (!params.skip_adapterremoval) {
    ch_output_from_adapterremoval.mix(ch_fastp_for_skipadapterremoval)
        .filter { it =~/.*combined.fq.gz|.*truncated.gz/ }
        .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; }
} else {
    ch_fastp_for_skipadapterremoval
        .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; }
}

// Post AR fastq trimming
process post_ar_fastq_trimming {
    label 'mc_small'
    tag "${libraryid}"
    publishDir "${params.outdir}/post_ar_fastq_trimmed", mode: params.publish_dir_mode

    when:
    params.run_post_ar_trimming

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_adapterremoval_for_post_ar_trimming

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R1_postartrimmed.fq.gz") into ch_post_ar_trimming_for_lanemerge_r1
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R2_postartrimmed.fq.gz") optional true into ch_post_ar_trimming_for_lanemerge_r2

    script:
    if ( seqtype == 'SE' || (seqtype == 'PE' && !params.skip_collapse) ) {
    """
    fastp --in1 ${r1} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} -A -G -Q -L -w ${task.cpus} --out1
"${libraryid}"_L"${lane}"_R1_postartrimmed.fq.gz """ } else if ( seqtype == 'PE' && params.skip_collapse ) { """ fastp --in1 ${r1} --in2 ${r2} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} --trim_front2 ${params.post_ar_trim_front2} --trim_tail2 ${params.post_ar_trim_tail2} -A -G -Q -L -w ${task.cpus} --out1 "${libraryid}"_L"${lane}"_R1_postartrimmed.fq.gz --out2 "${libraryid}"_L"${lane}"_R2_postartrimmed.fq.gz """ } } // When not collapsing paired-end data, re-merge the R1 and R2 files into single map. Otherwise if SE or collapsed PE, R2 now becomes NA // Sort to make sure we get consistent R1 and R2 ordered when using `-resume`, even if not needed for FastQC if ( params.skip_collapse ){ ch_post_ar_trimming_for_lanemerge_r1 .mix(ch_post_ar_trimming_for_lanemerge_r2) .groupTuple(by: [0,1,2,3,4,5,6]) .map{ it -> def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def r1 = file(it[7].sort()[0]) def r2 = seqtype == "PE" ? 
file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt") [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } .set { ch_post_ar_trimming_for_lanemerge; } } else { ch_post_ar_trimming_for_lanemerge_r1 .map{ it -> def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def r1 = file(it[7]) def r2 = file("$projectDir/assets/nf-core_eager_dummy.txt") [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } .set { ch_post_ar_trimming_for_lanemerge; } } // Inline barcode removal bypass when not running it if (params.run_post_ar_trimming) { ch_post_ar_trimming_for_lanemerge .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } } else { ch_adapterremoval_for_skip_post_ar_trimming .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } } // Lane merging for libraries sequenced over multiple lanes (e.g. 
NextSeq) ch_branched_for_lanemerge = ch_inlinebarcoderemoval_for_lanemerge .groupTuple(by: [0,1,3,4,5,6]) .map { it -> def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def r1 = it[7] def r2 = it[8] [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } .branch { skip_merge: it[7].size() == 1 // Can skip merging if only single lanes merge_me: it[7].size() > 1 } ch_branched_for_lanemerge_skipme = ch_branched_for_lanemerge.skip_merge .map{ it -> def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def r1 = it[7][0] def r2 = it[8][0] [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } ch_branched_for_lanemerge_ready = ch_branched_for_lanemerge.merge_me .map{ it -> def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def r1 = it[7] // find and remove duplicate dummies to prevent file collision error def r2 = it[8]*.toString() r2.removeAll{ it == "$projectDir/assets/nf-core_eager_dummy.txt" } [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } process lanemerge { label 'sc_tiny' tag "${libraryid}" publishDir "${params.outdir}/lanemerging", mode: params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_branched_for_lanemerge_ready output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R1_lanemerged.fq.gz") into ch_lanemerge_for_mapping_r1 tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R2_lanemerged.fq.gz") optional true into ch_lanemerge_for_mapping_r2 script: if ( seqtype == 'PE' && ( params.skip_collapse || params.skip_adapterremoval ) ){ def lane = 0 """ cat ${r1} > 
"${libraryid}"_R1_lanemerged.fq.gz cat ${r2} > "${libraryid}"_R2_lanemerged.fq.gz """ } else { """ cat ${r1} > "${libraryid}"_R1_lanemerged.fq.gz """ } } // Ensuring always valid R2 file even if doesn't exist for AWS if ( ( params.skip_collapse || params.skip_adapterremoval ) ) { ch_lanemerge_for_mapping_r1 .mix(ch_lanemerge_for_mapping_r2) .groupTuple(by: [0,1,2,3,4,5,6]) .map{ it -> def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def r1 = file(it[7].sort()[0]) def r2 = seqtype == "PE" ? file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt") [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } .mix(ch_branched_for_lanemerge_skipme) .into { ch_lanemerge_for_skipmap; ch_lanemerge_for_bwa; ch_lanemerge_for_cm; ch_lanemerge_for_bwamem; ch_lanemerge_for_bt2 } } else { ch_lanemerge_for_mapping_r1 .map{ it -> def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def r1 = file(it[7]) def r2 = file("$projectDir/assets/nf-core_eager_dummy.txt") [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } .mix(ch_branched_for_lanemerge_skipme) .into { ch_lanemerge_for_skipmap; ch_lanemerge_for_bwa; ch_lanemerge_for_cm; ch_lanemerge_for_bwamem; ch_lanemerge_for_bt2 } } // ENA upload doesn't do separate lanes, so merge raw FASTQs for mapped-reads removal // Per-library lane grouping done within process process lanemerge_hostremoval_fastq { label 'sc_tiny' tag "${libraryid}" when: params.hostremoval_input_fastq input: tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_input_for_lanemerge_hostremovalfastq.groupTuple(by: [0,1,3,4,5,6,7]) output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.fq.gz") into ch_fastqlanemerge_for_hostremovalfastq 
script: if ( seqtype == 'PE' ){ lane = 0 """ cat ${r1} > "${libraryid}"_R1_lanemerged.fq.gz cat ${r2} > "${libraryid}"_R2_lanemerged.fq.gz """ } else { """ cat ${r1} > "${libraryid}"_R1_lanemerged.fq.gz """ } } // Post-preprocessing QC to help user check pre-processing removed all sequencing artefacts. If doing post-AR trimming includes this step in output. process fastqc_after_clipping { label 'mc_small' tag "${libraryid}_L${lane}" publishDir "${params.outdir}/fastqc/after_clipping", mode: params.publish_dir_mode, saveAs: { filename -> filename.indexOf(".zip") > 0 ? "zips/$filename" : "$filename" } when: !params.skip_adapterremoval && !params.skip_fastqc input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_inlinebarcoderemoval_for_fastqc_after_clipping output: path("*_fastqc.{zip,html}") into ch_fastqc_after_clipping script: if ( params.skip_collapse && seqtype == 'PE' ) { """ fastqc -t ${task.cpus} -q ${r1} ${r2} """ } else { """ fastqc -t ${task.cpus} -q ${r1} """ } } ////////////////////////////////////////////////// /* -- READ MAPPING AND POSTPROCESSING -- */ ////////////////////////////////////////////////// // bwa aln as standard aDNA mapper process bwa { label 'mc_medium' tag "${libraryid}" publishDir "${params.outdir}/mapping/bwa", mode: params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_lanemerge_for_bwa path index from bwa_index.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.mapped.bam"), path("*.{bai,csi}") into ch_output_from_bwa when: params.mapper == 'bwaaln' script: def size = params.large_ref ? 
'-c' : '' def fasta = "${index}/${fasta_base}" //PE data without merging, PE data without any AR applied if ( seqtype == 'PE' && ( params.skip_collapse || params.skip_adapterremoval ) ){ """ bwa aln -t ${task.cpus} $fasta ${r1} -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -o ${params.bwaalno} -f ${libraryid}.r1.sai bwa aln -t ${task.cpus} $fasta ${r2} -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -o ${params.bwaalno} -f ${libraryid}.r2.sai bwa sampe -r "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $fasta ${libraryid}.r1.sai ${libraryid}.r2.sai ${r1} ${r2} | samtools sort -@ ${task.cpus - 1} -O bam - > ${libraryid}_"${seqtype}".mapped.bam samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} """ } else { //PE collapsed, or SE data """ bwa aln -t ${task.cpus} ${fasta} ${r1} -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -o ${params.bwaalno} -f ${libraryid}.sai bwa samse -r "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $fasta ${libraryid}.sai $r1 | samtools sort -@ ${task.cpus - 1} -O bam - > "${libraryid}"_"${seqtype}".mapped.bam samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} """ } } // bwa mem for more complex or for modern data mapping process bwamem { label 'mc_medium' tag "$libraryid" publishDir "${params.outdir}/mapping/bwamem", mode: params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_lanemerge_for_bwamem path index from bwa_index_bwamem.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.mapped.bam"), path("*.{bai,csi}") into ch_output_from_bwamem when: params.mapper == 'bwamem' script: def split_cpus = Math.floor(task.cpus/2) def fasta = "${index}/${fasta_base}" def size = 
params.large_ref ? '-c' : '' if (!params.single_end && params.skip_collapse){ """ bwa mem -t ${split_cpus} $fasta $r1 $r2 -R "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" | samtools sort -@ ${split_cpus} -O bam - > "${libraryid}"_"${seqtype}".mapped.bam samtools index ${size} -@ ${task.cpus} "${libraryid}"_"${seqtype}".mapped.bam """ } else { """ bwa mem -t ${split_cpus} $fasta $r1 -R "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" | samtools sort -@ ${split_cpus} -O bam - > "${libraryid}"_"${seqtype}".mapped.bam samtools index -@ ${task.cpus} "${libraryid}"_"${seqtype}".mapped.bam ${size} """ } } // CircularMapper reference preparation and mapping for circular genomes e.g. mtDNA process circulargenerator{ label 'sc_medium' tag "$prefix" publishDir "${params.outdir}/reference_genome/circularmapper_index", mode: params.publish_dir_mode, saveAs: { filename -> if (params.save_reference) filename else if(!params.save_reference && filename == "where_are_my_files.txt") filename else null } input: file fasta from ch_fasta_for_circulargenerator output: file "${prefix}.{amb,ann,bwt,sa,pac}" into ch_circularmapper_indices file "*_elongated" into ch_circularmapper_elongatedfasta when: params.mapper == 'circularmapper' script: prefix = "${fasta.baseName}_${params.circularextension}.${fasta.extension}" """ circulargenerator -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i $fasta -s ${params.circulartarget} bwa index $prefix """ } process circularmapper{ label 'mc_medium' tag "$libraryid" publishDir "${params.outdir}/mapping/circularmapper", mode: params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_lanemerge_for_cm file index from ch_circularmapper_indices.collect() file fasta from 
ch_fasta_for_circularmapper.collect() file elongated from ch_circularmapper_elongatedfasta.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.mapped.bam"), file("*.{bai,csi}") into ch_output_from_cm when: params.mapper == 'circularmapper' script: def filter = params.circularfilter ? '-f true -x true' : '' def elongated_root = "${fasta.baseName}_${params.circularextension}.${fasta.extension}" def size = params.large_ref ? '-c' : '' if (!params.single_end && params.skip_collapse ){ """ bwa aln -t ${task.cpus} $elongated_root $r1 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.r1.sai bwa aln -t ${task.cpus} $elongated_root $r2 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.r2.sai bwa sampe -r "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $elongated_root ${libraryid}.r1.sai ${libraryid}.r2.sai $r1 $r2 > tmp.out realignsamfile -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i tmp.out -r $fasta $filter samtools sort -@ ${task.cpus} -O bam tmp_realigned.bam > ${libraryid}_"${seqtype}".mapped.bam samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} """ } else { """ bwa aln -t ${task.cpus} $elongated_root $r1 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.sai bwa samse -r "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $elongated_root ${libraryid}.sai $r1 > tmp.out realignsamfile -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i tmp.out -r $fasta $filter samtools sort -@ ${task.cpus} -O bam tmp_realigned.bam > "${libraryid}"_"${seqtype}".mapped.bam samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} """ } } process bowtie2 { label 'mc_medium' tag "${libraryid}" publishDir "${params.outdir}/mapping/bt2", mode: 
params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_lanemerge_for_bt2 path index from bt2_index.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.mapped.bam"), path("*.{bai,csi}") into ch_output_from_bt2 tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_bt2.log") into ch_bt2_for_multiqc when: params.mapper == 'bowtie2' script: def split_cpus = Math.floor(task.cpus/2) def size = params.large_ref ? '-c' : '' def fasta = "${index}/${fasta_base}" def trim5 = params.bt2_trim5 != 0 ? "--trim5 ${params.bt2_trim5}" : "" def trim3 = params.bt2_trim3 != 0 ? "--trim3 ${params.bt2_trim3}" : "" def bt2n = params.bt2n != 0 ? "-N ${params.bt2n}" : "" def bt2l = params.bt2l != 0 ? "-L ${params.bt2l}" : "" if ( "${params.bt2_alignmode}" == "end-to-end" ) { switch ( "${params.bt2_sensitivity}" ) { case "no-preset": sensitivity = ""; break case "very-fast": sensitivity = "--very-fast"; break case "fast": sensitivity = "--fast"; break case "sensitive": sensitivity = "--sensitive"; break case "very-sensitive": sensitivity = "--very-sensitive"; break default: sensitivity = ""; break } } else if ("${params.bt2_alignmode}" == "local") { switch ( "${params.bt2_sensitivity}" ) { case "no-preset": sensitivity = ""; break case "very-fast": sensitivity = "--very-fast-local"; break case "fast": sensitivity = "--fast-local"; break case "sensitive": sensitivity = "--sensitive-local"; break case "very-sensitive": sensitivity = "--very-sensitive-local"; break default: sensitivity = ""; break } } //PE data without merging, PE data without any AR applied if ( seqtype == 'PE' && ( params.skip_collapse || params.skip_adapterremoval ) ){ """ bowtie2 -x ${fasta} -1 ${r1} -2 ${r2} -p ${split_cpus} ${sensitivity} ${bt2n} ${bt2l} ${trim5} ${trim3} --maxins ${params.bt2_maxins} --rg-id ILLUMINA-${samplename}_${libraryid} --rg SM:${samplename} --rg 
LB:${libraryid} --rg PL:illumina --rg PU:ILLUMINA-${libraryid}-${seqtype} 2> "${libraryid}"_bt2.log | samtools sort -@ ${split_cpus} -O bam > "${libraryid}"_"${seqtype}".mapped.bam samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} """ } else { //PE collapsed, or SE data """ bowtie2 -x ${fasta} -U ${r1} -p ${split_cpus} ${sensitivity} ${bt2n} ${bt2l} ${trim5} ${trim3} --rg-id ILLUMINA-${samplename}_${libraryid} --rg SM:${samplename} --rg LB:${libraryid} --rg PL:illumina --rg PU:ILLUMINA-${libraryid}-${seqtype} 2> "${libraryid}"_bt2.log | samtools sort -@ ${split_cpus} -O bam > "${libraryid}"_"${seqtype}".mapped.bam samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} """ } } // Gather all mapped BAMs from all possible mappers into common channels to send downstream ch_output_from_bwa.mix(ch_output_from_bwamem, ch_output_from_cm, ch_indexbam_for_filtering, ch_output_from_bt2) .into { ch_mapping_for_hostremovalfastq; ch_mapping_for_seqtype_merging } // Synchronise the mapped input FASTQ and input non-remapped BAM channels ch_fastqlanemerge_for_hostremovalfastq .map { def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def r1 = seqtype == "PE" ? file(it[7].sort()[0]) : file(it[7]) def r2 = seqtype == "PE" ? 
file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt")

      [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]

    }
    .mix(ch_mapping_for_hostremovalfastq)
    .groupTuple(by: [0,1,3,4,5,6])
    .map {
      def samplename = it[0]
      def libraryid = it[1]
      def lane = it[2]
      def seqtype = it[3]
      def organism = it[4]
      def strandedness = it[5]
      def udg = it[6]
      def r1 = it[7][0]
      def r2 = it[8][0]
      def bam = it[7][1]
      def bai = it[8][1]

      [ samplename, libraryid, seqtype, organism, strandedness, udg, r1, r2, bam, bai ]

    }
    .filter{ it[8] != null }
    .set { ch_synced_for_hostremovalfastq }

// Remove mapped reads from original (lane merged) input FASTQ e.g. for sensitive host data when running metagenomic data

process hostremoval_input_fastq {
    label 'mc_medium'
    tag "${libraryid}"
    publishDir "${params.outdir}/hostremoved_fastq", mode: params.publish_dir_mode

    when:
    params.hostremoval_input_fastq

    input:
    tuple samplename, libraryid, seqtype, organism, strandedness, udg, file(r1), file(r2), file(bam), file(bai) from ch_synced_for_hostremovalfastq

    output:
    tuple samplename, libraryid, seqtype, organism, strandedness, udg, file("*.fq.gz") into ch_output_from_hostremovalfastq

    script:
    def merged = params.skip_collapse ? "": "-merged"
    if ( seqtype == 'SE' ) {
        out_fwd = bam.baseName+'.hostremoved.fq.gz'
        """
        samtools index $bam
        extract_map_reads.py $bam ${r1} -m ${params.hostremoval_mode} $merged -of $out_fwd -t ${task.cpus}
        """
    } else {
        out_fwd = bam.baseName+'.hostremoved.fwd.fq.gz'
        out_rev = bam.baseName+'.hostremoved.rev.fq.gz'
        """
        samtools index $bam
        extract_map_reads.py $bam ${r1} -rev ${r2} -m ${params.hostremoval_mode} $merged -of $out_fwd -or $out_rev -t ${task.cpus}
        """
    }
}

// Seqtype merging to combine paired-end with single-end sequencing data of the same libraries
// goes here; the result goes into flagstat, filtering etc. Important: this type of merge isn't technically valid for DeDup
// and should only be used with markduplicates!
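// Illustrative sketch only (hypothetical values, not part of the pipeline logic):
// after the groupTuple that follows, it[3] holds the seqtype of every BAM of a
// library, and the map step collapses that list to a single value, e.g.
//   ['PE', 'SE'].unique().flatten().size() > 1   // true -> seqtype_new = 'PE' (plus a warning when using DeDup)
//   ['SE', 'SE'].unique().flatten()[0]           // 'SE' -> seqtype_new = 'SE'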
ch_branched_for_seqtypemerge = ch_mapping_for_seqtype_merging
  .groupTuple(by: [0,1,4,5,6])
  .map {
    it ->
      def samplename = it[0]
      def libraryid = it[1]
      def lane = 0
      def seqtype = it[3].unique() // How to deal with this?
      def organism = it[4]
      def strandedness = it[5]
      def udg = it[6]
      def r1 = it[7]
      def r2 = it[8]

      // 1. We will assume if mixing it is better to set as PE as this is informative
      //    for DeDup (and markduplicates doesn't care), but will throw a warning!
      // 2. We will also flatten to a single value to address problems with 'unstable'
      //    Nextflow ArrayBag object types not allowing the .join to work between resumes
      //    See: https://github.com/nf-core/eager/issues/880

      def seqtype_new = seqtype.flatten().size() > 1 ? 'PE' : seqtype.flatten()[0]

      if ( seqtype.flatten().size() > 1 && params.dedupper == 'dedup' ) {
        log.warn "[nf-core/eager] Warning: you are running DeDup on BAMs with a mixture of PE/SE data for library: ${libraryid}. DeDup is designed for PE data only, deduplication may be suboptimal!"
      }

      [ samplename, libraryid, lane, seqtype_new, organism, strandedness, udg, r1, r2 ]

  }
  .branch {
    skip_merge: it[7].size() == 1 // Can skip merging if only a single BAM per library
    merge_me: it[7].size() > 1
  }

process seqtype_merge {

  label 'sc_tiny'
  tag "$libraryid"

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_branched_for_seqtypemerge.merge_me

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*_seqtypemerged.bam"), file("*_seqtypemerged*.{bai,csi}") into ch_seqtypemerge_for_filtering

  script:
  def size = params.large_ref ?
'-c' : '' """ samtools merge ${libraryid}_seqtypemerged.bam ${bam} samtools index ${libraryid}_seqtypemerged.bam ${size} """ } ch_seqtypemerge_for_filtering .mix(ch_branched_for_seqtypemerge.skip_merge) .into { ch_seqtypemerged_for_skipfiltering; ch_seqtypemerged_for_samtools_filter; ch_seqtypemerged_for_samtools_flagstat } // Post-mapping QC process samtools_flagstat { label 'sc_tiny' tag "$libraryid" publishDir "${params.outdir}/samtools/stats", mode: params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_seqtypemerged_for_samtools_flagstat output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*stats") into ch_flagstat_for_multiqc,ch_flagstat_for_endorspy script: """ samtools flagstat $bam > ${libraryid}_flagstat.stats """ } // BAM filtering e.g. to extract unmapped reads for downstream or stricter mapping quality process samtools_filter { label 'mc_medium' tag "$libraryid" publishDir "${params.outdir}/samtools/filter", mode: params.publish_dir_mode, saveAs: {filename -> if (filename.indexOf(".fq.gz") > 0) "$filename" else if (filename.indexOf(".unmapped.bam") > 0) "$filename" else if (filename.indexOf(".filtered.bam")) "$filename" else null } when: params.run_bam_filtering input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_seqtypemerged_for_samtools_filter output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*filtered.bam"), file("*.{bai,csi}") into ch_output_from_filtering tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.unmapped.fastq.gz") optional true into ch_bam_filtering_for_metagenomic,ch_metagenomic_for_skipentropyfilter tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.unmapped.bam") optional true script: def size = params.large_ref ? 
'-c' : '' // Unmapped/MAPQ Filtering WITHOUT min-length filtering if ( "${params.bam_unmapped_type}" == "keep" && params.bam_filter_minreadlength == 0 ) { """ samtools view -h ${bam} -@ ${task.cpus} -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam samtools index ${libraryid}.filtered.bam ${size} """ } else if ( "${params.bam_unmapped_type}" == "discard" && params.bam_filter_minreadlength == 0 ){ """ samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam samtools index ${libraryid}.filtered.bam ${size} """ } else if ( "${params.bam_unmapped_type}" == "bam" && params.bam_filter_minreadlength == 0 ){ """ samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam samtools index ${libraryid}.filtered.bam ${size} """ } else if ( "${params.bam_unmapped_type}" == "fastq" && params.bam_filter_minreadlength == 0 ){ """ samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam samtools index ${libraryid}.filtered.bam ${size} ## FASTQ samtools fastq -tN ${libraryid}.unmapped.bam | pigz -p ${task.cpus - 1} > ${libraryid}.unmapped.fastq.gz rm ${libraryid}.unmapped.bam """ } else if ( "${params.bam_unmapped_type}" == "both" && params.bam_filter_minreadlength == 0 ){ """ samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam samtools index ${libraryid}.filtered.bam ${size} ## FASTQ samtools fastq -tN ${libraryid}.unmapped.bam | pigz -p ${task.cpus -1} > ${libraryid}.unmapped.fastq.gz """ // Unmapped/MAPQ Filtering WITH min-length filtering } else if ( "${params.bam_unmapped_type}" == "keep" && 
params.bam_filter_minreadlength != 0 ) { """ samtools view -h ${bam} -@ ${task.cpus} -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam samtools index ${libraryid}.filtered.bam ${size} """ } else if ( "${params.bam_unmapped_type}" == "discard" && params.bam_filter_minreadlength != 0 ){ """ samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam samtools index ${libraryid}.filtered.bam ${size} """ } else if ( "${params.bam_unmapped_type}" == "bam" && params.bam_filter_minreadlength != 0 ){ """ samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam samtools index ${libraryid}.filtered.bam ${size} """ } else if ( "${params.bam_unmapped_type}" == "fastq" && params.bam_filter_minreadlength != 0 ){ """ samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam samtools index ${libraryid}.filtered.bam ${size} ## FASTQ samtools fastq -tN ${libraryid}.unmapped.bam | pigz -p ${task.cpus - 1} > ${libraryid}.unmapped.fastq.gz rm ${libraryid}.unmapped.bam """ } else if ( "${params.bam_unmapped_type}" == "both" && params.bam_filter_minreadlength != 0 ){ """ samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam filter_bam_fragment_length.py -a -l 
${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam samtools index ${libraryid}.filtered.bam ${size} ## FASTQ samtools fastq -tN ${libraryid}.unmapped.bam | pigz -p ${task.cpus} > ${libraryid}.unmapped.fastq.gz """ } } // samtools_filter bypass in case not run if (params.run_bam_filtering) { ch_seqtypemerged_for_skipfiltering.mix(ch_output_from_filtering) .filter { it =~/.*filtered.bam/ } .into { ch_filtering_for_skiprmdup; ch_filtering_for_dedup; ch_filtering_for_markdup; ch_filtering_for_flagstat; ch_skiprmdup_for_libeval; ch_mapped_for_preseq } } else { ch_seqtypemerged_for_skipfiltering .into { ch_filtering_for_skiprmdup; ch_filtering_for_dedup; ch_filtering_for_markdup; ch_filtering_for_flagstat; ch_skiprmdup_for_libeval; ch_mapped_for_preseq } } // Post filtering mapping QC - particularly to help see how much was removed from mapping quality filtering process samtools_flagstat_after_filter { label 'sc_tiny' tag "$libraryid" publishDir "${params.outdir}/samtools/filtered_stats", mode: params.publish_dir_mode when: params.run_bam_filtering input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_filtering_for_flagstat output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.stats") into ch_bam_filtered_flagstat_for_multiqc, ch_bam_filtered_flagstat_for_endorspy script: """ samtools flagstat $bam > ${libraryid}_postfilterflagstat.stats """ } if (params.run_bam_filtering) { ch_flagstat_for_endorspy .join(ch_bam_filtered_flagstat_for_endorspy, by: [0,1,2,3,4,5,6]) .set{ ch_allflagstats_for_endorspy } } else { // Add a file entry to match expected no. 
tuple elements for endorS.py even if not giving second file ch_flagstat_for_endorspy .map { it -> def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def stats = file(it[7]) def poststats = file("$projectDir/assets/nf-core_eager_dummy.txt") [samplename, libraryid, lane, seqtype, organism, strandedness, udg, stats, poststats ] } .set{ ch_allflagstats_for_endorspy } } // Endogenous DNA calculator to say how much of a library contained 'on-target' DNA process endorSpy { label 'sc_tiny' tag "$libraryid" publishDir "${params.outdir}/endorspy", mode: params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(stats), path(poststats) from ch_allflagstats_for_endorspy output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_endorspy_for_multiqc script: if (params.run_bam_filtering) { """ endorS.py -o json -n ${libraryid} ${stats} ${poststats} """ } else { """ endorS.py -o json -n ${libraryid} ${stats} """ } } // Post-mapping PCR amplicon removal because these lab artefacts inflate coverage statistics process dedup{ label 'mc_small' tag "${libraryid}" publishDir "${params.outdir}/deduplication/", mode: params.publish_dir_mode, saveAs: {filename -> "${libraryid}/$filename"} when: !params.skip_deduplication && params.dedupper == 'dedup' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_filtering_for_dedup output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.hist") into ch_hist_for_preseq tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_dedup_results_for_multiqc tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${libraryid}_rmdup.bam"), path("*.{bai,csi}") into ch_output_from_dedup, ch_dedup_for_libeval script: def 
treat_merged = params.dedup_all_merged ? '-m' : '' def size = params.large_ref ? '-c' : '' if ( bam.baseName != libraryid ) { // To make sure direct BAMs have a clean name """ mv ${bam} ${libraryid}.bam dedup -Xmx${task.memory.toGiga()}g -i ${libraryid}.bam $treat_merged -o . -u mv *.log dedup.log samtools sort -@ ${task.cpus} "${libraryid}"_rmdup.bam -o "${libraryid}"_rmdup.bam samtools index "${libraryid}"_rmdup.bam ${size} """ } else { """ dedup -Xmx${task.memory.toGiga()}g -i ${libraryid}.bam $treat_merged -o . -u mv *.log dedup.log samtools sort -@ ${task.cpus} "${libraryid}"_rmdup.bam -o "${libraryid}"_rmdup.bam samtools index "${libraryid}"_rmdup.bam ${size} """ } } process markduplicates{ label 'mc_small' tag "${libraryid}" publishDir "${params.outdir}/deduplication/", mode: params.publish_dir_mode, saveAs: {filename -> "${libraryid}/$filename"} when: !params.skip_deduplication && params.dedupper == 'markduplicates' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_filtering_for_markdup output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.metrics") into ch_markdup_results_for_multiqc tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${libraryid}_rmdup.bam"), path("*.{bai,csi}") into ch_output_from_markdup, ch_markdup_for_libeval script: def size = params.large_ref ? 
'-c' : ''

    if ( bam.baseName != libraryid ) { // To make sure direct BAMs have a clean name
    """
    mv ${bam} ${libraryid}.bam
    picard -Xmx${task.memory.toMega()}M MarkDuplicates INPUT=${libraryid}.bam OUTPUT=${libraryid}_rmdup.bam REMOVE_DUPLICATES=TRUE AS=TRUE METRICS_FILE="${libraryid}_rmdup.metrics" VALIDATION_STRINGENCY=SILENT
    samtools index ${libraryid}_rmdup.bam ${size}
    """
    } else {
    """
    picard -Xmx${task.memory.toMega()}M MarkDuplicates INPUT=${libraryid}.bam OUTPUT=${libraryid}_rmdup.bam REMOVE_DUPLICATES=TRUE AS=TRUE METRICS_FILE="${libraryid}_rmdup.metrics" VALIDATION_STRINGENCY=SILENT
    samtools index ${libraryid}_rmdup.bam ${size}
    """
    }
}

// This is for post-deduplication per-library evaluation steps _without_ any
// form of library merging.
if ( params.skip_deduplication ) {
  ch_skiprmdup_for_libeval.mix(ch_dedup_for_libeval, ch_markdup_for_libeval)
    .into{ ch_rmdup_for_preseq; ch_rmdup_for_damageprofiler; ch_rmdup_for_mapdamage; ch_for_nuclear_contamination; ch_rmdup_formtnucratio }
} else {
  ch_dedup_for_libeval.mix(ch_markdup_for_libeval)
    .into{ ch_rmdup_for_preseq; ch_rmdup_for_damageprofiler; ch_rmdup_for_mapdamage; ch_for_nuclear_contamination; ch_rmdup_formtnucratio }
}

// Merge independent libraries that have the same treatment and the same _sample_
// name (multiple libraries are often sequenced to improve complexity). Libraries
// with different strandedness/UDG treatment are not merged because
// bamtrim/pmdtools/genotyping need that info.
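// Illustrative sketch only (hypothetical sample/library names, not pipeline code):
// grouping on samplename, organism, strandedness and udg (tuple indices 0,4,5,6)
// collects every other field into a list, so the size of the BAM list at index 7
// says how many libraries share a sample and treatment.
//
//   Channel.from(
//       ['S1', 'Lib1', 0, 'PE', 'human', 'double', 'none', 'a.bam', 'a.bai'],
//       ['S1', 'Lib2', 0, 'PE', 'human', 'double', 'none', 'b.bam', 'b.bai'],
//       ['S2', 'Lib3', 0, 'SE', 'human', 'double', 'none', 'c.bam', 'c.bai'] )
//     .groupTuple(by: [0,4,5,6])
//     .branch {
//         clean_libraryid: it[7].size() == 1   // S2: one library, bypasses merging
//         merge_me:        it[7].size() > 1    // S1: two libraries, sent on for merging
//     }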
// Step one: work out which are single libraries (from skipping rmdup and both dedups) that do not need merging, and pass them to a skip channel
if ( params.skip_deduplication ) {
  ch_input_for_librarymerging = ch_filtering_for_skiprmdup
    .groupTuple(by:[0,4,5,6])
    .branch{
      clean_libraryid: it[7].size() == 1
      merge_me: it[7].size() > 1
    }
} else {
  ch_input_for_librarymerging = ch_output_from_dedup.mix(ch_output_from_markdup)
    .groupTuple(by:[0,4,5,6])
    .branch{
      clean_libraryid: it[7].size() == 1
      merge_me: it[7].size() > 1
    }
}

// For non-merging libraries, fix grouped libraryIDs into single values.
// This is a bit hacky as the grouped values could theoretically differ, but this
// should rarely be the case.
ch_input_for_librarymerging.clean_libraryid
  .map{
    it ->
      def libraryid = it[1][0]
      def bam = it[7].flatten()
      def bai = it[8].flatten()

      [it[0], libraryid, it[2], it[3], it[4], it[5], it[6], bam, bai ]
    }
  .set { ch_input_for_skiplibrarymerging }

ch_input_for_librarymerging.merge_me
  .map{
    it ->
      def libraryid = it[1][0]
      def seqtype = "merged"
      def bam = it[7].flatten()
      def bai = it[8].flatten()

      [it[0], libraryid, it[2], seqtype, it[4], it[5], it[6], bam, bai ]
    }
  .set { ch_fixedinput_for_librarymerging }

process library_merge {
  label 'sc_tiny'
  tag "${samplename}"
  publishDir "${params.outdir}/merged_bams/initial", mode: params.publish_dir_mode

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_fixedinput_for_librarymerging

  output:
  tuple samplename, val("${samplename}_libmerged"), lane, seqtype, organism, strandedness, udg, path("*_libmerged_rmdup.bam"), path("*_libmerged_rmdup.bam.{bai,csi}") into ch_output_from_librarymerging

  script:
  def size = params.large_ref ?
'-c' : '' """ samtools merge ${samplename}_udg${udg}_libmerged_rmdup.bam ${bam} samtools index ${samplename}_udg${udg}_libmerged_rmdup.bam ${size} """ } // Mix back in libraries from skipping dedup, skipping library merging if (!params.skip_deduplication) { ch_input_for_skiplibrarymerging.mix(ch_output_from_librarymerging) .filter { it =~/.*_rmdup.bam/ } .into { ch_rmdup_for_skipdamagemanipulation; ch_rmdup_for_pmdtools; ch_rmdup_for_bamutils; ch_rmdup_for_bedtools; ch_rmdup_for_damagerescaling } } else { ch_input_for_skiplibrarymerging.mix(ch_output_from_librarymerging) .into { ch_rmdup_for_skipdamagemanipulation; ch_rmdup_for_pmdtools; ch_rmdup_for_bamutils; ch_rmdup_for_bedtools; ch_rmdup_for_damagerescaling } } ////////////////////////////////////////////////// /* -- POST DEDUPLICATION EVALUATION -- */ ////////////////////////////////////////////////// // Library complexity calculation from mapped reads - could a user cost-effectively sequence deeper for more unique information? if ( params.skip_deduplication ) { ch_input_for_preseq = ch_rmdup_for_preseq.map{ it[0,1,2,3,4,5,6,7] } } else if ( !params.skip_deduplication && params.dedupper == "markduplicates" ) { ch_input_for_preseq = ch_mapped_for_preseq.map{ it[0,1,2,3,4,5,6,7] } } else if ( !params.skip_deduplication && params.dedupper == "dedup" ) { ch_input_for_preseq = ch_hist_for_preseq } process preseq { label 'sc_tiny' tag "${libraryid}" publishDir "${params.outdir}/preseq", mode: params.publish_dir_mode when: !params.skip_preseq input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(input) from ch_input_for_preseq output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${input.baseName}.preseq") into ch_preseq_for_multiqc script: pe_mode = params.skip_collapse && seqtype == "PE" ? 
'-P' : '' if(!params.skip_deduplication && params.preseq_mode == 'c_curve' && params.dedupper == "dedup"){ """ preseq c_curve -s ${params.preseq_step_size} -o ${input.baseName}.preseq -H ${input} """ } else if( !params.skip_deduplication && params.preseq_mode == 'c_curve' && params.dedupper == "markduplicates"){ """ preseq c_curve -s ${params.preseq_step_size} -o ${input.baseName}.preseq -B ${input} ${pe_mode} """ } else if ( params.skip_deduplication && params.preseq_mode == 'c_curve' ) { """ preseq c_curve -s ${params.preseq_step_size} -o ${input.baseName}.preseq -B ${input} ${pe_mode} """ } else if(!params.skip_deduplication && params.preseq_mode == 'lc_extrap' && params.dedupper == "dedup"){ """ preseq lc_extrap -s ${params.preseq_step_size} -o ${input.baseName}.preseq -H ${input} -n ${params.preseq_bootstrap} -e ${params.preseq_maxextrap} -cval ${params.preseq_cval} -x ${params.preseq_terms} """ } else if( !params.skip_deduplication && params.preseq_mode == 'lc_extrap' && params.dedupper == "markduplicates"){ """ preseq lc_extrap -s ${params.preseq_step_size} -o ${input.baseName}.preseq -B ${input} ${pe_mode} -n ${params.preseq_bootstrap} -e ${params.preseq_maxextrap} -cval ${params.preseq_cval} -x ${params.preseq_terms} """ } else if ( params.skip_deduplication && params.preseq_mode == 'lc_extrap' ) { """ preseq lc_extrap -s ${params.preseq_step_size} -o ${input.baseName}.preseq -B ${input} ${pe_mode} -n ${params.preseq_bootstrap} -e ${params.preseq_maxextrap} -cval ${params.preseq_cval} -x ${params.preseq_terms} """ } } // Optional mapping statistics for specific annotations - e.g. 
genes in bacterial genome process bedtools { label 'mc_small' tag "${libraryid}" publishDir "${params.outdir}/bedtools", mode: params.publish_dir_mode when: params.run_bedtools_coverage input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_bedtools file anno_file from ch_anno_for_bedtools.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*") script: sorting_of_anno = params.anno_file_is_unsorted ? "" : "-sorted" """ ## Create genome file from bam header samtools view -H ${bam} | grep '@SQ' | sed 's#@SQ\tSN:\\|LN:##g' > genome.txt ## Run bedtools bedtools coverage -nonamecheck -g genome.txt ${sorting_of_anno} -a ${anno_file} -b ${bam} | pigz -p ${task.cpus - 1} > "${bam.baseName}".breadth.gz bedtools coverage -nonamecheck -g genome.txt ${sorting_of_anno} -a ${anno_file} -b ${bam} -mean | pigz -p ${task.cpus - 1} > "${bam.baseName}".depth.gz """ } ////////////////////////////////////////////////////////////// /* -- ANCIENT DNA EVALUATION AND BAM MODIFICATION -- */ ////////////////////////////////////////////////////////////// // Calculate typical aDNA damage frequency distribution with DamageProfiler process damageprofiler { label 'sc_small' tag "${libraryid}" publishDir "${params.outdir}/damageprofiler", mode: params.publish_dir_mode when: !params.skip_damage_calculation && params.damage_calculation_tool == 'damageprofiler' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_damageprofiler file fasta from ch_fasta_for_damageprofiler.collect() file fai from ch_fai_for_damageprofiler.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${base}/*.txt") optional true tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${base}/*.log") tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, 
path("${base}/*.pdf") optional true tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${base}/*.json") optional true into ch_damageprofiler_results script: base = "${bam.baseName}" """ damageprofiler -Xmx${task.memory.toGiga()}g -i $bam -r $fasta -l ${params.damageprofiler_length} -t ${params.damageprofiler_threshold} -o . -yaxis_damageplot ${params.damageprofiler_yaxis} """ } // Calculate typical aDNA damage frequency distribution with mapDamage process mapdamage_calculation { label 'sc_small' tag "${libraryid}" publishDir "${params.outdir}/mapdamage", mode: params.publish_dir_mode when: !params.skip_damage_calculation && params.damage_calculation_tool == 'mapdamage' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_mapdamage file fasta from ch_fasta_for_mapdamage.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("results_${base}") into ch_output_from_mapdamage path ("results_${base}") into ch_mapdamage_for_multiqc script: base = "${bam.baseName}" def singlestranded = strandedness == "single" ? '--single-stranded' : '' def downsample = params.mapdamage_downsample != 0 ? 
"-n ${params.mapdamage_downsample} --downsample-seed=1" : '' // Include seed to make results consistent between runs """ mapDamage -i ${bam} -r ${fasta} ${singlestranded} ${downsample} --ymax=${params.mapdamage_yaxis} --no-stats """ } // Damage rescaling with mapDamage process mapdamage_rescaling { label 'sc_small' tag "${libraryid}" publishDir "${params.outdir}/damage_rescaling", mode: params.publish_dir_mode when: params.run_mapdamage_rescaling input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_damagerescaling file fasta from ch_fasta_for_damagerescaling.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_rescaled.bam"), path("*rescaled.bam.{bai,csi}") into ch_output_from_damagerescaling script: def base = "${bam.baseName}" def singlestranded = strandedness == "single" ? '--single-stranded' : '' def size = params.large_ref ? '-c' : '' def rescale_length_3p = params.rescale_length_3p != 0 ? "--rescale-length-3p=${params.rescale_length_3p}" : "" def rescale_length_5p = params.rescale_length_5p != 0 ? "--rescale-length-5p=${params.rescale_length_5p}" : "" """ mapDamage -i ${bam} -r ${fasta} --rescale --rescale-out="${base}_rescaled.bam" --seq-length=${params.rescale_seqlength} ${rescale_length_5p} ${rescale_length_3p} ${singlestranded} samtools index ${base}_rescaled.bam ${size} """ } // Optionally perform further aDNA evaluation or filtering for just reads with damage etc. 
process mask_reference_for_pmdtools { label 'sc_tiny' tag "${fasta}" publishDir "${params.outdir}/reference_genome/masked_reference", mode: params.publish_dir_mode when: (params.pmdtools_reference_mask && params.run_pmdtools) input: file fasta from ch_unmasked_fasta_for_masking file bedfile from ch_bedfile_for_reference_masking output: file "${fasta.baseName}_masked.fa" into ch_masked_fasta_for_pmdtools script: log.info "[nf-core/eager]: Masking reference \'${fasta}\' at positions found in \'${bedfile}\'. Masked reference will be used for pmdtools." """ bedtools maskfasta -fi ${fasta} -bed ${bedfile} -fo ${fasta.baseName}_masked.fa """ } // If masking was requested, use masked reference for pmdtools, else use original reference if (params.pmdtools_reference_mask) { ch_masked_fasta_for_pmdtools.set{ch_fasta_for_pmdtools} } else { ch_unmasked_fasta_for_pmdtools.set{ch_fasta_for_pmdtools} } process pmdtools { label 'mc_medium' tag "${libraryid}" publishDir "${params.outdir}/pmdtools", mode: params.publish_dir_mode when: params.run_pmdtools input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_pmdtools file fasta from ch_fasta_for_pmdtools.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.pmd.bam"), path("*.pmd.bam.{bai,csi}") into ch_output_from_pmdtools file "*.cpg.range*.txt" script: //Check which treatment for the libraries was used ('none' is a non-empty string, so it must be compared explicitly) def treatment = (!udg || udg == 'none') ? '--UDGminus' : (udg == 'half' ? '--UDGhalf' : '--CpG') def size = params.large_ref ? '-c' : '' def platypus = params.pmdtools_platypus ? '--platypus' : '' """ #Run Filtering step samtools calmd ${bam} ${fasta} | pmdtools --threshold ${params.pmdtools_threshold} ${treatment} --header | samtools view -Sb - > "${libraryid}".pmd.bam #Run Calc Range step ## To allow early shut off of pipe: https://github.com/nextflow-io/nextflow/issues/1564 trap 'if [[ \$?
== 141 ]]; then echo "Shutting samtools early due to -n parameter" && samtools index ${libraryid}.pmd.bam ${size}; exit 0; fi' EXIT samtools calmd ${bam} ${fasta} | pmdtools --deamination ${platypus} --range ${params.pmdtools_range} ${treatment} -n ${params.pmdtools_max_reads} > "${libraryid}".cpg.range."${params.pmdtools_range}".txt samtools index ${libraryid}.pmd.bam ${size} """ } // BAM Trimming for just non-UDG or half-UDG libraries to remove damage prior to genotyping if ( params.run_trim_bam ) { // You wouldn't want to make UDG treated reads even shorter, so skip trimming if full-UDG. // We assume same trim amount for both non-UDG/UDG half as could trim a bit more off half-UDG to match non-UDG if needed, with minimal effect // Note: e.g. adapters are sequencing artefacts and should be removed before mapping, so we don't account for them here. ch_bamutils_decision = ch_rmdup_for_bamutils.branch{ totrim: it[6] == 'none' || it[6] == 'half' notrim: it[6] == 'full' } } else { ch_bamutils_decision = ch_rmdup_for_bamutils.branch{ totrim: it[6] == "dummy" notrim: it[6] == 'full' || it[6] == 'none' || it[6] == 'half' } } process bam_trim { label 'mc_small' tag "${libraryid}" publishDir "${params.outdir}/trimmed_bam", mode: params.publish_dir_mode when: params.run_trim_bam input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_bamutils_decision.totrim output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.trimmed.bam"), path("*.trimmed.bam.{bai,csi}") into ch_trimmed_from_bamutils script: def softclip = params.bamutils_softclip ? '-c' : '' def size = params.large_ref ? '-c' : '' def left_clipping = strandedness == "double" ? (udg == "half" ? "${params.bamutils_clip_double_stranded_half_udg_left}" : "${params.bamutils_clip_double_stranded_none_udg_left}") : (udg == "half" ?
"${params.bamutils_clip_single_stranded_half_udg_left}" : "${params.bamutils_clip_single_stranded_none_udg_left}") def right_clipping = strandedness == "double" ? (udg == "half" ? "${params.bamutils_clip_double_stranded_half_udg_right}" : "${params.bamutils_clip_double_stranded_none_udg_right}") : (udg == "half" ? "${params.bamutils_clip_single_stranded_half_udg_right}" : "${params.bamutils_clip_single_stranded_none_udg_right}") // def left_clipping = udg == "half" ? "${params.bamutils_clip_half_udg_left}" : "${params.bamutils_clip_none_udg_left}" // def right_clipping = udg == "half" ? "${params.bamutils_clip_half_udg_right}" : "${params.bamutils_clip_none_udg_right}" """ bam trimBam $bam tmp.bam -L ${left_clipping} -R ${right_clipping} ${softclip} samtools sort -@ ${task.cpus} tmp.bam -o ${libraryid}_udg${udg}.trimmed.bam samtools index ${libraryid}_udg${udg}.trimmed.bam ${size} """ } // Post-trimming merging of libraries to single samples, except for SS/DS // libraries, which should be genotyped separately. We assume // that if trimming is turned on, 'lab-removed' libraries can be // merged with 'in-silico damage removed' libraries to improve genotyping ch_trimmed_formerge = ch_bamutils_decision.notrim .mix(ch_trimmed_from_bamutils) .groupTuple(by:[0,4,5]) .map{ def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def bam = it[7].flatten() def bai = it[8].flatten() [samplename, libraryid, lane, seqtype, organism, strandedness, udg, bam, bai ] } .branch{ skip_merging: it[7].size() == 1 merge_me: it[7].size() > 1 } ////////////////////////////////////////////////////////////////////////// /* -- POST aDNA BAM MODIFICATION AND FINAL MAPPING STATISTICS -- */ ////////////////////////////////////////////////////////////////////////// process additional_library_merge { label 'sc_tiny' tag "${samplename}" publishDir
"${params.outdir}/merged_bams/additional", mode: params.publish_dir_mode input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_trimmed_formerge.merge_me output: tuple samplename, val("${samplename}_libmerged"), lane, seqtype, organism, strandedness, udg, path("*_libmerged_add.bam"), path("*_libmerged_add.bam.{bai,csi}") into ch_output_from_trimmerge script: def size = params.large_ref ? '-c' : '' """ samtools merge ${samplename}_libmerged_add.bam ${bam} samtools index ${samplename}_libmerged_add.bam ${size} """ } ch_trimmed_formerge.skip_merging .mix(ch_output_from_trimmerge) .into{ ch_output_from_bamutils; ch_addlibmerge_for_qualimap; ch_for_sexdeterrmine_prep } // General mapping quality statistics for whole reference sequence - e.g. X and % coverage process qualimap { label 'mc_small' tag "${samplename}" publishDir "${params.outdir}/qualimap", mode: params.publish_dir_mode when: !params.skip_qualimap input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_addlibmerge_for_qualimap file fasta from ch_fasta_for_qualimap.collect() path snpcapture_bed from ch_snpcapture_bed output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*") into ch_qualimap_results script: def snpcap = snpcapture_bed.getName() != 'nf-core_eager_dummy.txt' ? "-gff ${snpcapture_bed}" : '' """ qualimap bamqc -bam $bam -nt ${task.cpus} -outdir . 
-outformat "HTML" ${snpcap} --java-mem-size=${task.memory.toGiga()}G """ } ///////////////////////////// /* -- GENOTYPING -- */ ///////////////////////////// // Reroute files for genotyping; we have to ensure to select lib-merged BAMs, as input channel will also contain the un-merged ones resulting in unwanted multi-sample VCFs if ( params.run_genotyping && params.genotyping_source == 'raw' ) { ch_output_from_bamutils .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd } } else if ( params.run_genotyping && params.genotyping_source == "trimmed" && !params.run_trim_bam ) { exit 1, "[nf-core/eager] error: Cannot run genotyping with 'trimmed' source without running BAM trimming (--run_trim_bam)! Please check input parameters." } else if ( params.run_genotyping && params.genotyping_source == "trimmed" && params.run_trim_bam ) { ch_output_from_bamutils .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd } } else if ( params.run_genotyping && params.genotyping_source == "pmd" && !params.run_pmdtools ) { exit 1, "[nf-core/eager] error: Cannot run genotyping with 'pmd' source without running pmdtools (--run_pmdtools)! Please check input parameters." 
} else if ( params.run_genotyping && params.genotyping_source == "pmd" && params.run_pmdtools ) { ch_output_from_pmdtools .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd } } else if ( params.run_genotyping && params.genotyping_source == "rescaled" && params.run_mapdamage_rescaling) { ch_output_from_damagerescaling .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd } } else if ( params.run_genotyping && params.genotyping_source == "rescaled" && !params.run_mapdamage_rescaling) { exit 1, "[nf-core/eager] error: Cannot run genotyping with 'rescaled' source without running damage rescaling (--run_mapdamage_rescaling)! Please check input parameters."
} else if ( !params.run_genotyping && !params.run_trim_bam && !params.run_pmdtools ) { ch_rmdup_for_skipdamagemanipulation .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd } } else if ( !params.run_genotyping && !params.run_trim_bam && params.run_pmdtools ) { ch_rmdup_for_skipdamagemanipulation .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd } } else if ( !params.run_genotyping && params.run_trim_bam && !params.run_pmdtools ) { ch_rmdup_for_skipdamagemanipulation .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd } } // replace readgroups to ensure single 'sample' per VCF for MultiVCFAnalyzer only process picard_addorreplacereadgroups { label 'sc_tiny' tag "${samplename}" when: params.run_genotyping && params.genotyping_tool == 'ug' && params.run_multivcfanalyzer input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_damagemanipulation_for_readgroupreplacement output: tuple samplename, val("${samplename}"), lane, seqtype, organism, strandedness, udg, path("*rg.bam"), path("*rg.bam.{bai,csi}") into ch_readgroup_replacement_for_ug script: def size = params.large_ref ? 
'-c' : '' """ picard -Xmx${task.memory.toGiga()}g AddOrReplaceReadGroups I=${bam} O=${samplename}_rg.bam RGID=1 RGLB="${samplename}_rg" RGPL=illumina RGPU=4410 RGSM="${samplename}_rg" VALIDATION_STRINGENCY=LENIENT samtools index ${samplename}_rg.bam ${size} """ } if ( params.run_genotyping && params.genotyping_tool == 'ug' && params.run_multivcfanalyzer ) { ch_input_for_ug = ch_readgroup_replacement_for_ug } else { ch_input_for_ug = ch_damagemanipulation_for_genotyping_ug } // Unified Genotyper - although not-supported, better for aDNA (because HC does de novo assembly which requires higher coverages), and needed for MultiVCFAnalyzer process genotyping_ug { label 'mc_small' tag "${samplename}" publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode, pattern: '*{.vcf.gz,.realign.bam,realign.bai}' when: params.run_genotyping && params.genotyping_tool == 'ug' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_input_for_ug file fasta from ch_fasta_for_genotyping_ug.collect() file fai from ch_fai_for_ug.collect() file dict from ch_dict_for_ug.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*vcf.gz") into ch_ug_for_multivcfanalyzer,ch_ug_for_vcf2genome,ch_ug_for_bcftools_stats tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.realign.{bam,bai}") optional true script: def defaultbasequalities = !params.gatk_ug_defaultbasequalities ? '' : " --defaultBaseQualities ${params.gatk_ug_defaultbasequalities}" def keep_realign = params.gatk_ug_keep_realign_bam ? 
"samtools index ${samplename}.realign.bam" : "rm ${samplename}.realign.{bam,bai}" if (!params.gatk_dbsnp) """ samtools index -b ${bam} gatk3 -Xmx${task.memory.toGiga()}g -T RealignerTargetCreator -R ${fasta} -I ${bam} -nt ${task.cpus} -o ${samplename}.intervals ${defaultbasequalities} gatk3 -Xmx${task.memory.toGiga()}g -T IndelRealigner -R ${fasta} -I ${bam} -targetIntervals ${samplename}.intervals -o ${samplename}.realign.bam ${defaultbasequalities} gatk3 -Xmx${task.memory.toGiga()}g -T UnifiedGenotyper -R ${fasta} -I ${samplename}.realign.bam -o ${samplename}.unifiedgenotyper.vcf -nt ${task.cpus} --genotype_likelihoods_model ${params.gatk_ug_genotype_model} -stand_call_conf ${params.gatk_call_conf} --sample_ploidy ${params.gatk_ploidy} -dcov ${params.gatk_downsample} --output_mode ${params.gatk_ug_out_mode} ${defaultbasequalities} $keep_realign bgzip -@ ${task.cpus} ${samplename}.unifiedgenotyper.vcf """ else if (params.gatk_dbsnp) """ samtools index ${bam} gatk3 -Xmx${task.memory.toGiga()}g -T RealignerTargetCreator -R ${fasta} -I ${bam} -nt ${task.cpus} -o ${samplename}.intervals ${defaultbasequalities} gatk3 -Xmx${task.memory.toGiga()}g -T IndelRealigner -R ${fasta} -I ${bam} -targetIntervals ${samplename}.intervals -o ${samplename}.realign.bam ${defaultbasequalities} gatk3 -Xmx${task.memory.toGiga()}g -T UnifiedGenotyper -R ${fasta} -I ${samplename}.realign.bam -o ${samplename}.unifiedgenotyper.vcf -nt ${task.cpus} --dbsnp ${params.gatk_dbsnp} --genotype_likelihoods_model ${params.gatk_ug_genotype_model} -stand_call_conf ${params.gatk_call_conf} --sample_ploidy ${params.gatk_ploidy} -dcov ${params.gatk_downsample} --output_mode ${params.gatk_ug_out_mode} ${defaultbasequalities} $keep_realign bgzip -@ ${task.cpus} ${samplename}.unifiedgenotyper.vcf """ } // HaplotypeCaller as 'best practise' tool for human DNA in particular process genotyping_hc { label 'mc_small' tag "${samplename}" publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode when: 
params.run_genotyping && params.genotyping_tool == 'hc' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_damagemanipulation_for_genotyping_hc file fasta from ch_fasta_for_genotyping_hc.collect() file fai from ch_fai_for_hc.collect() file dict from ch_dict_for_hc.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*vcf.gz") into ch_hc_for_bcftools_stats script: if (!params.gatk_dbsnp) """ gatk HaplotypeCaller --java-options "-Xmx${task.memory.toGiga()}G" -R ${fasta} -I ${bam} -O ${samplename}.haplotypecaller.vcf -stand-call-conf ${params.gatk_call_conf} --sample-ploidy ${params.gatk_ploidy} --output-mode ${params.gatk_hc_out_mode} --emit-ref-confidence ${params.gatk_hc_emitrefconf} bgzip -@ ${task.cpus} ${samplename}.haplotypecaller.vcf """ else if (params.gatk_dbsnp) """ gatk HaplotypeCaller --java-options "-Xmx${task.memory.toGiga()}G" -R ${fasta} -I ${bam} -O ${samplename}.haplotypecaller.vcf --dbsnp ${params.gatk_dbsnp} -stand-call-conf ${params.gatk_call_conf} --sample-ploidy ${params.gatk_ploidy} --output-mode ${params.gatk_hc_out_mode} --emit-ref-confidence ${params.gatk_hc_emitrefconf} bgzip -@ ${task.cpus} ${samplename}.haplotypecaller.vcf """ } // Freebayes for 'more efficient/simple' and more generic genotyping (vs HC) process genotyping_freebayes { label 'mc_small' tag "${samplename}" publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode when: params.run_genotyping && params.genotyping_tool == 'freebayes' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_damagemanipulation_for_genotyping_freebayes file fasta from ch_fasta_for_genotyping_freebayes.collect() file fai from ch_fai_for_freebayes.collect() file dict from ch_dict_for_freebayes.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*vcf.gz") into ch_fb_for_bcftools_stats
script: def skip_coverage = params.freebayes_g.toString() == '0' ? "" : "-g ${params.freebayes_g}" """ freebayes -f ${fasta} -p ${params.freebayes_p} -C ${params.freebayes_C} ${skip_coverage} ${bam} > ${samplename}.freebayes.vcf bgzip -@ ${task.cpus} ${samplename}.freebayes.vcf """ } // Branch channel by strandedness ch_damagemanipulation_for_genotyping_pileupcaller .branch{ singleStranded: it[5] == "single" doubleStranded: it[5] == "double" } .set{ch_input_for_genotyping_pileupcaller} // Create pileupcaller input tuples ch_input_for_genotyping_pileupcaller.singleStranded .groupTuple(by:[5]) .map{ def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def bam = it[7].flatten() def bai = it[8].flatten() [samplename, libraryid, lane, seqtype, organism, strandedness, udg, bam, bai ] } .set {ch_prepped_for_pileupcaller_single} ch_input_for_genotyping_pileupcaller.doubleStranded .groupTuple(by:[5]) .map{ def samplename = it[0] def libraryid = it[1] def lane = it[2] def seqtype = it[3] def organism = it[4] def strandedness = it[5] def udg = it[6] def bam = it[7].flatten() def bai = it[8].flatten() [samplename, libraryid, lane, seqtype, organism, strandedness, udg, bam, bai ] } .set {ch_prepped_for_pileupcaller_double} process genotyping_pileupcaller { label 'mc_small' tag "${strandedness}" publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode when: params.run_genotyping && params.genotyping_tool == 'pileupcaller' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_prepped_for_pileupcaller_double.mix(ch_prepped_for_pileupcaller_single) file fasta from ch_fasta_for_genotyping_pileupcaller.collect() file fai from ch_fai_for_pileupcaller.collect() file dict from ch_dict_for_pileupcaller.collect() path(bed) from ch_bed_for_pileupcaller.collect() path(snp) from ch_snp_for_pileupcaller.collect() output: tuple samplename,
libraryid, lane, seqtype, organism, strandedness, udg, path("pileupcaller.${strandedness}.*") into ch_for_eigenstrat_snp_coverage script: def use_bed = bed.getName() != 'nf-core_eager_dummy.txt' ? "-l ${bed}" : '' def use_snp = snp.getName() != 'nf-core_eager_dummy2.txt' ? "-f ${snp}" : '' def transitions_mode = strandedness == "single" ? "" : "${params.pileupcaller_transitions_mode}" == 'SkipTransitions' ? "--skipTransitions" : "${params.pileupcaller_transitions_mode}" == 'TransitionsMissing' ? "--transitionsMissing" : "" def caller = "--${params.pileupcaller_method}" def ssmode = strandedness == "single" ? "--singleStrandMode" : "" def bam_list = bam.flatten().join(" ") def sample_names = samplename.flatten().join(",") def map_q = params.pileupcaller_min_map_quality def base_q = params.pileupcaller_min_base_quality """ samtools mpileup -B --ignore-RG -q ${map_q} -Q ${base_q} ${use_bed} -f ${fasta} ${bam_list} | pileupCaller ${caller} ${ssmode} ${transitions_mode} --sampleNames ${sample_names} ${use_snp} -e pileupcaller.${strandedness} """ } process eigenstrat_snp_coverage { label 'mc_tiny' tag "${strandedness}" publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode when: params.run_genotyping && params.genotyping_tool == 'pileupcaller' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*") from ch_for_eigenstrat_snp_coverage output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_eigenstrat_snp_cov_for_multiqc path("*_eigenstrat_coverage.txt") script: /* The following code block can be swapped in once the eigenstratdatabasetools MultiQC module becomes available. 
""" eigenstrat_snp_coverage -i pileupcaller.${strandedness} >${strandedness}_eigenstrat_coverage.txt -j ${strandedness}_eigenstrat_coverage_mqc.json """ */ """ eigenstrat_snp_coverage -i pileupcaller.${strandedness} >${strandedness}_eigenstrat_coverage.txt parse_snp_cov.py ${strandedness}_eigenstrat_coverage.txt """ } process genotyping_angsd { label 'mc_small' tag "${samplename}" publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode when: params.run_genotyping && params.genotyping_tool == 'angsd' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_damagemanipulation_for_genotyping_angsd file fasta from ch_fasta_for_genotyping_angsd.collect() file fai from ch_fai_for_angsd.collect() file dict from ch_dict_for_angsd.collect() output: path("${samplename}*") script: switch ( "${params.angsd_glmodel}" ) { case "samtools": angsd_glmodel = "1"; break case "gatk": angsd_glmodel = "2"; break case "soapsnp": angsd_glmodel = "3"; break case "syk": angsd_glmodel = "4"; break } switch ( "${params.angsd_glformat}" ) { case "text": angsd_glformat = "4"; break case "binary": angsd_glformat = "1"; break case "beagle": angsd_glformat = "2"; break case "binary_three": angsd_glformat = "3"; break } def angsd_fasta = !params.angsd_createfasta ? '' : params.angsd_fastamethod == 'random' ? '-doFasta 1 -doCounts 1' : '-doFasta 2 -doCounts 1' def angsd_majorminor = params.angsd_glformat != "beagle" ? 
'' : '-doMajorMinor 1' """ echo ${bam} > bam.filelist mkdir angsd angsd -bam bam.filelist -nThreads ${task.cpus} -GL ${angsd_glmodel} -doGlF ${angsd_glformat} ${angsd_majorminor} ${angsd_fasta} -out ${samplename}.angsd """ } //////////////////////////////////// /* -- GENOTYPING STATS -- */ //////////////////////////////////// process bcftools_stats { label 'mc_small' tag "${samplename}" publishDir "${params.outdir}/bcftools/stats", mode: params.publish_dir_mode when: params.run_bcftools_stats input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(vcf) from ch_ug_for_bcftools_stats.mix(ch_hc_for_bcftools_stats,ch_fb_for_bcftools_stats) file fasta from ch_fasta_for_bcftools_stats.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.vcf.stats") into ch_bcftools_stats_for_multiqc script: """ bcftools stats *.vcf.gz -F ${fasta} > ${samplename}.vcf.stats """ } //////////////////////////////////// /* -- CONSENSUS CALLING -- */ //////////////////////////////////// // Generate a simple consensus-called FASTA file based on genotype VCF process vcf2genome { label 'mc_small' tag "${samplename}" publishDir "${params.outdir}/consensus_sequence", mode: params.publish_dir_mode when: params.run_vcf2genome input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(vcf) from ch_ug_for_vcf2genome file fasta from ch_fasta_for_vcf2genome.collect() output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.fasta.gz") script: def out = !params.vcf2genome_outfile ? "${samplename}.fasta" : "${samplename}_${params.vcf2genome_outfile}.fasta" def fasta_head = !params.vcf2genome_header ? 
"${samplename}" : "${params.vcf2genome_header}" """ pigz -d -f -p ${task.cpus} ${vcf} vcf2genome -Xmx${task.memory.toGiga()}g -draft ${out} -draftname "${fasta_head}" -in ${vcf.baseName} -minc ${params.vcf2genome_minc} -minfreq ${params.vcf2genome_minfreq} -minq ${params.vcf2genome_minq} -ref ${fasta} -refMod ${out}_refmod.fasta -uncertain ${out}_uncertainty.fasta pigz -f -p ${task.cpus} ${out}* bgzip -@ ${task.cpus} *.vcf """ } // More complex consensus caller with additional filtering functionality (e.g. for heterozygous calls) to generate SNP tables and other things sometimes used in aDNA bacteria studies // Create input channel for MultiVCFAnalyzer, possibly mixing with pre-made VCFs. if (!params.additional_vcf_files) { ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[-1] }.collect() } else { ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[-1] }.mix(ch_extravcfs_for_multivcfanalyzer).collect() } process multivcfanalyzer { label 'mc_small' publishDir "${params.outdir}/multivcfanalyzer", mode: params.publish_dir_mode when: params.genotyping_tool == 'ug' && params.run_multivcfanalyzer && params.gatk_ploidy.toString() == '2' input: file vcf from ch_vcfs_for_multivcfanalyzer file fasta from ch_fasta_for_multivcfanalyzer output: file('fullAlignment.fasta.gz') file('info.txt.gz') file('snpAlignment.fasta.gz') file('snpAlignmentIncludingRefGenome.fasta.gz') file('snpStatistics.tsv.gz') file('snpTable.tsv.gz') file('snpTableForSnpEff.tsv.gz') file('snpTableWithUncertaintyCalls.tsv.gz') file('structureGenotypes.tsv.gz') file('structureGenotypes_noMissingData-Columns.tsv.gz') file('MultiVCFAnalyzer.json') optional true into ch_multivcfanalyzer_for_multiqc script: def write_freqs = params.write_allele_frequencies ? "T" : "F" """ pigz -d -f -p ${task.cpus} ${vcf} multivcfanalyzer -Xmx${task.memory.toGiga()}g ${params.snp_eff_results} ${fasta} ${params.reference_gff_annotations} . 
${write_freqs} ${params.min_genotype_quality} ${params.min_base_coverage} ${params.min_allele_freq_hom} ${params.min_allele_freq_het} ${params.reference_gff_exclude} *.vcf pigz -p ${task.cpus} *.tsv *.txt snpAlignment.fasta snpAlignmentIncludingRefGenome.fasta fullAlignment.fasta bgzip -@ ${task.cpus} *.vcf """ } //////////////////////////////////////////////////////////// /* -- HUMAN DNA SPECIFIC ADDITIONAL INFORMATION -- */ //////////////////////////////////////////////////////////// // Mitochondrial to nuclear ratio helps to evaluate quality of tissue sampled process mtnucratio { label 'sc_small' tag "${samplename}" publishDir "${params.outdir}/mtnucratio", mode: params.publish_dir_mode when: params.run_mtnucratio input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_formtnucratio output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.mtnucratio") tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_mtnucratio_for_multiqc script: """ mtnucratio -Xmx${task.memory.toGiga()}g ${bam} "${params.mtnucratio_header}" """ } // Human biological sex estimation // rename to prevent single/double stranded library sample name-based file conflict process sexdeterrmine_prep { label 'sc_small' input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_for_sexdeterrmine_prep output: file "*_{single,double}strand.bam" into ch_prepped_for_sexdeterrmine when: params.run_sexdeterrmine script: """ mv ${bam} ${bam.baseName}_${strandedness}strand.bam """ } // As we collect all files for a single sex_deterrmine run, we DO NOT use the normal input/output tuple process sexdeterrmine { label 'mc_small' publishDir "${params.outdir}/sex_determination", mode: params.publish_dir_mode input: path bam from ch_prepped_for_sexdeterrmine.collect() path(bed) from ch_bed_for_sexdeterrmine output: file "SexDet.txt" 
file "*.json" into ch_sexdet_for_multiqc when: params.run_sexdeterrmine script: def filter = bed.getName() != 'nf-core_eager_dummy.txt' ? "-b $bed" : '' """ ls *.bam >> bamlist.txt samtools depth -aa -q30 -Q30 $filter -f bamlist.txt | sexdeterrmine -f bamlist.txt > SexDet.txt """ } // Human DNA nuclear contamination estimation process nuclear_contamination{ label 'sc_small' tag "${samplename}" publishDir "${params.outdir}/nuclear_contamination", mode: params.publish_dir_mode when: params.run_nuclear_contamination input: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(input), path(bai) from ch_for_nuclear_contamination output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path('*.X.contamination.out') into ch_from_nuclear_contamination script: """ samtools index ${input} angsd -i ${input} -r ${params.contamination_chrom_name}:5000000-154900000 -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 30 -out ${libraryid}.doCounts contamination -a ${libraryid}.doCounts.icnts.gz -h ${projectDir}/assets/angsd_resources/HapMapChrX.gz 2> ${libraryid}.X.contamination.out """ } // As we collect all files for a single print_nuclear_contamination run, we DO NOT use the normal input/output tuple process print_nuclear_contamination{ label 'sc_tiny' publishDir "${params.outdir}/nuclear_contamination", mode: params.publish_dir_mode when: params.run_nuclear_contamination input: path Contam from ch_from_nuclear_contamination.map { it[7] }.collect() output: file 'nuclear_contamination.txt' file 'nuclear_contamination_mqc.json' into ch_nuclear_contamination_for_multiqc script: """ print_x_contamination.py ${Contam.join(' ')} """ } ///////////////////////////////////////////////////////// /* -- METAGENOMICS-SPECIFIC ADDITIONAL STEPS -- */ ///////////////////////////////////////////////////////// // Low entropy read filter to reduce input sequences of reads that are highly uninformative, and thus reduce runtime/false positives process 
metagenomic_complexity_filter {
  label 'mc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/metagenomic_complexity_filter/", mode: params.publish_dir_mode

  when:
  params.metagenomic_complexity_filter

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(fastq) from ch_bam_filtering_for_metagenomic

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_lowcomplexityremoved.fq.gz") into ch_lowcomplexityfiltered_for_metagenomic
  path("*_bbduk.stats") into ch_metagenomic_complexity_filter_for_multiqc

  script:
  """
  bbduk.sh -Xmx${task.memory.toGiga()}g in=${fastq} threads=${task.cpus} entropymask=f entropy=${params.metagenomic_complexity_entropy} out=${fastq}_lowcomplexityremoved.fq.gz 2> ${fastq}_bbduk.stats
  """
}

// metagenomic complexity filter bypass
if ( params.metagenomic_complexity_filter ) {
  ch_lowcomplexityfiltered_for_metagenomic
    .set{ ch_filtered_for_metagenomic }
} else {
  ch_metagenomic_for_skipentropyfilter
    .set{ ch_filtered_for_metagenomic }
}

// MALT is a super-fast BLAST replacement typically used for pathogen detection or microbiome profiling against large databases, here using off-target reads from mapping
// As we collect all files for all metagenomic runs, we DO NOT use the normal input/output tuple!
if (params.metagenomic_tool == 'malt') { ch_filtered_for_metagenomic .set {ch_input_for_metagenomic_malt} ch_input_for_metagenomic_kraken = Channel.empty() } else if (params.metagenomic_tool == 'kraken') { ch_filtered_for_metagenomic .set {ch_input_for_metagenomic_kraken} ch_input_for_metagenomic_malt = Channel.empty() } else if ( !params.metagenomic_tool ) { ch_input_for_metagenomic_malt = Channel.empty() ch_input_for_metagenomic_kraken = Channel.empty() } // As we collect all files for a single MALT run, we DO NOT use the normal input/output tuple process malt { label 'mc_small' publishDir "${params.outdir}/metagenomic_classification/malt", mode: params.publish_dir_mode when: params.run_metagenomic_screening && params.run_bam_filtering && params.bam_unmapped_type == 'fastq' && params.metagenomic_tool == 'malt' input: file fastqs from ch_input_for_metagenomic_malt.map { it[7] }.collect() file db from ch_db_for_malt output: path("*.rma6") into ch_rma_for_maltExtract path("*.sam.gz") optional true path("malt.log") into ch_malt_for_multiqc script: if ( "${params.malt_min_support_mode}" == "percent" ) { min_supp = "-supp ${params.malt_min_support_percent}" } else if ( "${params.malt_min_support_mode}" == "reads" ) { min_supp = "-sup ${params.metagenomic_min_support_reads}" } def sam_out = params.malt_sam_output ? "-a . -f SAM" : "" """ malt-run \ -J-Xmx${task.memory.toGiga()}g \ -t ${task.cpus} \ -v \ -o . \ -d ${db} \ ${sam_out} \ -id ${params.percent_identity} \ -m ${params.malt_mode} \ -at ${params.malt_alignment_mode} \ -top ${params.malt_top_percent} \ ${min_supp} \ -mq ${params.malt_max_queries} \ --memoryMode ${params.malt_memory_mode} \ -i ${fastqs.join(' ')} |&tee malt.log """ } // MaltExtract performs aDNA evaluation from the output of MALT (damage patterns, read lengths etc.) 
// As we collect all files for a single MALT extract run, we DO NOT use the normal input/output tuple process maltextract { label 'mc_medium' publishDir "${params.outdir}/maltextract/", mode: params.publish_dir_mode when: params.run_maltextract && params.metagenomic_tool == 'malt' input: file rma6 from ch_rma_for_maltExtract.collect() file taxon_list from ch_taxonlist_for_maltextract file ncbifiles from ch_ncbifiles_for_maltextract output: path "results/" type('dir') file "results/*_Wevid.json" optional true into ch_hops_for_multiqc script: def destack = params.maltextract_destackingoff ? "--destackingOff" : "" def downsam = params.maltextract_downsamplingoff ? "--downSampOff" : "" def dupremo = params.maltextract_duplicateremovaloff ? "--dupRemOff" : "" def matches = params.maltextract_matches ? "--matches" : "" def megsum = params.maltextract_megansummary ? "--meganSummary" : "" def topaln = params.maltextract_topalignment ? "--useTopAlignment" : "" def ss = params.single_stranded ? "--singleStranded" : "" """ MaltExtract \ -Xmx${task.memory.toGiga()}g \ -t ${taxon_list} \ -i ${rma6.join(' ')} \ -o results/ \ -r ${ncbifiles} \ -p ${task.cpus} \ -f ${params.maltextract_filter} \ -a ${params.maltextract_toppercent} \ --minPI ${params.maltextract_percentidentity} \ ${destack} \ ${downsam} \ ${dupremo} \ ${matches} \ ${megsum} \ ${topaln} \ ${ss} postprocessing.AMPS.r -r results/ -m ${params.maltextract_filter} -t ${task.cpus} -n ${taxon_list} -j """ } // Kraken is offered as a replacement for MALT as MALT is _very_ resource hungry if (params.run_metagenomic_screening && params.database.endsWith(".tar.gz") && params.metagenomic_tool == 'kraken'){ comp_kraken = file(params.database) process decomp_kraken { input: path(ckdb) from comp_kraken output: path(dbname) into ch_krakendb script: dbname = ckdb.toString() - '.tar.gz' """ tar xvzf $ckdb mkdir -p $dbname mv *.k2d $dbname || echo "nothing to do" """ } } else if (params.database && ! 
params.database.endsWith(".tar.gz") && params.run_metagenomic_screening && params.metagenomic_tool == 'kraken') { ch_krakendb = Channel.fromPath(params.database).first() } else { ch_krakendb = Channel.empty() } process kraken { tag "$prefix" label 'mc_huge' publishDir "${params.outdir}/metagenomic_classification/kraken", mode: params.publish_dir_mode when: params.run_metagenomic_screening && params.run_bam_filtering && params.bam_unmapped_type == 'fastq' && params.metagenomic_tool == 'kraken' input: path(fastq) from ch_input_for_metagenomic_kraken.map { it[7] } path(krakendb) from ch_krakendb output: file "*.kraken.out" optional true into ch_kraken_out tuple prefix, path("*.kraken2_report") optional true into ch_kraken_report, ch_kraken_for_multiqc script: prefix = fastq.baseName out = prefix+".kraken.out" kreport = prefix+".kraken2_report" kreport_old = prefix+".kreport" """ kraken2 --db ${krakendb} --threads ${task.cpus} --output $out --report-minimizer-data --report $kreport $fastq cut -f1-3,6-8 $kreport > $kreport_old """ } process kraken_parse { tag "$name" errorStrategy 'ignore' input: tuple val(name), path(kraken_r) from ch_kraken_report output: path('*_kraken_parsed.csv') into ch_kraken_parsed script: read_out = name+".read_kraken_parsed.csv" kmer_out = name+".kmer_kraken_parsed.csv" """ kraken_parse.py -c ${params.metagenomic_min_support_reads} -or $read_out -ok $kmer_out $kraken_r """ } process kraken_merge { publishDir "${params.outdir}/metagenomic_classification/kraken", mode: params.publish_dir_mode input: file csv_count from ch_kraken_parsed.collect() output: path('*.csv') script: read_out = "kraken_read_count.csv" kmer_out = "kraken_kmer_duplication.csv" """ merge_kraken_res.py -or $read_out -ok $kmer_out """ } ////////////////////////////////////// /* -- PIPELINE COMPLETION -- */ ////////////////////////////////////// // Pipeline documentation for on-server guidance process output_documentation { label 'sc_tiny' publishDir 
"${params.outdir}/documentation", mode: params.publish_dir_mode input: file output_docs from ch_output_docs file images from ch_output_docs_images output: file "results_description.html" script: """ markdown_to_html.py $output_docs -o results_description.html """ } /* * Parse software version numbers */ process get_software_versions { label 'mc_small' publishDir "${params.outdir}/pipeline_info", mode: params.publish_dir_mode, saveAs: { filename -> if (filename.indexOf(".csv") > 0) filename else null } output: file 'software_versions_mqc.yaml' into software_versions_yaml file "software_versions.csv" script: """ echo $workflow.manifest.version &> v_pipeline.txt echo $workflow.nextflow.version &> v_nextflow.txt fastqc -t ${task.cpus} --version &> v_fastqc.txt 2>&1 || true AdapterRemoval --version &> v_adapterremoval.txt 2>&1 || true fastp --version &> v_fastp.txt 2>&1 || true bwa &> v_bwa.txt 2>&1 || true circulargenerator -Xmx${task.memory.toGiga()}g --help | head -n 1 &> v_circulargenerator.txt 2>&1 || true samtools --version &> v_samtools.txt 2>&1 || true dedup -Xmx${task.memory.toGiga()}g -v &> v_dedup.txt 2>&1 || true ## bioconda recipe of picard is incorrectly set up and extra warning made with stderr, this ugly command ensures only version exported ( exec 7>&1; picard -Xmx${task.memory.toMega()}M MarkDuplicates --version 2>&1 >&7 | grep -v '/' >&2 ) 2> v_markduplicates.txt || true qualimap --version --java-mem-size=${task.memory.toGiga()}G &> v_qualimap.txt 2>&1 || true preseq &> v_preseq.txt 2>&1 || true gatk --java-options "-Xmx${task.memory.toGiga()}G" --version 2>&1 | grep '(GATK)' > v_gatk.txt 2>&1 || true gatk3 -Xmx${task.memory.toGiga()}g --version 2>&1 | head -n 1 > v_gatk3.txt 2>&1 || true freebayes --version &> v_freebayes.txt 2>&1 || true bedtools --version &> v_bedtools.txt 2>&1 || true damageprofiler -Xmx${task.memory.toGiga()}g --version &> v_damageprofiler.txt 2>&1 || true bam --version &> v_bamutil.txt 2>&1 || true pmdtools --version &> 
v_pmdtools.txt 2>&1 || true angsd -h |& head -n 1 | cut -d ' ' -f3-4 &> v_angsd.txt 2>&1 || true multivcfanalyzer -Xmx${task.memory.toGiga()}g --help | head -n 1 &> v_multivcfanalyzer.txt 2>&1 || true malt-run -J-Xmx${task.memory.toGiga()}g --help |& tail -n 3 | head -n 1 | cut -f 2 -d'(' | cut -f 1 -d ',' &> v_malt.txt 2>&1 || true MaltExtract -Xmx${task.memory.toGiga()}g --help | head -n 2 | tail -n 1 &> v_maltextract.txt 2>&1 || true multiqc --version &> v_multiqc.txt 2>&1 || true vcf2genome -Xmx${task.memory.toGiga()}g -h |& head -n 1 &> v_vcf2genome.txt || true mtnucratio -Xmx${task.memory.toGiga()}g --help &> v_mtnucratiocalculator.txt || true sexdeterrmine --version &> v_sexdeterrmine.txt || true kraken2 --version | head -n 1 &> v_kraken.txt || true endorS.py --version &> v_endorSpy.txt || true pileupCaller --version &> v_sequencetools.txt 2>&1 || true bowtie2 --version | grep -a 'bowtie2-.* -fdebug' > v_bowtie2.txt || true eigenstrat_snp_coverage --version | cut -d ' ' -f2 >v_eigenstrat_snp_coverage.txt || true mapDamage --version > v_mapdamage.txt || true bbversion.sh > v_bbduk.txt || true bcftools --version | grep 'bcftools' | cut -d ' ' -f 2 > v_bcftools.txt || true scrape_software_versions.py &> software_versions_mqc.yaml """ } // MultiQC file generation for pipeline report //def workflow_summary = NfcoreSchema.params_summary_multiqc(workflow, summary_params) //ch_workflow_summary = Channel.value(workflow_summary) process multiqc { label 'sc_medium' publishDir "${params.outdir}/multiqc", mode: params.publish_dir_mode input: file multiqc_config from ch_multiqc_config file (mqc_custom_config) from ch_multiqc_custom_config.collect().ifEmpty([]) file software_versions_mqc from software_versions_yaml.collect().ifEmpty([]) file logo from ch_eager_logo file ('fastqc_raw/*') from ch_prefastqc_for_multiqc.collect().ifEmpty([]) path('fastqc/*') from ch_fastqc_after_clipping.collect().ifEmpty([]) file ('adapter_removal/*') from 
ch_adapterremoval_logs.collect().ifEmpty([]) file ('mapping/bt2/*') from ch_bt2_for_multiqc.collect().ifEmpty([]) file ('flagstat/*') from ch_flagstat_for_multiqc.collect().ifEmpty([]) file ('flagstat_filtered/*') from ch_bam_filtered_flagstat_for_multiqc.collect().ifEmpty([]) file ('preseq/*') from ch_preseq_for_multiqc.collect().ifEmpty([]) file ('damageprofiler/dmgprof*/*') from ch_damageprofiler_results.collect().ifEmpty([]) file ('mapdamage/*') from ch_mapdamage_for_multiqc.collect().ifEmpty([]) file ('qualimap/qualimap*/*') from ch_qualimap_results.collect().ifEmpty([]) file ('markdup/*') from ch_markdup_results_for_multiqc.collect().ifEmpty([]) file ('dedup*/*') from ch_dedup_results_for_multiqc.collect().ifEmpty([]) file ('fastp/*') from ch_fastp_for_multiqc.collect().ifEmpty([]) file ('sexdeterrmine/*') from ch_sexdet_for_multiqc.collect().ifEmpty([]) file ('mutnucratio/*') from ch_mtnucratio_for_multiqc.collect().ifEmpty([]) file ('endorspy/*') from ch_endorspy_for_multiqc.collect().ifEmpty([]) file ('multivcfanalyzer/*') from ch_multivcfanalyzer_for_multiqc.collect().ifEmpty([]) file ('fastp_lowcomplexityfilter/*') from ch_metagenomic_complexity_filter_for_multiqc.collect().ifEmpty([]) file ('malt/*') from ch_malt_for_multiqc.collect().ifEmpty([]) file ('kraken/*') from ch_kraken_for_multiqc.collect().ifEmpty([]) file ('hops/*') from ch_hops_for_multiqc.collect().ifEmpty([]) file ('nuclear_contamination/*') from ch_nuclear_contamination_for_multiqc.collect().ifEmpty([]) file ('genotyping/*') from ch_eigenstrat_snp_cov_for_multiqc.collect().ifEmpty([]) file ('bcftools_stats') from ch_bcftools_stats_for_multiqc.collect().ifEmpty([]) file workflow_summary from ch_workflow_summary.collectFile(name: "workflow_summary_mqc.yaml") output: file "*multiqc_report.html" into ch_multiqc_report file "*_data" script: rtitle = '' rfilename = '' if (!(workflow.runName ==~ /[a-z]+_[a-z]+/)) { rtitle = "--title \"${workflow.runName}\"" rfilename = "--filename " + 
workflow.runName.replaceAll('\\W','_').replaceAll('_+','_') + "_multiqc_report" } def custom_config_file = params.multiqc_config ? "--config $mqc_custom_config" : '' """ multiqc -f $rtitle $rfilename $multiqc_config $custom_config_file . """ } // Send completion emails if requested, so user knows data is ready workflow.onComplete { // Set up the e-mail variables def subject = "[nf-core/eager] Successful: $workflow.runName" if (!workflow.success) { subject = "[nf-core/eager] FAILED: $workflow.runName" } def email_fields = [:] email_fields['version'] = workflow.manifest.version email_fields['runName'] = workflow.runName email_fields['success'] = workflow.success email_fields['dateComplete'] = workflow.complete email_fields['duration'] = workflow.duration email_fields['exitStatus'] = workflow.exitStatus email_fields['errorMessage'] = (workflow.errorMessage ?: 'None') email_fields['errorReport'] = (workflow.errorReport ?: 'None') email_fields['commandLine'] = workflow.commandLine email_fields['projectDir'] = workflow.projectDir email_fields['summary'] = summary email_fields['summary']['Date Started'] = workflow.start email_fields['summary']['Date Completed'] = workflow.complete email_fields['summary']['Pipeline script file path'] = workflow.scriptFile email_fields['summary']['Pipeline script hash ID'] = workflow.scriptId if (workflow.repository) email_fields['summary']['Pipeline repository Git URL'] = workflow.repository if (workflow.commitId) email_fields['summary']['Pipeline repository Git Commit'] = workflow.commitId if (workflow.revision) email_fields['summary']['Pipeline Git branch/tag'] = workflow.revision email_fields['summary']['Nextflow Version'] = workflow.nextflow.version email_fields['summary']['Nextflow Build'] = workflow.nextflow.build email_fields['summary']['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp // On success try attach the multiqc report def mqc_report = null try { if (workflow.success) { mqc_report = ch_multiqc_report.getVal() if 
(mqc_report.getClass() == ArrayList) {
                log.warn "[nf-core/eager] Found multiple reports from process 'multiqc', will use only one"
                mqc_report = mqc_report[0]
            }
        }
    } catch (all) {
        log.warn "[nf-core/eager] Could not attach MultiQC report to summary email"
    }

    // Check if we are only sending emails on failure
    email_address = params.email
    if (!params.email && params.email_on_fail && !workflow.success) {
        email_address = params.email_on_fail
    }

    // Render the TXT template
    def engine = new groovy.text.GStringTemplateEngine()
    def tf = new File("$projectDir/assets/email_template.txt")
    def txt_template = engine.createTemplate(tf).make(email_fields)
    def email_txt = txt_template.toString()

    // Render the HTML template
    def hf = new File("$projectDir/assets/email_template.html")
    def html_template = engine.createTemplate(hf).make(email_fields)
    def email_html = html_template.toString()

    // Render the sendmail template
    def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "$projectDir", mqcFile: mqc_report, mqcMaxSize: params.max_multiqc_email_size.toBytes() ]
    def sf = new File("$projectDir/assets/sendmail_template.txt")
    def sendmail_template = engine.createTemplate(sf).make(smail_fields)
    def sendmail_html = sendmail_template.toString()

    // Send the HTML e-mail
    if (email_address) {
        try {
            // Deliberately jump to the fallback branch below when plaintext e-mail is requested
            if (params.plaintext_email) { throw new GroovyException('Send plaintext e-mail, not HTML') }
            // Try to send HTML e-mail using sendmail
            [ 'sendmail', '-t' ].execute() << sendmail_html
            log.info "[nf-core/eager] Sent summary e-mail to $email_address (sendmail)"
        } catch (all) {
            // Catch failures and try with plaintext
            def mail_cmd = [ 'mail', '-s', subject, '--content-type=text/html', email_address ]
            // mqc_report stays null when the run failed or no report could be retrieved
            if ( mqc_report != null && mqc_report.size() <= params.max_multiqc_email_size.toBytes() ) {
              mail_cmd += [ '-A', mqc_report ]
            }
            mail_cmd.execute() << email_html
            log.info "[nf-core/eager] Sent summary e-mail to $email_address (mail)"
        }
    }

    // Write summary e-mail HTML to a file
    def output_d = new File("${params.outdir}/pipeline_info/")
    if (!output_d.exists()) {
        output_d.mkdirs()
    }
    def output_hf = new File(output_d, "pipeline_report.html")
    output_hf.withWriter { w -> w << email_html }
    def output_tf = new File(output_d, "pipeline_report.txt")
    output_tf.withWriter { w -> w << email_txt }

    c_green = params.monochrome_logs ? '' : "\033[0;32m";
    c_purple = params.monochrome_logs ? '' : "\033[0;35m";
    c_red = params.monochrome_logs ? '' : "\033[0;31m";
    c_reset = params.monochrome_logs ? '' : "\033[0m";

    if (workflow.stats.ignoredCount > 0 && workflow.success) {
        log.info "-${c_purple}Warning, pipeline completed, but with errored process(es) ${c_reset}-"
        log.info "-${c_red}Number of ignored errored process(es) : ${workflow.stats.ignoredCount} ${c_reset}-"
        log.info "-${c_green}Number of successfully run process(es) : ${workflow.stats.succeedCount} ${c_reset}-"
    }

    if (workflow.success) {
        log.info "-${c_purple}[nf-core/eager]${c_green} Pipeline completed successfully${c_reset}-"
        log.info "-${c_purple}[nf-core/eager]${c_green} MultiQC run report can be found in ${params.outdir}/multiqc ${c_reset}-"
        log.info "-${c_purple}[nf-core/eager]${c_green} Further output documentation can be seen at https://nf-co.re/eager/output ${c_reset}-"
    } else {
        checkHostname()
        log.info "-${c_purple}[nf-core/eager]${c_red} Pipeline completed with errors${c_reset}-"
    }

}

workflow.onError {
    // Print unexpected parameters - easiest is to just rerun validation
    NfcoreSchema.validateParameters(params, json_schema, log)
}

/////////////////////////////////////
/* --    AUXILIARY FUNCTIONS    -- */
/////////////////////////////////////

// Channelling the TSV file containing FASTQ or BAM
def extract_data(tsvFile) {
    Channel.fromPath(tsvFile)
        .splitCsv(header: true, sep: '\t')
        .map { row ->

            def expected_keys = ['Sample_Name', 'Library_ID', 'Lane', 'Colour_Chemistry', 'SeqType', 'Organism', 'Strandedness', 'UDG_Treatment', 'R1', 'R2', 'BAM']

            if ( !row.keySet().containsAll(expected_keys) ) exit 1,
"[nf-core/eager] error: Invalid TSV input - malformed column names. Please check input TSV. Column names should be: ${expected_keys.join(", ")}" checkNumberOfItem(row, 11) if ( row.Sample_Name == null || row.Sample_Name.isEmpty() ) exit 1, "[nf-core/eager] error: the Sample_Name column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.Library_ID == null || row.Library_ID.isEmpty() ) exit 1, "[nf-core/eager] error: the Library_ID column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.Lane == null || row.Lane.isEmpty() ) exit 1, "[nf-core/eager] error: the Lane column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.Colour_Chemistry == null || row.Colour_Chemistry.isEmpty() ) exit 1, "[nf-core/eager] error: the Colour_Chemistry column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.SeqType == null || row.SeqType.isEmpty() ) exit 1, "[nf-core/eager] error: the SeqType column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.Organism == null || row.Organism.isEmpty() ) exit 1, "[nf-core/eager] error: the Organism column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.Strandedness == null || row.Strandedness.isEmpty() ) exit 1, "[nf-core/eager] error: the Strandedness column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.UDG_Treatment == null || row.UDG_Treatment.isEmpty() ) exit 1, "[nf-core/eager] error: the UDG_Treatment column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.R1 == null || row.R1.isEmpty() ) exit 1, "[nf-core/eager] error: the R1 column is empty. 
Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.R2 == null || row.R2.isEmpty() ) exit 1, "[nf-core/eager] error: the R2 column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" if ( row.BAM == null || row.BAM.isEmpty() ) exit 1, "[nf-core/eager] error: the BAM column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" def samplename = row.Sample_Name def libraryid = row.Library_ID def lane = row.Lane def colour = row.Colour_Chemistry def seqtype = row.SeqType def organism = row.Organism def strandedness = row.Strandedness def udg = row.UDG_Treatment def r1 = row.R1.matches('NA') ? 'NA' : return_file(row.R1) def r2 = row.R2.matches('NA') ? 'NA' : return_file(row.R2) def bam = row.BAM.matches('NA') ? 'NA' : return_file(row.BAM) // check no empty metadata fields if (samplename == '' || libraryid == '' || lane == '' || colour == '' || seqtype == '' || organism == '' || strandedness == '' || udg == '' || r1 == '' || r2 == '' || bam == '') exit 1, "[nf-core/eager] error: a field/column does not contain any information. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}" // Check no 'empty' rows if (r1.matches('NA') && r2.matches('NA') && bam.matches('NA')) exit 1, "[nf-core/eager] error: A row in your TSV appears to have all files defined as NA. See '--help' flag and documentation under 'running the pipeline' for more information. Check row for: ${samplename}" // Ensure BAMs aren't submitted with PE if (!bam.matches('NA') && seqtype.matches('PE')) exit 1, "[nf-core/eager] error: BAM input rows in TSV cannot be set as PE, only SE. See '--help' flag and documentation under 'running the pipeline' for more information. 
Check row for: ${samplename}"

            // Check valid UDG treatment
            if (!udg.matches('none') && !udg.matches('half') && !udg.matches('full')) exit 1, "[nf-core/eager] error: UDG treatment can only be 'none', 'half' or 'full'. See '--help' flag and documentation under 'running the pipeline' for more information. You have '${udg}'"

            // Check valid colour chemistry
            if (!colour.matches('2') && !colour.matches('4')) exit 1, "[nf-core/eager] error: Colour chemistry in TSV can either be 2 (e.g. NextSeq/NovaSeq) or 4 (e.g. HiSeq/MiSeq)"

            // Ensure SeqType is either SE or PE
            if (!seqtype.matches('PE') && !seqtype.matches('SE')) exit 1, "[nf-core/eager] error: SeqType for one or more rows in TSV is neither SE nor PE! See '--help' flag and documentation under 'running the pipeline' for more information. You have: '${seqtype}'"

            // So we don't accept existing files that are wrong format: e.g. fasta or sam
            if ( !r1.matches('NA') && !has_extension(r1, "fastq.gz") && !has_extension(r1, "fq.gz") && !has_extension(r1, "fastq") && !has_extension(r1, "fq")) exit 1, "[nf-core/eager] error: A specified R1 file either has a non-recognizable FASTQ extension or is not NA. See '--help' flag and documentation under 'running the pipeline' for more information. Check: ${r1}"
            if ( !r2.matches('NA') && !has_extension(r2, "fastq.gz") && !has_extension(r2, "fq.gz") && !has_extension(r2, "fastq") && !has_extension(r2, "fq")) exit 1, "[nf-core/eager] error: A specified R2 file either has a non-recognizable FASTQ extension or is not NA. See '--help' flag and documentation under 'running the pipeline' for more information. Check: ${r2}"
            if ( !bam.matches('NA') && !has_extension(bam, "bam")) exit 1, "[nf-core/eager] error: A specified BAM file either has a non-recognizable BAM extension or is not NA. See '--help' flag and documentation under 'running the pipeline' for more information.
Check: ${bam}" [ samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2, bam ] } } // Check if a row has the expected number of item def checkNumberOfItem(row, number) { if (row.size() != number) exit 1, "[nf-core/eager] error: Invalid TSV input - malformed row (e.g. missing column) in ${row}, see '--help' flag and documentation under 'running the pipeline' for more information" return true } // Return file if it exists def return_file(it) { if (!file(it).exists()) exit 1, "[nf-core/eager] error: Cannot find supplied FASTQ or BAM input file. If using input method TSV set to NA if no file required. See '--help' flag and documentation under 'running the pipeline' for more information. Check file: ${it}" return file(it) } // Check file extension def has_extension(it, extension) { it.toString().toLowerCase().endsWith(extension.toLowerCase()) } // Extract FastQs from Path // Create a channel of FASTQs from a directory pattern: "my_samples/*/" // All FASTQ files in subdirectories are collected and emitted; // they must have _R1_ and/or _R2_ in their names. def retrieve_input_paths(input, colour_chem, pe_se, ds_ss, udg_treat, bam_in) { if ( !bam_in ) { if( pe_se ) { log.info "Generating single-end FASTQ data TSV" Channel .fromFilePairs( input, size: 1 ) .filter { it =~/.*.fastq.gz|.*.fq.gz|.*.fastq|.*.fq/ } .ifEmpty { exit 1, "[nf-core/eager] error: Your specified FASTQ read files did not end in: '.fastq.gz', '.fq.gz', '.fastq', or '.fq'. Did you forget --bam?" } .map { row -> [ row[0], [ row[1][0] ] ] } .ifEmpty { exit 1, "[nf-core/eager] error: --input was empty - no input files supplied!" 
} .into { ch_reads_for_faketsv; ch_reads_for_validate } // Check we don't have any duplicated sample names due to fromFilePairs behaviour of calculating sample name from anything before R1/R2 glob ch_reads_for_validate .groupTuple() .map{ if ( validate_size(it[1], 1) ) { null } else { exit 1, "[nf-core/eager] error: You have supplied non-unique sample names (text before R1/R2 indication). Did you accidentally supply paired-end data? see '--help' flag and documentation under 'running the pipeline' for more information. Check duplicates of: ${it[0]}" } } } else if (!pe_se ){ log.info "Generating paired-end FASTQ data TSV" Channel .fromFilePairs( input ) .filter { it =~/.*.fastq.gz|.*.fq.gz|.*.fastq|.*.fq/ } .ifEmpty { exit 1, "[nf-core/eager] error: Files could not be found. Do the specified FASTQ read files end in: '.fastq.gz', '.fq.gz', '.fastq', or '.fq'? Did you forget --single_end?" } .map { row -> [ row[0], [ row[1][0], row[1][1] ] ] } .ifEmpty { exit 1, "[nf-core/eager] error: --input was empty - no input files supplied!" } .into { ch_reads_for_faketsv; ch_reads_for_validate } // Check we don't have any duplicated sample names due to fromFilePairs behaviour of calculating sample name from anything before R1/R2 glob ch_reads_for_validate .groupTuple() .map{ if ( validate_size(it[1], 1) ) { null } else { exit 1, "[nf-core/eager] error: You have supplied non-unique sample names (text before R1/R2 indication). See '--help' flag and documentation under 'running the pipeline' for more information. Check duplicates of: ${it[0]}" } } } } else if ( bam_in ) { log.info "Generating BAM data TSV" Channel .fromFilePairs( input, size: 1 ) .filter { it =~/.*.bam/ } .map { row -> [ row[0], [ row[1][0] ] ] } .ifEmpty { exit 1, "[nf-core/eager] error: Cannot find any bam file matching: ${input}" } .set { ch_reads_for_faketsv } } ch_reads_for_faketsv .map{ def samplename = it[0] def libraryid = it[0] def lane = 0 def colour = "${colour_chem}" def seqtype = pe_se ? 
'SE' : 'PE' def organism = 'NA' def strandedness = ds_ss ? 'single' : 'double' def udg = udg_treat def r1 = !bam_in ? return_file(it[1][0]) : 'NA' def r2 = !bam_in && !pe_se ? return_file(it[1][1]) : 'NA' def bam = bam_in && pe_se ? return_file(it[1][0]) : 'NA' [ samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2, bam ] } .ifEmpty {exit 1, "[nf-core/eager] error: Invalid file paths with --input"} } // Function to check length of collection in a channel closure is as expected (e.g. with .map()) def validate_size(collection, size){ if ( collection.size() != size ) { return false } else { return true } } def checkHostname() { def c_reset = params.monochrome_logs ? '' : "\033[0m" def c_white = params.monochrome_logs ? '' : "\033[0;37m" def c_red = params.monochrome_logs ? '' : "\033[1;91m" def c_yellow_bold = params.monochrome_logs ? '' : "\033[1;93m" if (params.hostnames) { def hostname = 'hostname'.execute().text.trim() params.hostnames.each { prof, hnames -> hnames.each { hname -> if (hostname.contains(hname) && !workflow.profile.contains(prof)) { log.error "${c_red}====================================================${c_reset}\n" + " ${c_red}WARNING!${c_reset} You are running with `-profile $workflow.profile`\n" + " but your machine hostname is ${c_white}'$hostname'${c_reset}\n" + " ${c_yellow_bold}It's highly recommended that you use `-profile $prof${c_reset}`\n" + "${c_red}====================================================${c_reset}\n" } } } } } ================================================ FILE: nextflow.config ================================================ /* * ------------------------------------------------- * nf-core/eager Nextflow config file * ------------------------------------------------- * Default config options for all environments. 
*/ // Global default params, used in configs params { // Workflow flags genome = false input = null input_paths = null single_end = false outdir = './results' publish_dir_mode = 'copy' config_profile_name = null // aws awsqueue = null awsregion = 'eu-west-1' awscli = null //Pipeline options enable_conda = false validate_params = true schema_ignore_params = 'genome' show_hidden_params = false //Input reads udg_type = 'none' single_stranded = false single_end = false colour_chemistry = 4 bam = false // Optional input information snpcapture_bed = null run_convertinputbam = false //Input reference fasta = null bwa_index = null bt2_index = null fasta_index = null seq_dict = null large_ref = false save_reference = false // this is just to stop the iGenomes WARN as we set as FALSE by default. Otherwise should be overwritten by optional config load below. genomes = false //Skipping parts of the pipeline for impatient users skip_fastqc = false skip_adapterremoval = false skip_preseq = false skip_deduplication = false skip_damage_calculation = false skip_qualimap = false //More defaults complexity_filter_poly_g = false complexity_filter_poly_g_min = 10 //Read clipping and merging parameters clip_forward_adaptor = 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' clip_reverse_adaptor = 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' clip_adapters_list = null clip_readlength = 30 clip_min_read_quality = 20 min_adap_overlap = 1 skip_collapse = false skip_trim = false preserve5p = false mergedonly = false qualitymax = 41 run_post_ar_trimming = false post_ar_trim_front = 7 post_ar_trim_tail = 7 post_ar_trim_front2 = 7 post_ar_trim_tail2 = 7 //Mapping algorithm mapper = 'bwaaln' bwaalnn = 0.01 // From Oliva et al. 2021 (10.1093/bib/bbab076) bwaalnk = 2 bwaalnl = 1024 // From Oliva et al. 2021 (10.1093/bib/bbab076) bwaalno = 2 // From Oliva et al. 
2021 (10.1093/bib/bbab076) circularextension = 500 circulartarget = 'MT' circularfilter = false bt2_alignmode = 'local' // from Cahill 2018 (10.1093/molbev/msy018) and, Poullet and Orlando (10.3389/fevo.2020.00105) bt2_sensitivity = 'sensitive' // from Poullet and Orlando (10.3389/fevo.2020.00105) bt2n = 0 // Do not set Cahill 2018 recommendation of 1 here, so not to 'hide' overriding bowtie2 presets bt2l = 0 bt2_trim5 = 0 bt2_trim3 = 0 bt2_maxins = 500 //Mapped read removal from input FASTQ hostremoval_input_fastq = false hostremoval_mode = 'remove' //BAM Filtering steps (default = discard unmapped reads) run_bam_filtering = false bam_mapping_quality_threshold = 0 bam_filter_minreadlength = 0 bam_unmapped_type = 'discard' //DeDuplication settings dedupper = 'markduplicates' dedup_all_merged = false //Preseq settings preseq_step_size = 1000 preseq_mode = 'c_curve' preseq_bootstrap = 100 preseq_maxextrap = 10000000000 preseq_cval = 0.95 preseq_terms = 100 //Damage estimation settings damage_calculation_tool = 'damageprofiler' damageprofiler_length = 100 damageprofiler_threshold = 15 damageprofiler_yaxis = 0.30 mapdamage_downsample = 0 mapdamage_yaxis = 0.30 //PMDTools settings run_pmdtools = false pmdtools_range = 10 pmdtools_threshold = 3 pmdtools_reference_mask = null pmdtools_max_reads = 10000 pmdtools_platypus = false // mapDamage run_mapdamage_rescaling = false rescale_length_5p = 0 rescale_length_3p = 0 rescale_seqlength = 12 //Bedtools settings run_bedtools_coverage = false anno_file = null anno_file_is_unsorted = false //bamUtils trimbam settings run_trim_bam = false bamutils_clip_double_stranded_half_udg_left = 0 bamutils_clip_double_stranded_half_udg_right = 0 bamutils_clip_double_stranded_none_udg_left = 0 bamutils_clip_double_stranded_none_udg_right = 0 bamutils_clip_single_stranded_half_udg_left = 0 bamutils_clip_single_stranded_half_udg_right = 0 bamutils_clip_single_stranded_none_udg_left = 0 bamutils_clip_single_stranded_none_udg_right = 0 
bamutils_softclip = false //Genotyping options run_genotyping = false genotyping_tool = null genotyping_source = 'raw' // gatk options gatk_call_conf = 30 gatk_ploidy = 2 gatk_downsample = 250 gatk_dbsnp = null gatk_hc_out_mode = 'EMIT_VARIANTS_ONLY' gatk_hc_emitrefconf = 'GVCF' gatk_ug_genotype_model = 'SNP' gatk_ug_out_mode = 'EMIT_VARIANTS_ONLY' gatk_ug_keep_realign_bam = false gatk_ug_defaultbasequalities = null // freebayes options freebayes_C = 1 freebayes_g = 0 freebayes_p = 2 // Sequencetools pileupCaller pileupcaller_snpfile = null pileupcaller_bedfile = null pileupcaller_method = 'randomHaploid' pileupcaller_transitions_mode = 'AllSites' pileupcaller_min_map_quality = 30 pileupcaller_min_base_quality = 30 // ANGSD Genotype Likelihoods angsd_glmodel = 'samtools' angsd_glformat = 'binary' angsd_createfasta = false angsd_fastamethod = 'random' run_bcftools_stats = true //Consensus sequence generation run_vcf2genome = false vcf2genome_outfile = '' vcf2genome_header = '' vcf2genome_minc = 5 vcf2genome_minq = 30 vcf2genome_minfreq = 0.8 //MultiVCFAnalyzer Options run_multivcfanalyzer = false write_allele_frequencies = false min_genotype_quality = 30 min_base_coverage = 5 min_allele_freq_hom = 0.9 min_allele_freq_het = 0.9 additional_vcf_files = null reference_gff_annotations = 'NA' reference_gff_exclude = 'NA' snp_eff_results = 'NA' //mtnucratio run_mtnucratio = false mtnucratio_header = 'MT' //Sex.DetERRmine settings run_sexdeterrmine = false sexdeterrmine_bedfile = null //Nuclear contamination based on chromosome X heterozygosity. 
run_nuclear_contamination = false contamination_chrom_name = 'X' // Default to using hs37d5 name // taxonomic classifier run_metagenomic_screening = false metagenomic_complexity_filter = false metagenomic_complexity_entropy = 0.3 metagenomic_tool = null database = null metagenomic_min_support_reads = 1 percent_identity = 85 malt_mode = 'BlastN' malt_alignment_mode = 'SemiGlobal' malt_top_percent = 1 malt_min_support_mode = 'percent' malt_min_support_percent = 0.01 malt_max_queries = 100 malt_memory_mode = 'load' malt_sam_output = false // maltextract - only including number // parameters if default documented or duplicate of MALT run_maltextract = false maltextract_taxon_list = null maltextract_ncbifiles = null maltextract_filter = 'def_anc' maltextract_toppercent = 0.01 maltextract_destackingoff = false maltextract_downsamplingoff = false maltextract_duplicateremovaloff = false maltextract_matches = false maltextract_megansummary = false maltextract_percentidentity = 85.0 maltextract_topalignment = false // Boilerplate options multiqc_config = false email = false email_on_fail = false max_multiqc_email_size = 25.MB plaintext_email = false monochrome_logs = false help = false igenomes_base = 's3://ngi-igenomes/igenomes' tracedir = "${params.outdir}/pipeline_info" igenomes_ignore = true custom_config_version = 'master' custom_config_base = "https://raw.githubusercontent.com/nf-core/configs/${params.custom_config_version}" hostnames = false config_profile_name = null config_profile_description = false config_profile_contact = false config_profile_url = false validate_params = true show_hidden_params = false schema_ignore_params = 'genomes,input_paths' // Defaults only, expecting to be overwritten max_memory = 128.GB max_cpus = 16 max_time = 240.h } // Container slug. Stable releases should specify release tag! 
// Developmental code should specify :dev process.container = 'nfcore/eager:2.5.3' // Load base.config by default for all pipelines includeConfig 'conf/base.config' // Load nf-core custom profiles from different Institutions try { includeConfig "${params.custom_config_base}/nfcore_custom.config" } catch (Exception e) { System.err.println("WARNING: Could not load nf-core/config profiles: ${params.custom_config_base}/nfcore_custom.config") } // Load nf-core/eager custom profiles from different institutions try { includeConfig "${params.custom_config_base}/pipeline/eager.config" } catch (Exception e) { System.err.println("WARNING: Could not load nf-core/config/eager profiles: ${params.custom_config_base}/pipeline/eager.config") } profiles { conda { docker.enabled = false singularity.enabled = false podman.enabled = false shifter.enabled = false charliecloud.enabled = false process.conda = "$projectDir/environment.yml" } debug { process.beforeScript = 'echo $HOSTNAME' } docker { docker.enabled = true singularity.enabled = false podman.enabled = false shifter.enabled = false charliecloud.enabled = false // Avoid this error: // WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap. // Testing this in nf-core after discussion here https://github.com/nf-core/tools/pull/351 // once this is established and works well, nextflow might implement this behavior as new default. 
docker.runOptions = '-u \$(id -u):\$(id -g)' } singularity { docker.enabled = false singularity.enabled = true podman.enabled = false shifter.enabled = false charliecloud.enabled = false singularity.autoMounts = true } podman { singularity.enabled = false docker.enabled = false podman.enabled = true shifter.enabled = false charliecloud.enabled = false } shifter { singularity.enabled = false docker.enabled = false podman.enabled = false shifter.enabled = true charliecloud.enabled = false } charliecloud { singularity.enabled = false docker.enabled = false podman.enabled = false shifter.enabled = false charliecloud.enabled = true } test { includeConfig 'conf/test.config'} test_direct { includeConfig 'conf/test_direct.config' } test_full { includeConfig 'conf/test_full.config' } test_bam { includeConfig 'conf/test_bam.config'} test_fna { includeConfig 'conf/test_fna.config'} test_humanbam { includeConfig 'conf/test_humanbam.config' } test_pretrim { includeConfig 'conf/test_pretrim.config' } test_kraken { includeConfig 'conf/test_kraken.config' } test_tsv_bam { includeConfig 'conf/test_tsv_bam.config'} test_tsv_fna { includeConfig 'conf/test_tsv_fna.config'} test_tsv_humanbam { includeConfig 'conf/test_tsv_humanbam.config' } test_tsv_pretrim { includeConfig 'conf/test_tsv_pretrim.config' } test_tsv_kraken { includeConfig 'conf/test_tsv_kraken.config' } test_tsv_complex { includeConfig 'conf/test_tsv_complex.config' } test_stresstest_human { includeConfig 'conf/test_stresstest_human.config' } benchmarking_human { includeConfig 'conf/benchmarking_human.config' } benchmarking_vikingfish { includeConfig 'conf/benchmarking_vikingfish.config' } } // Load igenomes.config if required if (!params.igenomes_ignore) { includeConfig 'conf/igenomes.config' } // Export these variables to prevent local Python/R libraries from conflicting with those in the container env { PYTHONNOUSERSITE = 1 R_PROFILE_USER = "/.Rprofile" R_ENVIRON_USER = "/.Renviron" } // Capture exit codes from 
upstream processes when piping process.shell = ['/bin/bash', '-euo', 'pipefail'] def trace_timestamp = new java.util.Date().format( 'yyyy-MM-dd_HH-mm-ss') timeline { enabled = true file = "${params.tracedir}/execution_timeline_${trace_timestamp}.html" } report { enabled = true file = "${params.tracedir}/execution_report_${trace_timestamp}.html" } trace { enabled = true file = "${params.tracedir}/execution_trace_${trace_timestamp}.txt" } dag { enabled = true file = "${params.tracedir}/pipeline_dag_${trace_timestamp}.svg" } manifest { name = 'nf-core/eager' author = 'The nf-core/eager community' homePage = 'https://github.com/nf-core/eager' description = 'A fully reproducible and state-of-the-art ancient DNA analysis pipeline' mainScript = 'main.nf' nextflowVersion = '>=20.07.1' version = '2.5.3' } // Function to ensure that resource requirements don't go beyond // a maximum limit def check_max(obj, type) { if (type == 'memory') { try { if (obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1) return params.max_memory as nextflow.util.MemoryUnit else return obj } catch (all) { println " ### ERROR ### Max memory '${params.max_memory}' is not valid! Using default value: $obj" return obj } } else if (type == 'time') { try { if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1) return params.max_time as nextflow.util.Duration else return obj } catch (all) { println " ### ERROR ### Max time '${params.max_time}' is not valid! Using default value: $obj" return obj } } else if (type == 'cpus') { try { return Math.min( obj, params.max_cpus as int ) } catch (all) { println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! 
Using default value: $obj" return obj } } } ================================================ FILE: nextflow_schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema", "$id": "https://raw.githubusercontent.com/nf-core/eager/master/nextflow_schema.json", "title": "nf-core/eager pipeline parameters", "description": "A fully reproducible and state-of-the-art ancient DNA analysis pipeline", "type": "object", "definitions": { "input_output_options": { "title": "Input/output options", "type": "object", "fa_icon": "fas fa-terminal", "description": "Define where the pipeline should find input data, and additional metadata.", "required": [ "input" ], "properties": { "input": { "type": "string", "description": "Either paths or URLs to FASTQ/BAM data (must be surrounded with quotes). For paired end data, the path must use '{1,2}' notation to specify read pairs. Alternatively, a path to a TSV file (ending .tsv) containing file paths and sequencing/sample metadata. Allows for merging of multiple lanes/libraries/samples. Please see documentation for template.", "fa_icon": "fas fa-dna", "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager. The most efficient but more simplistic is supplying direct paths (with wildcards) to your FASTQ or BAM files, with each file or pair being considered a single library and each one run independently (e.g. for paired-end data: `--input '///*_{R1,R2}_*.fq.gz'`). TSV input requires creation of an extra file by the user (`--input '///eager_data.tsv'`) and extra metadata, but allows more powerful lane and library merging. Please see [usage docs](https://nf-co.re/eager/docs/usage#input-specifications) for detailed instructions and specifications." }, "udg_type": { "type": "string", "default": "none", "description": "Specifies whether you have UDG treated libraries. Set to 'half' for partial treatment, or 'full' for UDG. 
If not set, libraries are assumed to have no UDG treatment ('none'). Not required for TSV input.", "fa_icon": "fas fa-vial", "help_text": "Defines whether Uracil-DNA glycosylase (UDG) treatment was used to remove DNA\ndamage on the sequencing libraries.\n\nSpecify `'none'` if no treatment was performed. If you have partial UDG treated\ndata ([Rohland et al 2016](http://dx.doi.org/10.1098/rstb.2013.0624)), specify\n`'half'`. If you have complete UDG treated data ([Briggs et al.\n2010](https://doi.org/10.1093/nar/gkp1163)), specify `'full'`. \n\nWhen also using PMDtools specifying `'half'` will use a different model for DNA\ndamage assessment in PMDTools (PMDtools: `--UDGhalf`). Specify `'full'` and the\nPMDtools DNA damage assessment will use CpG context only (PMDtools: `--CpG`).\nDefault: `'none'`.\n\n> **Tip**: You should provide a small decoy reference genome with pre-made indices, e.g.\n> the human mtDNA genome, for the mandatory parameter `--fasta` in order to\n> avoid long computational time for generating the index files of the reference\n> genome, even if you do not actually need a reference genome for any downstream\n> analyses.", "enum": [ "none", "half", "full" ] }, "single_stranded": { "type": "boolean", "description": "Specifies that libraries are single stranded. Always affects MALTExtract but will be ignored by pileupCaller with TSV input. Not required for TSV input.", "fa_icon": "fas fa-minus", "help_text": "Indicates libraries are single stranded.\n\nCurrently only affects MALTExtract where it will switch on damage patterns\ncalculation mode to single-stranded, (MaltExtract: `--singleStranded`) and\ngenotyping with pileupCaller where a different method is used (pileupCaller:\n`--singleStrandMode`). Default: false\n\nOnly required when using the 'Path' method of `--input`" }, "single_end": { "type": "boolean", "description": "Specifies that the input is single end reads. 
Not required for TSV input.", "fa_icon": "fas fa-align-left", "help_text": "By default, the pipeline expects paired-end data. If you have single-end data, specify this parameter on the command line when you launch the pipeline. It is not possible to run a mixture of single-end and paired-end files in one run.\n\nOnly required when using the 'Path' method of `--input`" }, "colour_chemistry": { "type": "integer", "default": 4, "description": "Specifies which Illumina sequencing chemistry was used. Used to inform whether to poly-G trim if turned on (see below). Not required for TSV input. Options: 2, 4.", "fa_icon": "fas fa-palette", "help_text": "Specifies which Illumina colour chemistry a library was sequenced with. This informs whether to perform poly-G trimming (if `--complexity_filter_poly_g` is also supplied). Only 2 colour chemistry sequencers (e.g. NextSeq or NovaSeq) can generate uncertain poly-G tails (due to 'G' being indicated via a no-colour detection). Default is '4' to indicate e.g. HiSeq or MiSeq platforms, which do not require poly-G trimming. Options: 2, 4. Default: 4\n\nOnly required when using the 'Path' method of input." }, "bam": { "type": "boolean", "description": "Specifies that the input is in BAM format. Not required for TSV input.", "fa_icon": "fas fa-align-justify", "help_text": "Specifies the input file type to `--input` is in BAM format. This will automatically also apply `--single_end`.\n\nOnly required when using the 'Path' method of `--input`.\n" } }, "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager.\nThe most efficient but more simplistic is supplying direct paths (with\nwildcards) to your FASTQ or BAM files, with each file or pair being considered a\nsingle library and each one run independently. TSV input requires creation of an\nextra file by the user and extra metadata, but allows more powerful lane and\nlibrary merging." 
}, "input_data_additional_options": { "title": "Input Data Additional Options", "type": "object", "description": "Additional options regarding input data.", "default": "", "properties": { "snpcapture_bed": { "type": "string", "fa_icon": "fas fa-magnet", "description": "If the library is the result of SNP capture, path to a BED file containing SNP positions on the reference genome. SNP statistics are reported only in the qualimap results directory, not in MultiQC.", "help_text": "Can be used to set a path to a BED file (3/6 column format) of SNP positions of a reference genome, to calculate the on-target efficiency of SNP-captured libraries. This should be used for array or in-solution SNP capture protocols such as 390K, 1240K, etc. If supplied, some on-target metrics are automatically generated for you by qualimap in the 'Globals inside' section of the 'genome_results.txt' file in the qualimap results directory. These statistics are currently NOT displayed in MultiQC!" }, "run_convertinputbam": { "type": "boolean", "description": "Turns on conversion of an input BAM file into FASTQ format to allow re-preprocessing (e.g. AdapterRemoval etc.).", "fa_icon": "fas fa-undo-alt", "help_text": "Allows you to convert an input BAM file back to FASTQ for downstream processing. Note this is required if you need to perform AdapterRemoval and/or polyG clipping.\n\nIf not turned on, BAMs will automatically be sent to post-mapping steps." } }, "fa_icon": "far fa-plus-square" }, "reference_genome_options": { "title": "Reference genome options", "type": "object", "fa_icon": "fas fa-dna", "properties": { "fasta": { "type": "string", "fa_icon": "fas fa-font", "description": "Path or URL to a FASTA reference file (required if not iGenome reference). File suffixes can be: '.fa', '.fn', '.fna', '.fasta'.", "help_text": "You specify the full path to your reference genome here. The FASTA file can have any file suffix, such as `.fasta`, `.fna`, `.fa`, `.FastA` etc.
You may also supply a gzipped reference file, which will be unzipped automatically for you.\n\nFor example:\n\n```bash\n--fasta '///my_reference.fasta'\n```\n\n> If you don't specify appropriate `--bwa_index`, `--fasta_index` parameters, the pipeline will create these indices for you automatically. Note that you can save the indices created for you for later by giving the `--save_reference` flag.\n> You must supply either `--fasta` or `--genome`.\n" }, "genome": { "type": "string", "description": "Name of iGenomes reference (required if not FASTA reference). Requires argument `--igenomes_ignore false`, as iGenomes is ignored by default in nf-core/eager", "fa_icon": "fas fa-book", "help_text": "Alternatively to `--fasta`, the pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with docker or AWS, the configuration is set up to use the [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) resource.\n\nThere are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the `--genome` flag.\n\nYou can find the keys to specify the genomes in the [iGenomes config file](../conf/igenomes.config). Common genomes that are supported are:\n\n- Human\n - `--genome GRCh37`\n - `--genome GRCh38`\n- Mouse *\n - `--genome GRCm38`\n- _Drosophila_ *\n - `--genome BDGP6`\n- _S. cerevisiae_ *\n - `--genome 'R64-1-1'`\n\n> \\* Not bundled with nf-core eager by default.\n\nNote that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource.
See the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for instructions on where to save such a file.\n\nThe syntax for this reference configuration is as follows:\n\n```nextflow\nparams {\n genomes {\n 'GRCh37' {\n fasta = ''\n }\n // Any number of additional genomes, key is used with --genome\n }\n}\n```\n\n**NB** Requires argument `--igenomes_ignore false`, as iGenomes is ignored by default in nf-core/eager." }, "igenomes_base": { "type": "string", "description": "Directory / URL base for iGenomes references.", "default": "s3://ngi-igenomes/igenomes", "fa_icon": "fas fa-cloud-download-alt", "hidden": true }, "igenomes_ignore": { "type": "boolean", "description": "Do not load the iGenomes reference config.", "fa_icon": "fas fa-ban", "hidden": true, "help_text": "Do not load `igenomes.config` when running the pipeline. You may choose this option if you observe clashes between custom parameters and those supplied in `igenomes.config`." }, "bwa_index": { "type": "string", "description": "Path to directory containing pre-made BWA indices (i.e. the directory before the files ending in '.amb' '.ann' '.bwt'. Do not include the files themselves. Most likely the same directory of the file provided with --fasta). If not supplied will be made for you.", "fa_icon": "fas fa-address-book", "help_text": "If you want to use pre-existing `bwa index` indices, please supply the **directory** containing the FASTA you also specified in `--fasta`. nf-core/eager will automagically detect the index files by searching for the FASTA filename with the corresponding `bwa` index file suffixes.\n\nFor example:\n\n```bash\nnextflow run nf-core/eager \\\n-profile test,docker \\\n--input '*{R1,R2}*.fq.gz' \\\n--fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \\\n--bwa_index 'results/reference_genome/bwa_index/BWAIndex/'\n```\n\n> `bwa index` does not give you an option to supply alternative suffixes/names for these indices.
Thus, the file names generated by this command _must not_ be changed, otherwise nf-core/eager will not be able to find them." }, "bt2_index": { "type": "string", "description": "Path to directory containing pre-made Bowtie2 indices (i.e. everything before the endings e.g. '.1.bt2', '.2.bt2', '.rev.1.bt2'. Most likely the same value as --fasta). If not supplied will be made for you.", "fa_icon": "far fa-address-book", "help_text": "If you want to use pre-existing `bowtie2-build` indices, please supply the **directory** containing the FASTA you also specified in `--fasta`. nf-core/eager will automagically detect the index files by searching for the FASTA filename with the corresponding `bt2` index file suffixes.\n\nFor example:\n\n```bash\nnextflow run nf-core/eager \\\n-profile test,docker \\\n--input '*{R1,R2}*.fq.gz' \\\n--fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \\\n--bt2_index 'results/reference_genome/bt2_index/BT2Index/'\n```\n\n> `bowtie2-build` does not give you an option to supply alternative suffixes/names for these indices. Thus, the file names generated by this command _must not_ be changed, otherwise nf-core/eager will not be able to find them." }, "fasta_index": { "type": "string", "description": "Path to samtools FASTA index (typically ending in '.fai'). If not supplied will be made for you.", "fa_icon": "far fa-bookmark", "help_text": "If you want to use a pre-existing `samtools faidx` index, use this to specify the required FASTA index file for the selected reference genome. This should be generated by `samtools faidx` and has a file suffix of `.fai`.\n\nFor example:\n\n```bash\n--fasta_index 'Mammoth_MT_Krause.fasta.fai'\n```" }, "seq_dict": { "type": "string", "description": "Path to picard sequence dictionary file (typically ending in '.dict').
If not supplied will be made for you.", "fa_icon": "fas fa-spell-check", "help_text": "If you want to use a pre-existing `picard CreateSequenceDictionary` dictionary file, use this to specify the required `.dict` file for the selected reference genome.\n\nFor example:\n\n```bash\n--seq_dict 'Mammoth_MT_Krause.dict'\n```" }, "large_ref": { "type": "boolean", "description": "Specify to generate more recent '.csi' BAM indices. If your reference genome is larger than 3.5GB, this is recommended due to more efficient data handling with the '.csi' format over the older '.bai'.", "fa_icon": "fas fa-mountain", "help_text": "This parameter is required to be set for large reference genomes. If your\nreference genome is larger than 3.5GB, the `samtools index` calls in the\npipeline need to generate `CSI` indices instead of `BAI` indices to compensate\nfor the size of the reference genome (with samtools: `-c`). This parameter is\nnot required for smaller references (including the human `hg19` or\n`grch37`/`grch38` references), but `>4GB` genomes have been shown to need `CSI`\nindices. Default: off\n\n> modifies SAMtools index command: `-c`" }, "save_reference": { "type": "boolean", "description": "If not already supplied by user, turns on saving of generated reference genome indices for later re-usage.", "fa_icon": "far fa-save", "help_text": "Use this if you do not have pre-made reference FASTA indices for `bwa`, `samtools` and `picard`. If you turn this on, the indices nf-core/eager generates for you will be saved in `/results/reference_genomes`. If not supplied, indices generated by nf-core/eager will be deleted." } }, "description": "Specify locations of references and optionally, additional pre-made indices", "help_text": "All nf-core/eager runs require a reference genome in FASTA format to map reads\nagainst.\n\nIn addition we provide various options for indexing of different types of\nreference genomes (based on the tools used in the pipeline).
nf-core/eager can\nindex reference genomes for you (with options to save these for other analysis),\nbut you can also supply your pre-made indices.\n\nSupplying pre-made indices saves time in pipeline execution and is especially\nadvised when running multiple times on the same cluster system for example. You\ncan even add a resource [specific profile](#profile) that sets paths to\npre-computed reference genomes, saving time when specifying these.\n\n> :warning: you must always supply a reference file. If you want to use\n functionality that does not require one, supply a small decoy genome such as\n phiX or the human mtDNA genome." }, "output_options": { "title": "Output options", "type": "object", "description": "Specify where to put output files and optional saving of intermediate files", "default": "", "properties": { "outdir": { "type": "string", "description": "The output directory where the results will be saved.", "default": "./results", "fa_icon": "fas fa-folder-open", "help_text": "The output directory where the results will be saved. By default will be made in the directory you run the command in under `./results`." }, "publish_dir_mode": { "type": "string", "default": "copy", "hidden": true, "description": "Method used to save pipeline results to output directory.", "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. 
See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.", "fa_icon": "fas fa-copy", "enum": [ "symlink", "rellink", "link", "copy", "copyNoFollow", "move" ] } }, "fa_icon": "fas fa-cloud-download-alt" }, "generic_options": { "title": "Generic options", "type": "object", "properties": { "help": { "type": "boolean", "description": "Display help text.", "hidden": true, "fa_icon": "fas fa-question-circle" }, "validate_params": { "type": "boolean", "description": "Boolean whether to validate parameters against the schema at runtime", "default": true, "fa_icon": "fas fa-check-square", "hidden": true }, "email": { "type": "string", "description": "Email address for completion summary.", "fa_icon": "fas fa-envelope", "help_text": "An email address to send a summary email to when the pipeline is completed.", "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$" }, "email_on_fail": { "type": "string", "description": "Email address for completion summary, only when pipeline fails.", "fa_icon": "fas fa-exclamation-triangle", "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$", "hidden": true, "help_text": "Set this parameter to your e-mail address to get a summary e-mail with details of the run if it **fails**. Normally would be the same as in `--email` but can be different. If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run.\n\n> Note that this functionality requires either `mail` or `sendmail` to be installed on your system." }, "plaintext_email": { "type": "boolean", "description": "Send plain-text email instead of HTML.", "fa_icon": "fas fa-remove-format", "hidden": true, "help_text": "Set to receive plain-text e-mails instead of HTML formatted." 
}, "max_multiqc_email_size": { "type": "string", "description": "File size limit when attaching MultiQC reports to summary emails.", "default": "25.MB", "fa_icon": "fas fa-file-upload", "hidden": true, "help_text": "If file generated by pipeline exceeds the threshold, it will not be attached." }, "monochrome_logs": { "type": "boolean", "description": "Do not use coloured log outputs.", "fa_icon": "fas fa-palette", "hidden": true, "help_text": "Set to disable colourful command line output and live life in monochrome." }, "multiqc_config": { "type": "string", "description": "Custom config file to supply to MultiQC.", "fa_icon": "fas fa-cog", "hidden": true }, "tracedir": { "type": "string", "description": "Directory to keep pipeline Nextflow logs and reports.", "default": "${params.outdir}/pipeline_info", "fa_icon": "fas fa-cogs", "hidden": true }, "show_hidden_params": { "type": "boolean", "fa_icon": "far fa-eye-slash", "description": "Show all params when using `--help`", "hidden": true, "help_text": "By default, parameters set as _hidden_ in the schema are not shown on the command line when a user runs with `--help`. Specifying this option will tell the pipeline to show all parameters." }, "enable_conda": { "type": "boolean", "hidden": true, "description": "Parameter used for checking conda channels to be set correctly." }, "schema_ignore_params": { "type": "string", "fa_icon": "fas fa-not-equal", "description": "String to specify ignored parameters for parameter validation", "hidden": true, "default": "genomes" } }, "fa_icon": "fas fa-file-import", "description": "Less common options for the pipeline, typically set in a config file.", "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs.\n\nTypically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`." 
}, "max_job_request_options": { "title": "Max job request options", "type": "object", "fa_icon": "fab fa-acquisitions-incorporated", "description": "Set the top limit for requested resources for any single job.", "help_text": "If you are running on a smaller system, a pipeline step requesting more resources than are available may cause Nextflow to stop the run with an error. These options allow you to cap the maximum resources requested by any single job so that the pipeline will run on your system.\n\nNote that you cannot _increase_ the resources requested by any job using these options. For that you will need your own configuration file. See [the nf-core website](https://nf-co.re/usage/configuration) for details.", "properties": { "max_cpus": { "type": "integer", "description": "Maximum number of CPUs that can be requested for any single job.", "default": 16, "fa_icon": "fas fa-microchip", "hidden": true, "help_text": "Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. `--max_cpus 1`" }, "max_memory": { "type": "string", "description": "Maximum amount of memory that can be requested for any single job.", "default": "128.GB", "fa_icon": "fas fa-memory", "pattern": "^\\d+(\\.\\d+)?\\.?\\s*(K|M|G|T)?B$", "hidden": true, "help_text": "Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. `--max_memory '8.GB'`" }, "max_time": { "type": "string", "description": "Maximum amount of time that can be requested for any single job.", "default": "240.h", "fa_icon": "far fa-clock", "pattern": "^(\\d+\\.?\\s*(s|m|h|day)\\s*)+$", "hidden": true, "help_text": "Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g.
`--max_time '2.h'`" } } }, "institutional_config_options": { "title": "Institutional config options", "type": "object", "fa_icon": "fas fa-university", "description": "Parameters used to describe centralised config profiles. These generally should not be edited.", "help_text": "The centralised nf-core configuration profiles use a handful of pipeline parameters to describe themselves. This information is then printed to the Nextflow log when you run a pipeline. You should not need to change these values when you run a pipeline.", "properties": { "custom_config_version": { "type": "string", "description": "Git commit id for Institutional configs.", "default": "master", "hidden": true, "fa_icon": "fas fa-users-cog", "help_text": "Provide git commit id for custom Institutional configs hosted at `nf-core/configs`. This was implemented for reproducibility purposes. Default: `master`.\n\n```bash\n## Download and use config file with following git commit id\n--custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96\n```" }, "custom_config_base": { "type": "string", "description": "Base directory for Institutional configs.", "default": "https://raw.githubusercontent.com/nf-core/configs/master", "hidden": true, "help_text": "If you're running offline, nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell nextflow where to find them with the `custom_config_base` option. 
For example:\n\n```bash\n## Download and unzip the config files\ncd /path/to/my/configs\nwget https://github.com/nf-core/configs/archive/master.zip\nunzip master.zip\n\n## Run the pipeline\ncd /path/to/my/data\nnextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/\n```\n\n> Note that the nf-core/tools helper package has a `download` command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier.", "fa_icon": "fas fa-users-cog" }, "hostnames": { "type": "string", "description": "Institutional configs hostname.", "hidden": true, "fa_icon": "fas fa-users-cog" }, "config_profile_name": { "type": "string", "description": "Institutional config name.", "hidden": true, "fa_icon": "fas fa-users-cog" }, "config_profile_description": { "type": "string", "description": "Institutional config description.", "hidden": true, "fa_icon": "fas fa-users-cog" }, "config_profile_contact": { "type": "string", "description": "Institutional config contact information.", "hidden": true, "fa_icon": "fas fa-users-cog" }, "config_profile_url": { "type": "string", "description": "Institutional config URL link.", "hidden": true, "fa_icon": "fas fa-users-cog" }, "awsqueue": { "type": "string", "description": "The AWSBatch JobQueue that needs to be set when running on AWSBatch", "fa_icon": "fab fa-aws" }, "awsregion": { "type": "string", "default": "eu-west-1", "description": "The AWS Region for your AWS Batch job to run on", "fa_icon": "fab fa-aws" }, "awscli": { "type": "string", "description": "Path to the AWS CLI tool", "fa_icon": "fab fa-aws" } } }, "skip_steps": { "title": "Skip steps", "type": "object", "description": "Skip any of the mentioned steps.", "default": "", "properties": { "skip_fastqc": { "type": "boolean", "fa_icon": "fas fa-fast-forward", "help_text": "Turns off FastQC pre- and post-Adapter Removal, to speed up the pipeline. 
Use of this flag is most common when data has been previously pre-processed and the post-Adapter Removal mapped reads are being re-mapped to a new reference genome." }, "skip_adapterremoval": { "type": "boolean", "fa_icon": "fas fa-fast-forward", "help_text": "Turns off adapter trimming and paired-end read merging. Equivalent to setting both `--skip_collapse` and `--skip_trim`." }, "skip_preseq": { "type": "boolean", "fa_icon": "fas fa-fast-forward", "help_text": "Turns off the computation of library complexity estimation." }, "skip_deduplication": { "type": "boolean", "fa_icon": "fas fa-fast-forward", "help_text": "Turns off duplicate removal methods DeDup and MarkDuplicates respectively. No duplicates will be removed on any data in the pipeline.\n" }, "skip_damage_calculation": { "type": "boolean", "fa_icon": "fas fa-fast-forward", "help_text": "Turns off the DamageProfiler module to compute DNA damage profiles.\n" }, "skip_qualimap": { "type": "boolean", "fa_icon": "fas fa-fast-forward", "help_text": "Turns off QualiMap and thus does not compute coverage and other mapping metrics.\n" } }, "fa_icon": "fas fa-fast-forward", "help_text": "Some of the steps in the pipeline can be executed optionally. If you specify\nspecific steps to be skipped, there won't be any output related to these\nmodules." }, "complexity_filtering": { "title": "Complexity filtering", "type": "object", "description": "Processing of Illumina two-colour chemistry data.", "default": "", "properties": { "complexity_filter_poly_g": { "type": "boolean", "description": "Turn on running poly-G removal on FASTQ files. Will only be performed on 2 colour chemistry machine sequenced libraries.", "fa_icon": "fas fa-power-off", "help_text": "Performs a poly-G tail removal step in the beginning of the pipeline using `fastp`, if turned on. 
This can be useful for trimming poly-G tails from short fragments sequenced on two-colour Illumina chemistry such as NextSeqs (where no-fluorescence is read as a G on two-colour chemistry), which can inflate reported GC content values.\n" }, "complexity_filter_poly_g_min": { "type": "integer", "default": 10, "description": "Specify the minimum poly-G length for clipping to be performed.", "fa_icon": "fas fa-ruler-horizontal", "help_text": "This option can be used to define the minimum length of a poly-G tail to begin low complexity trimming. By default, this is set to a value of `10` unless the user has chosen something specifically using this option.\n\n> Modifies fastp parameter: `--poly_g_min_len`" } }, "fa_icon": "fas fa-filter", "help_text": "More details can be seen in the [fastp\ndocumentation](https://github.com/OpenGene/fastp)\n\nIf using TSV input, this is performed per lane separately" }, "read_merging_and_adapter_removal": { "title": "Read merging and adapter removal", "type": "object", "description": "Options for adapter clipping and paired-end merging.", "default": "", "properties": { "clip_forward_adaptor": { "type": "string", "default": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC", "description": "Specify adapter sequence to be clipped off (forward strand).", "fa_icon": "fas fa-cut", "help_text": "Defines the adapter sequence to be used for the forward read. By default, this is set to `'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'`.\n\n> Modifies AdapterRemoval parameter: `--adapter1`" }, "clip_reverse_adaptor": { "type": "string", "default": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA", "description": "Specify adapter sequence to be clipped off (reverse strand).", "fa_icon": "fas fa-cut", "help_text": "Defines the adapter sequence to be used for the reverse read in paired end sequencing projects.
This is set to `'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'` by default.\n\n> Modifies AdapterRemoval parameter: `--adapter2`" }, "clip_adapters_list": { "type": "string", "description": "Path to AdapterRemoval adapter list file. Overrides `--clip_*_adaptor` parameters", "fa_icon": "fas fa-cut", "help_text": "Allows you to supply a file with a list of adapters (or adapter combinations) to remove from all files. **Overrides** the `--clip_*_adaptor` parameters. The first column represents the forward strand, the second column the reverse strand. You must supply all possible combinations, one per line, and this list is applied to all files. See [AdapterRemoval documentation](https://adapterremoval.readthedocs.io/en/latest/manpage.html) for more information.\n\n> Modifies AdapterRemoval parameter: `--adapter-list`" }, "clip_readlength": { "type": "integer", "default": 30, "description": "Specify read minimum length to be kept for downstream analysis.", "fa_icon": "fas fa-ruler", "help_text": "Defines the minimum read length required for reads after merging to be considered for downstream analysis. Default is `30`.\n\nNote that when you have a large percentage of very short reads in your library (< 20 bp), such as those retrieved in single-stranded library protocols, performing read length filtering at this step is not _always_ reliable for correct endogenous DNA calculation. When you have very few reads passing this length filter, it will artificially inflate your 'endogenous DNA' value by creating a very small denominator.\n\nIf you notice you have ultra short reads (< 20 bp), it is recommended to set this parameter to 0, and use `--bam_filter_minreadlength` instead, to filter out 'un-usable' short reads after mapping.
A caveat, however, is that this will cause a very large increase in computational run time, because all reads in the library will be mapped.\n\n> Modifies AdapterRemoval parameter: `--minlength`\n" }, "clip_min_read_quality": { "type": "integer", "default": 20, "description": "Specify minimum base quality for trimming off bases.", "fa_icon": "fas fa-medal", "help_text": "Defines the minimum read quality per base that is required for a base to be kept. Individual bases at the ends of reads falling below this threshold will be clipped off. Default is set to `20`.\n\n> Modifies AdapterRemoval parameter: `--minquality`" }, "min_adap_overlap": { "type": "integer", "default": 1, "description": "Specify minimum adapter overlap required for clipping.", "fa_icon": "fas fa-hands-helping", "help_text": "Specifies a minimum number of bases that overlap with the adapter sequence before adapters are trimmed from reads. Default is set to `1` base overlap.\n\n> Modifies AdapterRemoval parameter: `--minadapteroverlap`" }, "skip_collapse": { "type": "boolean", "description": "Skip merging of forward and reverse reads and turn on paired-end alignment for downstream mapping. Only applicable for paired-end libraries.", "fa_icon": "fas fa-fast-forward", "help_text": "Turns off the paired-end read merging.\n\nFor example\n\n```bash\n--skip_collapse --input '*_{R1,R2}_*.fastq'\n```\n\nIt is important to use the paired-end wildcard globbing as `--skip_collapse` can only be used on paired-end data!\n\n:warning: If you run this with `--clip_readlength` also set (as it is by default), you may end up removing single reads from either the pair1 or pair2 file. These will NOT be mapped when aligning with either `bwa` or `bowtie`, as both can only accept one (forward) or two (forward and reverse) FASTQs as input.\n\nAlso note that supplying this flag will then also cause downstream mapping steps to run in paired-end mode.
This may be more suitable for modern data, or when you want to utilise mate-pair spatial information.\n\n> Modifies AdapterRemoval parameter: `--collapse`" }, "skip_trim": { "type": "boolean", "description": "Skip adapter and quality trimming.", "fa_icon": "fas fa-fast-forward", "help_text": "Turns off adapter AND quality trimming.\n\nFor example:\n\n```bash\n--skip_trim --input '*.fastq'\n```\n\n:warning: it is not possible to keep quality trimming (n or base quality) on,\n_and_ skip adapter trimming.\n\n:warning: it is not possible to turn off one or the other of quality\ntrimming or n trimming. i.e. --trimns --trimqualities are both given\nor neither. However setting quality in `--clip_min_read_quality` to 0 would\ntheoretically turn off base quality trimming.\n\n> Modifies AdapterRemoval parameters: `--trimns --trimqualities --adapter1 --adapter2`" }, "preserve5p": { "type": "boolean", "description": "Skip quality base trimming (n, score, window) of 5 prime end.", "fa_icon": "fas fa-life-ring", "help_text": "Turns off quality based trimming at the 5p end of reads when any of the --trimns, --trimqualities, or --trimwindows options are used. Only 3p end of reads will be removed.\n\nThis also entirely disables quality based trimming of collapsed reads, since both ends of these are informative for PCR duplicate filtering. Described [here](https://github.com/MikkelSchubert/adapterremoval/issues/32#issuecomment-504758137).\n\n> Modifies AdapterRemoval parameters: `--preserve5p`" }, "mergedonly": { "type": "boolean", "description": "Only use merged reads downstream (un-merged reads and singletons are discarded).", "fa_icon": "fas fa-handshake", "help_text": "Specify that only merged reads are sent downstream for analysis.\n\nSingletons (i.e. 
reads missing a pair), or un-merged reads (where there wasn't sufficient overlap) are discarded.\n\nYou may want to use this if you want to ensure only the best quality reads are used for your analysis, but with the penalty of potentially losing still valid data (even if some reads have slightly lower quality). It is highly recommended when using `--dedupper 'dedup'` (see below)." }, "qualitymax": { "type": "integer", "description": "Specify the maximum Phred score used in input FASTQ files", "help_text": "Specify maximum Phred score of the quality field of FASTQ files. The quality-score range can vary depending on the machine and version (e.g. see diagram [here](https://en.wikipedia.org/wiki/FASTQ_format#Encoding)), and this allows you to increase from the default AdapterRemoval value of `41`.\n\n> Modifies AdapterRemoval parameters: `--qualitymax`", "default": 41, "fa_icon": "fas fa-arrow-up" }, "run_post_ar_trimming": { "type": "boolean", "description": "Turn on trimming of inline barcodes (i.e. internal barcodes after adapter removal)", "help_text": "In some cases, you may want to additionally trim reads in a FASTQ file after adapter removal.\n\nThis could be to remove short 'inline' or 'internal' barcodes that are ligated directly onto DNA molecules prior to ligation of adapters and indices (the former of which allow ultra-multiplexing and/or checks for barcode hopping).\n\nIn other cases, you may wish to already remove known high-frequency damage bases to allow stricter mapping.\n\nTurning on this module uses `fastp` to trim one or both ends of a merged read, or in cases where you have not collapsed your reads, R1 and R2.\n" }, "post_ar_trim_front": { "type": "integer", "default": 7, "description": "Specify the number of bases to trim off the front of a merged read or R1", "help_text": "Specify the number of bases to trim off the start of a read in a merged- or forward read FASTQ file.\n\n> Modifies fastp parameters: `--trim_front1`" }, "post_ar_trim_tail": { "type":
"integer", "default": 7, "description": "Specify the number of bases to trim off the tail of a merged read or R1", "help_text": "Specify the number of bases to trim off the end of a read in a merged- or forward read FASTQ file.\n\n> Modifies fastp parameters: `--trim_tail1`" }, "post_ar_trim_front2": { "type": "integer", "default": 7, "description": "Specify the number of bases to trim off the front of R2", "help_text": "Specify the number of bases to trim off the start of a read in an unmerged reverse read (R2) FASTQ file.\n\n> Modifies fastp parameters: `--trim_front2`" }, "post_ar_trim_tail2": { "type": "integer", "default": 7, "description": "Specify the number of bases to trim off the tail of R2", "help_text": "Specify the number of bases to trim off the end of a read in an unmerged reverse read (R2) FASTQ file.\n\n> Modifies fastp parameters: `--trim_tail2`" } }, "fa_icon": "fas fa-cut", "help_text": "These options handle various parts of adapter clipping and read merging steps.\n\nMore details can be seen in the [AdapterRemoval\ndocumentation](https://adapterremoval.readthedocs.io/en/latest/)\n\nIf using TSV input, this is performed per lane separately.\n\n> :warning: `--skip_trim` will skip adapter clipping AND quality trimming\n> (n, base quality). It is currently not possible to skip one or the other." }, "mapping": { "title": "Read mapping to reference genome", "type": "object", "description": "Options for reference-genome mapping", "default": "", "properties": { "mapper": { "title": "Mapper", "type": "string", "description": "Specify which mapper to use. Options: 'bwaaln', 'bwamem', 'circularmapper', 'bowtie2'.", "default": "bwaaln", "fa_icon": "fas fa-layer-group", "help_text": "Specify which mapping tool to use. Options are BWA aln (`'bwaaln'`), BWA mem (`'bwamem'`), circularmapper (`'circularmapper'`), or bowtie2 (`bowtie2`). BWA aln is the default and highly suited for short-read ancient DNA.
BWA mem can be quite useful for modern DNA, but is rarely used in projects for ancient DNA. CircularMapper enhances the mapping procedure to circular references, using the BWA algorithm but utilizing a extend-remap procedure (see Peltzer et al 2016, Genome Biology for details). Bowtie2 is similar to BWA aln, and has recently been suggested to provide slightly better results under certain conditions ([Poullet and Orlando 2020](https://doi.org/10.3389/fevo.2020.00105)), as well as providing extra functionality (such as FASTQ trimming). Default is 'bwaaln'\n\nMore documentation can be seen for each tool under:\n\n- [BWA aln](http://bio-bwa.sourceforge.net/bwa.shtml#3)\n- [BWA mem](http://bio-bwa.sourceforge.net/bwa.shtml#3)\n- [CircularMapper](https://circularmapper.readthedocs.io/en/latest/contents/userguide.html)\n- [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line)\n", "enum": [ "bwaaln", "bwamem", "circularmapper", "bowtie2" ] }, "bwaalnn": { "type": "number", "default": 0.01, "description": "Specify the -n parameter for BWA aln, i.e. amount of allowed mismatches in the alignment.", "fa_icon": "fas fa-sort-numeric-down", "help_text": "Configures the `bwa aln -n` parameter, defining how many mismatches are allowed in a read. By default set to `0.04` (following recommendations of [Schubert et al. (2012 _BMC Genomics_)](https://doi.org/10.1186/1471-2164-13-178)), if you're uncertain what to set check out [this](https://apeltzer.shinyapps.io/bwa-mismatches/) Shiny App for more information on how to set this parameter efficiently.\n\n> Modifies bwa aln parameter: `-n`" }, "bwaalnk": { "type": "integer", "default": 2, "description": "Specify the -k parameter for BWA aln, i.e. maximum edit distance allowed in a seed.", "fa_icon": "fas fa-drafting-compass", "help_text": "Configures the `bwa aln -k` parameter for the seeding phase in the mapping algorithm. 
Default is set to `2`.\n\n> Modifies BWA aln parameter: `-k`" }, "bwaalnl": { "type": "integer", "default": 1024, "description": "Specify the -l parameter for BWA aln i.e. the length of seeds to be used.", "fa_icon": "fas fa-ruler-horizontal", "help_text": "Configures the length of the seed used in `bwa aln -l`. Default is set to be 'turned off' at the recommendation of Schubert et al. ([2012 _BMC Genomics_](https://doi.org/10.1186/1471-2164-13-178)) for ancient DNA with `1024`.\n\nNote: Despite being recommended, turning off seeding can result in long runtimes!\n\n> Modifies BWA aln parameter: `-l`\n" }, "bwaalno": { "type": "integer", "default": 2, "fa_icon": "fas fa-people-arrows", "description": "Specify the -o parameter for BWA aln i.e. the number of gaps allowed.", "help_text": "Configures the number of gaps used in `bwa aln`. Default is set to `bwa` default.\n\n> Modifies BWA aln parameter: `-o`\n" }, "circularextension": { "type": "integer", "default": 500, "description": "Specify the number of bases to extend reference by (circularmapper only).", "fa_icon": "fas fa-external-link-alt", "help_text": "The number of bases to extend the reference genome with. By default this is set to `500` if not specified otherwise.\n\n> Modifies circulargenerator and realignsamfile parameter: `-e`" }, "circulartarget": { "type": "string", "default": "MT", "description": "Specify the FASTA header of the target chromosome to extend (circularmapper only).", "fa_icon": "fas fa-bullseye", "help_text": "The chromosome in your FASTA reference that you'd like to be treated as circular. 
By default this is set to `MT` but can be configured to match any other chromosome.\n\n> Modifies circulargenerator parameter: `-s`" }, "circularfilter": { "type": "boolean", "description": "Turn on to remove reads that did not map to the circularised genome (circularmapper only).", "fa_icon": "fas fa-filter", "help_text": "If you want to filter out reads that don't map to a circular chromosome (and also non-circular chromosome headers) from the resulting BAM file, turn this on. By default this option is turned off.\n> Modifies -f and -x parameters of CircularMapper's realignsamfile\n" }, "bt2_alignmode": { "type": "string", "default": "local", "description": "Specify the bowtie2 alignment mode. Options: 'local', 'end-to-end'.", "fa_icon": "fas fa-arrows-alt-h", "help_text": "The type of read alignment to use. Options are 'local' or 'end-to-end'. Local allows only partial alignment of read, with ends of reads possibly 'soft-clipped' (i.e. remain unaligned/ignored), if the soft-clipped alignment provides best alignment score. End-to-end requires all nucleotides to be aligned. Default is 'local', following [Cahill et al (2018)](https://doi.org/10.1093/molbev/msy018) and [Poullet and Orlando 2020](https://doi.org/10.3389/fevo.2020.00105).\n\n> Modifies Bowtie2 parameters: `--very-fast --fast --sensitive --very-sensitive --very-fast-local --fast-local --sensitive-local --very-sensitive-local`", "enum": [ "local", "end-to-end" ] }, "bt2_sensitivity": { "type": "string", "default": "sensitive", "description": "Specify the level of sensitivity for the bowtie2 alignment mode. Options: 'no-preset', 'very-fast', 'fast', 'sensitive', 'very-sensitive'.", "fa_icon": "fas fa-microscope", "help_text": "The Bowtie2 'preset' to use. Options: 'no-preset' 'very-fast', 'fast', 'sensitive', or 'very-sensitive'. These strings apply to both `--bt2_alignmode` options. See the Bowtie2 [manual](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line) for actual settings. 
Default is 'sensitive' (following [Poullet and Orlando (2020)](https://doi.org/10.3389/fevo.2020.00105), when running damaged-data _without_ UDG treatment)\n\n> Modifies Bowtie2 parameters: `--very-fast --fast --sensitive --very-sensitive --very-fast-local --fast-local --sensitive-local --very-sensitive-local`", "enum": [ "no-preset", "very-fast", "fast", "sensitive", "very-sensitive" ] }, "bt2n": { "type": "integer", "description": "Specify the -N parameter for bowtie2 (mismatches in seed). This will override defaults from alignmode/sensitivity.", "fa_icon": "fas fa-sort-numeric-down", "help_text": "The number of mismatches allowed in the seed during seed-and-extend procedure of Bowtie2. This will override any values set with `--bt2_sensitivity`. Can either be 0 or 1. Default: 0 (i.e. use `--bt2_sensitivity` defaults).\n\n> Modifies Bowtie2 parameters: `-N`", "default": 0 }, "bt2l": { "type": "integer", "description": "Specify the -L parameter for bowtie2 (length of seed substrings). This will override defaults from alignmode/sensitivity.", "fa_icon": "fas fa-ruler-horizontal", "help_text": "The length of the seed sub-string to use during seeding. This will override any values set with `--bt2_sensitivity`. Default: 0 (i.e. use `--bt2_sensitivity` defaults: [20 for local and 22 for end-to-end](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line)).\n\n> Modifies Bowtie2 parameters: `-L`", "default": 0 }, "bt2_trim5": { "type": "integer", "description": "Specify number of bases to trim off from 5' (left) end of read before alignment.", "fa_icon": "fas fa-cut", "help_text": "Number of bases to trim at the 5' (left) end of read prior to alignment.
May be useful when left-over sequencing artefacts of in-line barcodes are present. Default: 0.\n\n> Modifies Bowtie2 parameters: `--trim5`", "default": 0 }, "bt2_trim3": { "type": "integer", "description": "Specify number of bases to trim off from 3' (right) end of read before alignment.", "fa_icon": "fas fa-cut", "help_text": "Number of bases to trim at the 3' (right) end of read prior to alignment. May be useful when left-over sequencing artefacts of in-line barcodes are present. Default: 0.\n\n> Modifies Bowtie2 parameters: `--trim3`", "default": 0 }, "bt2_maxins": { "type": "integer", "default": 500, "fa_icon": "fas fa-exchange-alt", "description": "Specify the maximum fragment length for Bowtie2 paired-end mapping mode only.", "help_text": "The maximum fragment length for valid paired-end alignments. Only for paired-end mapping (i.e. unmerged), and therefore typically only useful for modern data.\n\n See [Bowtie2 documentation](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) for more information.\n\n> Modifies Bowtie2 parameters: `--maxins`" } }, "fa_icon": "fas fa-layer-group", "help_text": "If using TSV input, mapping is performed at the library level, i.e. after lane merging.\n" }, "host_removal": { "title": "Removal of Host-Mapped Reads", "type": "object", "description": "Options for production of host-read removed FASTQ files for privacy reasons.", "default": "", "properties": { "hostremoval_input_fastq": { "type": "boolean", "description": "Turn on per-library creation pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)", "fa_icon": "fas fa-power-off", "help_text": "Create pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)\n" }, "hostremoval_mode": { "type": "string", "default": "remove", "description": "Host removal mode.
Remove mapped reads completely from FASTQ (remove) or just mask mapped reads sequence by N (replace).", "fa_icon": "fas fa-mask", "help_text": "Read removal mode. Remove mapped reads completely (`'remove'`) or just replace mapped reads sequence by N (`'replace'`)\n\n> Modifies extract_map_reads.py parameter: `-m`", "enum": [ "strip", "replace", "remove" ] } }, "fa_icon": "fas fa-user-shield", "help_text": "These parameters are used for removing mapped reads from the original input\nFASTQ files, usually in the context of uploading the original FASTQ files to a\npublic read archive (NCBI SRA/EBI ENA/DDBJ SRA).\n\nThese flags will produce FASTQ files almost identical to your input files,\nexcept that reads with the same read ID as one found in the mapped bam file are\neither removed or 'masked' (every base replaced with Ns).\n\nThis functionality allows other researchers who wish to re-use\nyour data to apply their own adapter removal/read merging procedures, while\nmaintaining anonymity for sample donors - for example with microbiome\nresearch.\n\nIf using TSV input, stripping is performed per library, i.e. after lane merging." }, "bam_filtering": { "title": "BAM Filtering", "type": "object", "description": "Options for quality filtering and how to deal with off-target unmapped reads.", "default": "", "properties": { "run_bam_filtering": { "type": "boolean", "description": "Turn on filtering of mapping quality, read lengths, or unmapped reads of BAM files.", "fa_icon": "fas fa-power-off", "help_text": "Turns on the bam filtering module for either mapping quality filtering or unmapped read treatment.\n" }, "bam_mapping_quality_threshold": { "type": "integer", "description": "Minimum mapping quality for reads filter.", "fa_icon": "fas fa-greater-than-equal", "help_text": "Specify a mapping quality threshold for mapped reads to be kept for downstream analysis.
By default keeps all reads and is therefore set to `0` (basically doesn't filter anything).\n\n> Modifies samtools view parameter: `-q`", "default": 0 }, "bam_filter_minreadlength": { "type": "integer", "fa_icon": "fas fa-ruler-horizontal", "description": "Specify minimum read length to be kept after mapping.", "help_text": "Specify minimum length of mapped reads. This filtering will apply at the same time as mapping quality filtering.\n\nIf used _instead_ of minimum length read filtering at AdapterRemoval, this can be useful to get more realistic endogenous DNA percentages, when most of your reads are very short (e.g. in single-stranded libraries) and would otherwise be discarded by AdapterRemoval (thus making an artificially small denominator for a typical endogenous DNA calculation). Note in this context you should not perform mapping quality filtering nor discarding of unmapped reads to ensure a correct denominator of all reads, for the endogenous DNA calculation.\n\n> Modifies filter_bam_fragment_length.py parameter: `-l`", "default": 0 }, "bam_unmapped_type": { "type": "string", "default": "discard", "description": "Defines whether to discard all unmapped reads, keep them alongside mapped reads, or keep them separately in BAM and/or FASTQ format. Options: 'discard', 'keep', 'bam', 'fastq', 'both'.", "fa_icon": "fas fa-trash-alt", "help_text": "Defines how to proceed with unmapped reads: `'discard'` removes all unmapped reads, `'keep'` keeps both unmapped and mapped reads in the same BAM file, `'bam'` keeps unmapped reads as BAM file, `'fastq'` keeps unmapped reads as FastQ file, `'both'` keeps both BAM and FASTQ files. Default is `discard`.
`keep` is what would happen if `--run_bam_filtering` was _not_ supplied.\n\nNote that in all cases, if `--bam_mapping_quality_threshold` is also supplied, mapping quality filtering will still occur on the mapped reads.\n\n> Modifies samtools view parameter: `-f4 -F4`", "enum": [ "discard", "keep", "bam", "fastq", "both" ] } }, "fa_icon": "fas fa-sort-amount-down", "help_text": "Users can configure nf-core/eager to efficiently keep, discard or extract\ncertain groups of reads.\n\nIf using TSV input, filtering is performed per library, i.e. after lane merging.\n\nThis module utilises `samtools view` and `filter_bam_fragment_length.py`" }, "deduplication": { "title": "DeDuplication", "type": "object", "description": "Options for removal of PCR amplicon duplicates that can artificially inflate coverage.", "default": "", "properties": { "dedupper": { "type": "string", "default": "markduplicates", "description": "Deduplication method to use. Options: 'markduplicates', 'dedup'.", "fa_icon": "fas fa-object-group", "help_text": "Sets the duplicate read removal tool. By default uses `markduplicates` from Picard. Alternatively an ancient DNA specific read deduplication tool `dedup` ([Peltzer et al. 2016](http://dx.doi.org/10.1186/s13059-016-0918-z)) is offered.\n\nDeDup utilises both ends of paired-end data to remove duplicates (i.e. true exact duplicates, as markduplicates will over-zealously deduplicate anything with the same starting position even if the ends are different). DeDup should generally only be used on paired-end data, otherwise suboptimal deduplication can occur if applied to either single-end or a mix of single-end/paired-end data.\n", "enum": [ "markduplicates", "dedup" ] }, "dedup_all_merged": { "type": "boolean", "description": "Turn on treating all reads as merged reads.", "fa_icon": "fas fa-handshake", "help_text": "Sets DeDup to treat all reads as merged reads. This is useful if reads are for example not prefixed with `M_` in all cases. 
Therefore, this can be used as a workaround when also using a mixture of paired-end and single-end data, however this is not recommended (see above).\n\n> Modifies dedup parameter: `-m`" } }, "fa_icon": "fas fa-clone", "help_text": "If using TSV input, deduplication is performed per library, i.e. after lane merging." }, "library_complexity_analysis": { "title": "Library Complexity Analysis", "type": "object", "description": "Options for calculating library complexity (i.e. how many unique reads are present).", "default": "", "properties": { "preseq_mode": { "type": "string", "default": "c_curve", "description": "Specify which mode of preseq to run.", "fa_icon": "fas fa-toggle-on", "help_text": "Specify which mode of preseq to run.\n\nFrom the [PreSeq documentation](http://smithlabresearch.org/wp-content/uploads/manual.pdf): \n\n`c curve` is used to compute the expected complexity curve of a mapped read file with a hypergeometric\nformula\n\n`lc extrap` is used to generate the expected yield for theoretical larger experiments and bounds on the\nnumber of distinct reads in the library and the associated confidence intervals, which is computed by\nbootstrapping the observed duplicate counts histogram", "enum": [ "c_curve", "lc_extrap" ] }, "preseq_step_size": { "type": "integer", "default": 1000, "description": "Specify the step size of Preseq.", "fa_icon": "fas fa-shoe-prints", "help_text": "Can be used to configure the step size of Preseq's `c_curve` and `lc_extrap` method. 
Can be useful when only few and thus shallow sequencing results are used for extrapolation.\n\n> Modifies preseq c_curve and lc_extrap parameter: `-s`" }, "preseq_maxextrap": { "type": "integer", "default": 10000000000, "description": "Specify the maximum extrapolation (lc_extrap mode only)", "fa_icon": "fas fa-ban", "help_text": "Specify the maximum extrapolation that `lc_extrap` mode will perform.\n\n> Modifies preseq lc_extrap parameter: `-e`" }, "preseq_terms": { "type": "integer", "default": 100, "description": "Specify the maximum number of terms for extrapolation (lc_extrap mode only)", "fa_icon": "fas fa-sort-numeric-up-alt", "help_text": "Specify the maximum number of terms that `lc_extrap` mode will use.\n\n> Modifies preseq lc_extrap parameter: `-x`" }, "preseq_bootstrap": { "type": "integer", "default": 100, "description": "Specify number of bootstraps to perform (lc_extrap mode only)", "fa_icon": "fab fa-bootstrap", "help_text": "Specify the number of bootstraps `lc_extrap` mode will perform to calculate confidence intervals.\n\n> Modifies preseq lc_extrap parameter: `-n`" }, "preseq_cval": { "type": "number", "default": 0.95, "description": "Specify confidence interval level (lc_extrap mode only)", "fa_icon": "fas fa-check-circle", "help_text": "Specify the allowed level of confidence intervals used for `lc_extrap` mode.\n\n> Modifies preseq lc_extrap parameter: `-c`" } }, "fa_icon": "fas fa-bezier-curve", "help_text": "nf-core/eager uses Preseq on mapped reads as one method to calculate library\ncomplexity. If DeDup is used, Preseq uses the histogram output of DeDup,\notherwise the sorted non-duplicated BAM file is supplied. Furthermore, if\npaired-end read collapsing is not performed, the `-P` flag is used." 
}, "adna_damage_analysis": { "title": "(aDNA) Damage Analysis", "type": "object", "description": "Options for calculating and filtering for characteristic ancient DNA damage patterns.", "default": "", "properties": { "damage_calculation_tool": { "type": "string", "default": "damageprofiler", "description": "Specify the tool to use for damage calculation.", "fa_icon": "fas fa-tools", "help_text": "Specify the tool to be used for damage calculation. DamageProfiler is generally faster than mapDamage2, but the latter has an option to limit the number of reads used. This can significantly speed up the processing of very large files, where the damage estimates are already accurate after processing only a fraction of the input. Options: `damageprofiler`, `mapdamage`. By default, DamageProfiler is used.", "enum": [ "damageprofiler", "mapdamage" ] }, "damageprofiler_length": { "type": "integer", "default": 100, "description": "Specify length filter for DamageProfiler.", "fa_icon": "fas fa-sort-amount-up", "help_text": "Specifies the length filter for DamageProfiler. By default set to `100`.\n\n> Modifies DamageProfiler parameter: `-l`" }, "damageprofiler_threshold": { "type": "integer", "default": 15, "description": "Specify number of bases of each read to consider for DamageProfiler calculations.", "fa_icon": "fas fa-ruler-horizontal", "help_text": "Specifies the length of the read start and end to be considered for profile generation in DamageProfiler. By default set to `15` bases.\n\n> Modifies DamageProfiler parameter: `-t`" }, "damageprofiler_yaxis": { "type": "number", "default": 0.3, "description": "Specify the maximum misincorporation frequency that should be displayed on the damage plot. Set to 0 to 'autoscale'.", "fa_icon": "fas fa-ruler-vertical", "help_text": "Specifies what the maximum misincorporation frequency should be displayed as, in the DamageProfiler damage plot. This is set to `0.30` (i.e. 
30%) by default as this matches the popular [mapDamage2.0](https://ginolhac.github.io/mapDamage) program. However, the default behaviour of DamageProfiler is to 'autoscale' the y-axis maximum to zoom in on any _possible_ damage that may occur (e.g. if the damage is about 10%, the highest value on the y-axis would be set to 0.12). This 'autoscale' behaviour can be turned on by setting the value to `0`. Default: `0.30`.\n\n> Modifies DamageProfiler parameter: `-yaxis_damageplot`" }, "mapdamage_downsample": { "type": "integer", "default": 0, "description": "Specify the maximum number of reads to consider for damage calculation. Default value is `0` (i.e. no downsampling is performed).", "fa_icon": "fas fa-greater-than-equal", "help_text": "The maximum number of reads used for damage calculation in mapDamage2. Can be used to significantly reduce the amount of time required for damage assessment. Note that too low a value can produce incorrect results.\n\n> Modifies mapDamage2 parameter: `-n`" }, "mapdamage_yaxis": { "type": "number", "default": 0.3, "description": "Specify the maximum misincorporation frequency that should be displayed on the damage plot.", "fa_icon": "fas fa-ruler-vertical", "help_text": "Specifies what the maximum misincorporation frequency should be displayed as, in the mapDamage2 damage plot. This defaults to `0.30` (i.e. 30%).\n\n> Modifies mapDamage2 parameter: `-y`" }, "run_pmdtools": { "type": "boolean", "description": "Turn on PMDtools", "fa_icon": "fas fa-power-off", "help_text": "Specifies to run PMDTools for damage based read filtering and assessment of DNA damage in sequencing libraries. By default turned off.\n" }, "pmdtools_range": { "type": "integer", "default": 10, "description": "Specify range of bases for PMDTools to scan for damage.", "fa_icon": "fas fa-arrows-alt-h", "help_text": "Specifies the range in which to consider DNA damage from the ends of reads. 
By default set to `10`.\n\n> Modifies PMDTools parameter: `--range`" }, "pmdtools_threshold": { "type": "integer", "default": 3, "description": "Specify PMDScore threshold for PMDTools.", "fa_icon": "fas fa-chart-bar", "help_text": "Specifies the PMDScore threshold to use in the pipeline when filtering BAM files for DNA damage. Only reads which surpass this damage score are considered for downstream DNA analysis. By default set to `3` if not set specifically by the user.\n\n> Modifies PMDTools parameter: `--threshold`" }, "pmdtools_reference_mask": { "type": "string", "description": "Specify a bedfile to be used to mask the reference fasta prior to running pmdtools.", "fa_icon": "fas fa-mask", "help_text": "Activates masking of the reference fasta prior to running pmdtools. Positions that are in the provided bedfile will be replaced by Ns in the reference genome. This is useful for capture data, where you might not want the allele of a SNP to be counted as damage when it is a transition. Masking of the reference is done using `bedtools maskfasta`." }, "pmdtools_max_reads": { "type": "integer", "default": 10000, "description": "Specify the maximum number of reads to consider for metrics generation.", "fa_icon": "fas fa-greater-than-equal", "help_text": "The maximum number of reads used for damage assessment in PMDtools. Can be used to significantly reduce the amount of time required for damage assessment in PMDTools. Note that too low a value can produce incorrect results.\n\n> Modifies PMDTools parameter: `-n`" }, "pmdtools_platypus": { "type": "boolean", "description": "Append big list of base frequencies for platypus to output.", "fa_icon": "fas fa-power-off", "help_text": "Enables the printing of a wider list of base frequencies used by platypus as an addition to the output base misincorporation frequency table. 
By default turned off.\n" }, "run_mapdamage_rescaling": { "type": "boolean", "fa_icon": "fas fa-map", "description": "Turn on damage rescaling of BAM files using mapDamage2 to probabilistically remove damage.", "help_text": "Turns on mapDamage2's BAM rescaling functionality. This probabilistically replaces Ts back to Cs depending on the likelihood this reference-mismatch was originally caused by damage. If the library is specified to be single stranded, this will automatically use the `--single-stranded` mode.\n\nThis functionality does not have any MultiQC output.\n\n:warning: rescaled libraries will not be merged with non-scaled libraries of the same sample for downstream genotyping, as the model may be different for each library. If you wish to merge these, please do this manually and re-run nf-core/eager using the merged BAMs as input. \n\n> Modifies the `--rescale` parameter of mapDamage2" }, "rescale_seqlength": { "type": "integer", "default": 12, "fa_icon": "fas fa-ruler-horizontal", "description": "Length of read sequence to use from each side for rescaling. Can be overridden by --rescale_length_*p.", "help_text": "Specify the length from the end of the read that mapDamage should rescale at both ends.\n\n> Modifies the `--seq-length` parameter of mapDamage2." }, "rescale_length_5p": { "type": "integer", "default": 0, "fa_icon": "fas fa-balance-scale-right", "description": "Length of read for mapDamage2 to rescale from 5p end. Only used if not 0, otherwise --rescale_seqlength is used.", "help_text": "Specify the length from the end of the read that mapDamage should rescale. Overrides `--rescale_seqlength`.\n\n> Modifies the `--rescale-length-5p` parameter of mapDamage2." }, "rescale_length_3p": { "type": "integer", "default": 0, "fa_icon": "fas fa-balance-scale-left", "description": "Length of read for mapDamage2 to rescale from 3p end. 
Only used if not 0, otherwise --rescale_seqlength is used.", "help_text": "Specify the length from the end of the read that mapDamage should rescale.\n\n> Modifies the `--rescale-length-3p` parameter of mapDamage2." } }, "fa_icon": "fas fa-chart-line", "help_text": "More documentation can be seen in the following links:\n\n- [DamageProfiler](https://github.com/Integrative-Transcriptomics/DamageProfiler)\n- [PMDTools documentation](https://github.com/pontussk/PMDtools)\n\nIf using TSV input, DamageProfiler is performed per library, i.e. after lane\nmerging. PMDtools and BAM Trimming are run after library merging of same-named\nlibrary BAMs that have the same type of UDG treatment. BAM Trimming is only\nperformed on non-UDG and half-UDG treated data.\n" }, "feature_annotation_statistics": { "title": "Feature Annotation Statistics", "type": "object", "description": "Options for getting reference annotation statistics (e.g. gene coverages)", "default": "", "properties": { "run_bedtools_coverage": { "type": "boolean", "description": "Turn on ability to calculate no. reads, depth and breadth coverage of features in reference.", "fa_icon": "fas fa-chart-area", "help_text": "Specifies to turn on the bedtools module, producing statistics for breadth (or percent coverage), and depth (or X fold) coverages.\n" }, "anno_file": { "type": "string", "description": "Path to GFF or BED file containing positions of features in reference file (--fasta). Path should be enclosed in quotes.", "fa_icon": "fas fa-file-signature", "help_text": "Specify the path to a GFF/BED containing the feature coordinates (or any acceptable input for [`bedtools coverage`](https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html)). 
Must be in quotes.\n" }, "anno_file_is_unsorted": { "type": "boolean", "fa_icon": "fas fa-random", "description": "Specify if the annotation file provided to --anno_file is not sorted in the same way as the reference fasta file.", "help_text": "In cases where the annotation file is NOT sorted the same way as the reference fasta, this option should be specified. This will significantly increase the memory usage of bedtools!\n\n> Modifies bedtools parameter: `-sorted`" } }, "fa_icon": "fas fa-scroll", "help_text": "If you're interested in looking at coverage stats for certain features on your\nreference such as genes, SNPs etc., you can use the following bedtools module\nfor this purpose.\n\nMore documentation on bedtools can be seen in the [bedtools\ndocumentation](https://bedtools.readthedocs.io/en/latest/)\n\nIf using TSV input, bedtools is run after library merging of same-named library\nBAMs that have the same type of UDG treatment.\n" }, "bam_trimming": { "title": "BAM Trimming", "type": "object", "description": "Options for trimming of aligned reads (e.g. to remove damage prior genotyping).", "default": "", "properties": { "run_trim_bam": { "type": "boolean", "description": "Turn on BAM trimming. Will only run on non-UDG or half-UDG libraries", "fa_icon": "fas fa-power-off", "help_text": "Turns on the BAM trimming method. Trims off `[n]` bases from reads in the deduplicated BAM file. Damage assessment in PMDTools or DamageProfiler remains untouched, as data is routed through this independently. BAM trimming is typically performed to reduce errors during genotyping that can be caused by aDNA damage.\n\nBAM trimming will only be performed on libraries indicated as `--udg_type 'none'` or `--udg_type 'half'`. Complete UDG treatment ('full') should have removed all damage. 
The number of bases that will be trimmed off can be set separately for double-stranded and single-stranded libraries with `--udg_type` `'none'` and `'half'` (see the `--bamutils_clip_double_stranded_*` and `--bamutils_clip_single_stranded_*` parameters).\n\n> Note: additional artefacts such as bar-codes or adapters that could potentially also be trimmed should be removed prior to mapping." }, "bamutils_clip_double_stranded_half_udg_left": { "type": "integer", "default": 0, "fa_icon": "fas fa-ruler-combined", "description": "Specify the number of bases to clip off reads from 'left' end of read for double-stranded half-UDG libraries.", "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from double_stranded libraries whose UDG treatment is set to `half`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`" }, "bamutils_clip_double_stranded_half_udg_right": { "type": "integer", "default": 0, "fa_icon": "fas fa-ruler", "description": "Specify the number of bases to clip off reads from 'right' end of read for double-stranded half-UDG libraries.", "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from double_stranded libraries whose UDG treatment is set to `half`. 
Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`" }, "bamutils_clip_double_stranded_none_udg_left": { "type": "integer", "default": 0, "fa_icon": "fas fa-ruler-combined", "description": "Specify the number of bases to clip off reads from 'left' end of read for double-stranded non-UDG libraries.", "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from double_stranded libraries whose UDG treatment is set to `none`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`" }, "bamutils_clip_double_stranded_none_udg_right": { "type": "integer", "default": 0, "fa_icon": "fas fa-ruler", "description": "Specify the number of bases to clip off reads from 'right' end of read for double-stranded non-UDG libraries.", "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from double_stranded libraries whose UDG treatment is set to `none`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`" }, "bamutils_clip_single_stranded_half_udg_left": { "type": "integer", "default": 0, "fa_icon": "fas fa-ruler-combined", "description": "Specify the number of bases to clip off reads from 'left' end of read for single-stranded half-UDG libraries.", "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from single-stranded libraries whose UDG treatment is set to `half`. 
Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`" }, "bamutils_clip_single_stranded_half_udg_right": { "type": "integer", "default": 0, "fa_icon": "fas fa-ruler", "description": "Specify the number of bases to clip off reads from 'right' end of read for single-stranded half-UDG libraries.", "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from single-stranded libraries whose UDG treatment is set to `half`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`" }, "bamutils_clip_single_stranded_none_udg_left": { "type": "integer", "default": 0, "fa_icon": "fas fa-ruler-combined", "description": "Specify the number of bases to clip off reads from 'left' end of read for single-stranded non-UDG libraries.", "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from single-stranded libraries whose UDG treatment is set to `none`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`" }, "bamutils_clip_single_stranded_none_udg_right": { "type": "integer", "default": 0, "fa_icon": "fas fa-ruler", "description": "Specify the number of bases to clip off reads from 'right' end of read for single-stranded non-UDG libraries.", "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from single-stranded libraries whose UDG treatment is set to `none`. 
Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`" }, "bamutils_softclip": { "type": "boolean", "description": "Turn on using softclip instead of hard masking.", "fa_icon": "fas fa-paint-roller", "help_text": "By default, nf-core/eager uses hard clipping and sets clipped bases to `N` with quality `!` in the BAM output. Turn this on to use soft-clipping instead, masking reads at the read ends respectively using the CIGAR string.\n\n> Modifies bamUtil's trimBam parameter: `-c`" } }, "fa_icon": "fas fa-eraser", "help_text": "For some library preparation protocols, users might want to clip off damaged\nbases before applying genotyping methods. This can be done in nf-core/eager\nautomatically by turning on the `--run_trim_bam` parameter.\n\nMore documentation can be seen in the [bamUtil\ndocumentation](https://genome.sph.umich.edu/wiki/BamUtil:_trimBam)\n" }, "genotyping": { "title": "Genotyping", "type": "object", "description": "Options for variant calling.", "default": "", "properties": { "run_genotyping": { "type": "boolean", "description": "Turn on genotyping of BAM files.", "fa_icon": "fas fa-power-off", "help_text": "Turns on genotyping to run on all post-dedup and downstream BAMs. For example if `--run_pmdtools` and `--run_trim_bam` are both supplied, the genotyper will be run on all three BAM files i.e. post-deduplication, post-pmd and post-trimmed BAM files." }, "genotyping_tool": { "type": "string", "description": "Specify which genotyper to use: GATK UnifiedGenotyper, GATK HaplotypeCaller, FreeBayes, pileupCaller or ANGSD. Options: 'ug', 'hc', 'freebayes', 'pileupcaller', 'angsd'.", "fa_icon": "fas fa-tools", "help_text": "Specifies which genotyper to use. Current options are: GATK (v3.5) UnifiedGenotyper, GATK (v4) HaplotypeCaller, FreeBayes, pileupCaller and ANGSD. 
Specify 'ug', 'hc', 'freebayes', 'pileupcaller' and 'angsd' respectively.\n\n> > Note that while UnifiedGenotyper is more suitable for low-coverage ancient DNA (HaplotypeCaller does _de novo_ assembly around each variant site), be aware that GATK 3.5 is officially deprecated by the Broad Institute.", "enum": [ "ug", "hc", "freebayes", "pileupcaller", "angsd" ] }, "genotyping_source": { "type": "string", "default": "raw", "description": "Specify which input BAM to use for genotyping. Options: 'raw', 'trimmed', 'pmd' or 'rescaled'.", "fa_icon": "fas fa-faucet", "help_text": "Indicates which BAM file to use for genotyping, depending on what BAM processing modules you have turned on. Options are: `'raw'` for mapped only, filtered, or DeDup BAMs (with priority right to left); `'trimmed'` (for base clipped BAMs); `'pmd'` (for pmdtools output); `'rescaled'` (for mapDamage2 rescaling output). Default is: `'raw'`.\n", "enum": [ "raw", "pmd", "trimmed", "rescaled" ] }, "gatk_call_conf": { "type": "integer", "default": 30, "description": "Specify GATK phred-scaled confidence threshold.", "fa_icon": "fas fa-balance-scale-right", "help_text": "If selected, specify a GATK genotyper phred-scaled confidence threshold of a given SNP/INDEL call. Default: `30`\n\n> Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter: `-stand_call_conf`" }, "gatk_ploidy": { "type": "integer", "default": 2, "description": "Specify GATK organism ploidy.", "fa_icon": "fas fa-pastafarianism", "help_text": "If selected, specify a GATK genotyper ploidy value of your reference organism. E.g. if you want to allow heterozygous calls from >= diploid organisms. 
Default: `2`\n\n> Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter: `--sample-ploidy`" }, "gatk_downsample": { "type": "integer", "default": 250, "description": "Maximum depth coverage allowed for genotyping before down-sampling is turned on.", "fa_icon": "fas fa-icicles", "help_text": "Maximum depth coverage allowed for genotyping before down-sampling is turned on. Any position with a coverage higher than this value will be randomly down-sampled to 250 reads. Default: `250`\n\n> Modifies GATK UnifiedGenotyper parameter: `-dcov`" }, "gatk_dbsnp": { "type": "string", "description": "Specify VCF file for SNP annotation of output VCF files. Optional. Gzip not accepted.", "fa_icon": "fas fa-marker", "help_text": "(Optional) Specify VCF file for output VCF SNP annotation e.g. if you want to annotate your VCF file with 'rs' SNP IDs. Check GATK documentation for more information. Gzip not accepted.\n" }, "gatk_hc_out_mode": { "type": "string", "default": "EMIT_VARIANTS_ONLY", "description": "Specify GATK output mode. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_ACTIVE_SITES'.", "fa_icon": "fas fa-bullhorn", "help_text": "If the GATK genotyper HaplotypeCaller is selected, what type of VCF to create, i.e. produce calls for every site or just confidence sites. Options: `'EMIT_VARIANTS_ONLY'`, `'EMIT_ALL_CONFIDENT_SITES'`, `'EMIT_ALL_ACTIVE_SITES'`. Default: `'EMIT_VARIANTS_ONLY'`\n\n> Modifies GATK HaplotypeCaller parameter: `-output_mode`", "enum": [ "EMIT_ALL_ACTIVE_SITES", "EMIT_ALL_CONFIDENT_SITES", "EMIT_VARIANTS_ONLY" ] }, "gatk_hc_emitrefconf": { "type": "string", "default": "GVCF", "description": "Specify HaplotypeCaller mode for emitting reference confidence calls . Options: 'NONE', 'BP_RESOLUTION', 'GVCF'.", "fa_icon": "fas fa-bullhorn", "help_text": "If the GATK HaplotypeCaller is selected, mode for emitting reference confidence calls. Options: `'NONE'`, `'BP_RESOLUTION'`, `'GVCF'`. 
Default: `'GVCF'`\n\n> Modifies GATK HaplotypeCaller parameter: `--emit-ref-confidence`\n", "enum": [ "NONE", "GVCF", "BP_RESOLUTION" ] }, "gatk_ug_out_mode": { "type": "string", "default": "EMIT_VARIANTS_ONLY", "description": "Specify GATK output mode. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_SITES'.", "fa_icon": "fas fa-bullhorn", "help_text": "If the GATK UnifiedGenotyper is selected, what type of VCF to create, i.e. produce calls for every site or just confident sites. Options: `'EMIT_VARIANTS_ONLY'`, `'EMIT_ALL_CONFIDENT_SITES'`, `'EMIT_ALL_SITES'`. Default: `'EMIT_VARIANTS_ONLY'`\n\n> Modifies GATK UnifiedGenotyper parameter: `--output_mode`", "enum": [ "EMIT_ALL_SITES", "EMIT_ALL_CONFIDENT_SITES", "EMIT_VARIANTS_ONLY" ] }, "gatk_ug_genotype_model": { "type": "string", "default": "SNP", "description": "Specify UnifiedGenotyper likelihood model. Options: 'SNP', 'INDEL', 'BOTH', 'GENERALPLOIDYSNP', 'GENERALPLOIDYINDEL'.", "fa_icon": "fas fa-project-diagram", "help_text": "If the GATK UnifiedGenotyper is selected, which likelihood model to follow, i.e. whether to call SNPs or INDELs etc. Options: `'SNP'`, `'INDEL'`, `'BOTH'`, `'GENERALPLOIDYSNP'`, `'GENERALPLOIDYINDEL'`. Default: `'SNP'`\n\n> Modifies GATK UnifiedGenotyper parameter: `--genotype_likelihoods_model`", "enum": [ "SNP", "INDEL", "BOTH", "GENERALPLOIDYSNP", "GENERALPLOIDYINDEL" ] }, "gatk_ug_keep_realign_bam": { "type": "boolean", "description": "Specify to keep the BAM output of re-alignment around variants from GATK UnifiedGenotyper.", "fa_icon": "fas fa-align-left", "help_text": "If provided when running GATK's UnifiedGenotyper, this will put into the output folder the BAMs that have realigned reads (with GATK's (v3) IndelRealigner) around possible variants for improved genotyping.\n\nThese BAMs will be stored in the same folder as the corresponding VCF files." 
}, "gatk_ug_defaultbasequalities": { "type": "string", "description": "Supply a default base quality if a read is missing a base quality score. Setting to -1 turns this off.", "fa_icon": "fas fa-undo-alt", "help_text": "When running GATK's UnifiedGenotyper, specify a value to set base quality scores, if reads are missing this information. Might be useful if you have 'synthetically' generated reads (e.g. chopping up a reference genome). Default is set to -1 which is to not set any default quality (turned off). Default: `-1`\n\n> Modifies GATK UnifiedGenotyper parameter: `--defaultBaseQualities`" }, "freebayes_C": { "type": "integer", "default": 1, "description": "Specify minimum required supporting observations to consider a variant.", "fa_icon": "fas fa-align-center", "help_text": "Specify minimum required supporting observations to consider a variant. Default: `1`\n\n> Modifies freebayes parameter: `-C`" }, "freebayes_g": { "type": "integer", "description": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than the value specified by this parameter.", "fa_icon": "fab fa-think-peaks", "help_text": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than the specified value. Not set by default.\n\n> Modifies freebayes parameter: `-g`", "default": 0 }, "freebayes_p": { "type": "integer", "default": 2, "description": "Specify ploidy of sample in FreeBayes.", "fa_icon": "fas fa-pastafarianism", "help_text": "Specify ploidy of sample in FreeBayes. Default is diploid. 
Default: `2`\n\n> Modifies freebayes parameter: `-p`" }, "pileupcaller_bedfile": { "type": "string", "description": "Specify path to SNP panel in bed format for pileupCaller.", "fa_icon": "fas fa-bed", "help_text": "Specify a SNP panel in the form of a bed file of sites at which to generate pileup for pileupCaller.\n" }, "pileupcaller_snpfile": { "type": "string", "description": "Specify path to SNP panel in EIGENSTRAT format for pileupCaller.", "fa_icon": "fas fa-sliders-h", "help_text": "Specify a SNP panel in [EIGENSTRAT](https://github.com/DReichLab/EIG/tree/master/CONVERTF) format, pileupCaller will call these sites.\n" }, "pileupcaller_method": { "type": "string", "default": "randomHaploid", "description": "Specify calling method to use. Options: 'randomHaploid', 'randomDiploid', 'majorityCall'.", "fa_icon": "fas fa-toolbox", "help_text": "Specify calling method to use. Options: randomHaploid, randomDiploid, majorityCall. Default: `'randomHaploid'`\n\n> Modifies pileupCaller parameter: `--randomHaploid --randomDiploid --majorityCall`", "enum": [ "randomHaploid", "randomDiploid", "majorityCall" ] }, "pileupcaller_transitions_mode": { "type": "string", "default": "AllSites", "description": "Specify the calling mode for transitions. Options: 'AllSites', 'TransitionsMissing', 'SkipTransitions'.", "fa_icon": "fas fa-toggle-on", "help_text": "Specify if genotypes of transition SNPs should be called, set to missing, or excluded from the genotypes respectively. Options: `'AllSites'`, `'TransitionsMissing'`, `'SkipTransitions'`. Default: `'AllSites'`\n\n> Modifies pileupCaller parameter: `--skipTransitions --transitionsMissing`", "enum": [ "AllSites", "TransitionsMissing", "SkipTransitions" ] }, "pileupcaller_min_map_quality": { "type": "integer", "default": 30, "description": "The minimum mapping quality to be used for genotyping.", "fa_icon": "fas fa-filter", "help_text": "The minimum mapping quality to be used for genotyping. 
Affects the `samtools mpileup` output that is used by pileupCaller. Affects the `-q` parameter of samtools mpileup." }, "pileupcaller_min_base_quality": { "type": "integer", "default": 30, "description": "The minimum base quality to be used for genotyping.", "fa_icon": "fas fa-filter", "help_text": "The minimum base quality to be used for genotyping. Affects the `samtools mpileup` output that is used by pileupCaller. Affects the `-Q` parameter of samtools mpileup." }, "angsd_glmodel": { "type": "string", "default": "samtools", "description": "Specify which ANGSD genotyping likelihood model to use. Options: 'samtools', 'gatk', 'soapsnp', 'syk'.", "fa_icon": "fas fa-project-diagram", "help_text": "Specify which genotype likelihood model to use. Options: `'samtools'`, `'gatk'`, `'soapsnp'`, `'syk'`. Default: `'samtools'`\n\n> Modifies ANGSD parameter: `-GL`", "enum": [ "samtools", "gatk", "soapsnp", "syk" ] }, "angsd_glformat": { "type": "string", "default": "binary", "description": "Specify which output type to output ANGSD genotyping likelihood results: Options: 'text', 'binary', 'binary_three', 'beagle'.", "fa_icon": "fas fa-text-height", "help_text": "Specifies what type of genotyping likelihood file format will be output. Options: `'text'`, `'binary'`, `'binary_three'`, `'beagle'`. 
Default: `'binary'`.\n\nThe options refer to the following descriptions respectively:\n\n- `text`: text output of all 10 log genotype likelihoods.\n- `binary`: binary all 10 log genotype likelihood\n- `binary_three`: binary 3 times likelihood\n- `beagle`: beagle likelihood file\n\nSee the [ANGSD documentation](http://www.popgen.dk/angsd/) for more information on which to select for your downstream applications.\n\n> Modifies ANGSD parameter: `-doGlF`", "enum": [ "text", "binary", "binary_three", "beagle" ] }, "angsd_createfasta": { "type": "boolean", "description": "Turn on creation of FASTA from ANGSD genotyping likelihood.", "fa_icon": "fas fa-align-justify", "help_text": "Turns on the ANGSD creation of a FASTA file from the BAM file.\n" }, "angsd_fastamethod": { "type": "string", "default": "random", "description": "Specify which genotype type of 'base calling' to use for ANGSD FASTA generation. Options: 'random', 'common'.", "fa_icon": "fas fa-toolbox", "help_text": "The type of base calling to be performed when creating the ANGSD FASTA file. Options: `'random'` or `'common'`. 'common' will output the most common non-N base at each given position, whereas 'random' will pick one at random. Default: `'random'`.\n\n> Modifies ANGSD parameter: `-doFasta -doCounts`", "enum": [ "random", "common" ] }, "run_bcftools_stats": { "type": "boolean", "default": true, "description": "Turn on bcftools stats generation for VCF based variant calling statistics", "help_text": "Runs `bcftools stats` against VCF files from GATK and FreeBayes genotypers.\n\nIt will automatically include the FASTA reference for INDEL-related statistics.", "fa_icon": "far fa-chart-bar" } }, "fa_icon": "fas fa-sliders-h", "help_text": "There are options for different genotypers (or genotype likelihood calculators)\nto be used. 
We suggest you read the documentation of each tool to find the ones that\nsuit your needs.\n\nDocumentation for each tool:\n\n- [GATK\n UnifiedGenotyper](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.5-0/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php)\n- [GATK\n HaplotypeCaller](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php)\n- [FreeBayes](https://github.com/ekg/freebayes)\n- [ANGSD](http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods)\n- [sequenceTools pileupCaller](https://github.com/stschiff/sequenceTools)\n\nIf using TSV input, genotyping is performed per sample (i.e. after all types of\nlibraries are merged), except for pileupCaller which gathers all double-stranded and\nsingle-stranded (same-type merged) libraries respectively." }, "consensus_sequence_generation": { "title": "Consensus Sequence Generation", "type": "object", "description": "Options for creation of a per-sample FASTA sequence useful for downstream analysis (e.g. multi sequence alignment)", "default": "", "properties": { "run_vcf2genome": { "type": "boolean", "description": "Turns on ability to create a consensus sequence FASTA file based on a UnifiedGenotyper VCF file and the original reference (only considers SNPs).", "fa_icon": "fas fa-power-off", "help_text": "Turn on consensus sequence genome creation via VCF2Genome. Only accepts GATK UnifiedGenotyper VCF files with the `--gatk_ug_out_mode 'EMIT_ALL_SITES'` and `--gatk_ug_genotype_model 'SNP'` flags. 
Typically useful for small genomes such as mitochondria.\n" }, "vcf2genome_outfile": { "type": "string", "description": "Specify the name of the output FASTA file containing the consensus sequence.", "fa_icon": "fas fa-file-alt", "help_text": "The output FASTA file will be named `_.fasta`.\n" }, "vcf2genome_header": { "type": "string", "description": "Specify the header name of the consensus sequence entry within the FASTA file.", "fa_icon": "fas fa-heading", "help_text": "The name of the FASTA entry you would like in your FASTA file.\n" }, "vcf2genome_minc": { "type": "integer", "default": 5, "description": "Minimum depth coverage required for a call to be included (else N will be called).", "fa_icon": "fas fa-sort-amount-up", "help_text": "Minimum depth coverage for a SNP to be made. Else, a SNP will be called as N. Default: `5`\n\n> Modifies VCF2Genome parameter: `-minc`" }, "vcf2genome_minq": { "type": "integer", "default": 30, "description": "Minimum genotyping quality of a call to be called. Else N will be called.", "fa_icon": "fas fa-medal", "help_text": "Minimum genotyping quality of a call to be made. Else N will be called. Default: `30`\n\n> Modifies VCF2Genome parameter: `-minq`" }, "vcf2genome_minfreq": { "type": "number", "default": 0.8, "description": "Minimum fraction of reads supporting a call to be included. Else N will be called.", "fa_icon": "fas fa-percent", "help_text": "In the case of two possible alleles, the frequency of the majority allele required for a call to be made. Else, a SNP will be called as N. Default: `0.8`\n\n> Modifies VCF2Genome parameter: `-minfreq`" } }, "fa_icon": "fas fa-handshake", "help_text": "If using TSV input, consensus generation is performed per sample (i.e. after all\ntypes of libraries are merged)." }, "snp_table_generation": { "title": "SNP Table Generation", "type": "object", "description": "Options for creation of a SNP table useful for downstream analysis (e.g. 
estimation of cross-mapping of different species and multi-sequence alignment)", "default": "", "properties": { "run_multivcfanalyzer": { "type": "boolean", "description": "Turn on MultiVCFAnalyzer. Note: This currently only supports diploid GATK UnifiedGenotyper input.", "fa_icon": "fas fa-power-off", "help_text": "Turns on MultiVCFAnalyzer. Will only work when in combination with UnifiedGenotyper genotyping module.\n" }, "write_allele_frequencies": { "type": "boolean", "description": "Turn on writing of allele frequencies in the SNP table.", "fa_icon": "fas fa-pen", "help_text": "Specify whether to tell MultiVCFAnalyzer to write within the SNP table the frequencies of the allele at that position e.g. A (70%).\n" }, "min_genotype_quality": { "type": "integer", "default": 30, "description": "Specify the minimum genotyping quality threshold for a SNP to be called.", "fa_icon": "fas fa-medal", "help_text": "The minimal genotyping quality for a SNP to be considered for processing by MultiVCFAnalyzer. The default threshold is `30`.\n" }, "min_base_coverage": { "type": "integer", "default": 5, "description": "Specify the minimum number of reads a position needs to be covered to be considered for base calling.", "fa_icon": "fas fa-sort-amount-up", "help_text": "The minimal number of reads covering a base for a SNP at that position to be considered for processing by MultiVCFAnalyzer. The default depth is `5`.\n" }, "min_allele_freq_hom": { "type": "number", "default": 0.9, "description": "Specify the minimum allele frequency that a base requires to be considered a 'homozygous' call.", "fa_icon": "fas fa-percent", "help_text": "The minimal frequency of a nucleotide for a 'homozygous' SNP to be called. In other words, e.g. 90% of the reads covering that position must have that SNP to be called. If the threshold is not reached, and the previous two parameters are matched, a reference call is made (displayed as . in the SNP table). 
If the above two parameters are not met, an 'N' is called. The default allele frequency is `0.9`.\n" }, "min_allele_freq_het": { "type": "number", "default": 0.9, "description": "Specify the minimum allele frequency that a base requires to be considered a 'heterozygous' call.", "fa_icon": "fas fa-percent", "help_text": "The minimum frequency of a nucleotide for a 'heterozygous' SNP to be called. If\nthis parameter is set to the same as `--min_allele_freq_hom`, then only\nhomozygous calls are made. If this value is less than the previous parameter,\nthen a SNP call will be made. If it is between this and the previous parameter,\nit will be displayed as an IUPAC uncertainty call. Default is `0.9`." }, "additional_vcf_files": { "type": "string", "description": "Specify paths to additional pre-made VCF files to be included in the SNP table generation. Use wildcard(s) for multiple files.", "fa_icon": "fas fa-copy", "help_text": "If you wish to add previously created VCF files to the table, specify here a path with wildcards (in quotes). These VCF files must be created the same way as your settings for the [GATK UnifiedGenotyper](#genotyping-parameters) module above." }, "reference_gff_annotations": { "type": "string", "default": "NA", "description": "Specify path to the reference genome annotations in '.gff' format. Optional.", "fa_icon": "fas fa-file-signature", "help_text": "If you wish to report in the SNP table annotation information for the regions\nSNPs fall in, provide a file in GFF format (the path must be in quotes).\n" }, "reference_gff_exclude": { "type": "string", "default": "NA", "description": "Specify path to the positions to be excluded in '.gff' format. 
Optional.", "fa_icon": "fas fa-times", "help_text": "If you wish to exclude SNP regions from consideration by MultiVCFAnalyzer (such as for problematic regions), provide a file in GFF format (the path must be in quotes).\n" }, "snp_eff_results": { "type": "string", "default": "NA", "description": "Specify path to the output file from SNP effect analysis in '.txt' format. Optional.", "fa_icon": "fas fa-magic", "help_text": "If you wish to include results from SNPEff effect analysis, supply the output\nfrom SNPEff in txt format (the path must be in quotes)." } }, "fa_icon": "fas fa-table", "help_text": "SNP Table Generation here is performed by MultiVCFAnalyzer. The current version\nof MultiVCFAnalyzer only accepts GATK UnifiedGenotyper 3.5 VCF files,\nand only when the ploidy was set to 2 (this allows MultiVCFAnalyzer to report\nfrequencies of polymorphic positions). A description of how the tool works can\nbe seen in the Supplementary Information of [Bos et al.\n(2014)](https://doi.org/10.1038/nature13591) under \"SNP Calling and Phylogenetic\nAnalysis\".\n\nMore can be seen in the [MultiVCFAnalyzer\ndocumentation](https://github.com/alexherbig/MultiVCFAnalyzer).\n\nIf using TSV input, MultiVCFAnalyzer is performed on all samples gathered\ntogether." 
}, "mitochondrial_to_nuclear_ratio": { "title": "Mitochondrial to Nuclear Ratio", "type": "object", "description": "Options for the calculation of ratio of reads to one chromosome/FASTA entry against all others.", "default": "", "properties": { "run_mtnucratio": { "type": "boolean", "description": "Turn on mitochondrial to nuclear ratio calculation.", "fa_icon": "fas fa-balance-scale-left", "help_text": "Turn on the module to estimate the ratio of mitochondrial to nuclear reads.\n" }, "mtnucratio_header": { "type": "string", "default": "MT", "description": "Specify the name of the reference FASTA entry corresponding to the mitochondrial genome (up to the first space).", "fa_icon": "fas fa-heading", "help_text": "Specify the FASTA entry in the reference file specified as `--fasta`, which acts\nas the mitochondrial 'chromosome' to base the ratio calculation on. The tool\nonly accepts the first section of the header before the first space. The default\nchromosome name is based on hs37d5/GRCh37 human reference genome. Default: 'MT'" } }, "fa_icon": "fas fa-balance-scale-left", "help_text": "If using TSV input, the Mitochondrial to Nuclear Ratio is calculated per\ndeduplicated library (after lane merging)" }, "human_sex_determination": { "title": "Human Sex Determination", "type": "object", "description": "Options for the calculation of biological sex of human individuals.", "default": "", "properties": { "run_sexdeterrmine": { "type": "boolean", "description": "Turn on sex determination for human reference genomes. This will run on single- and double-stranded variants of a library separately.", "fa_icon": "fas fa-transgender-alt", "help_text": "Specify to run the optional process of sex determination.\n" }, "sexdeterrmine_bedfile": { "type": "string", "description": "Specify path to SNP panel in bed format for error bar calculation. 
Optional (see documentation).", "fa_icon": "fas fa-bed", "help_text": "Specify an optional bedfile of the list of SNPs to be used for X-/Y-rate calculation. Running without this parameter will considerably increase runtime, and render the resulting error bars untrustworthy. Theoretically, any set of SNPs that are distant enough that two SNPs are unlikely to be covered by the same read can be used here. The programme was coded with the 1240K panel in mind. The path must be in quotes." } }, "fa_icon": "fas fa-transgender", "help_text": "An optional process for human DNA. It can be used to calculate the relative\ncoverage of X and Y chromosomes compared to the autosomes (X-/Y-rate). Standard\nerrors for these measurements are also calculated, assuming a binomial\ndistribution of reads across the SNPs.\n\nIf using TSV input, SexDetERRmine is performed on all samples gathered together." }, "nuclear_contamination_for_human_dna": { "title": "Nuclear Contamination for Human DNA", "type": "object", "description": "Options for the estimation of contamination of human DNA.", "default": "", "properties": { "run_nuclear_contamination": { "type": "boolean", "description": "Turn on nuclear contamination estimation for human reference genomes.", "fa_icon": "fas fa-power-off", "help_text": "Specify to run the optional processes for (human) nuclear DNA contamination estimation.\n" }, "contamination_chrom_name": { "type": "string", "default": "X", "description": "The name of the X chromosome in your bam/FASTA header. 'X' for hs37d5, 'chrX' for HG19.", "fa_icon": "fas fa-address-card", "help_text": "The name of the human chromosome X in your bam. `'X'` for hs37d5, `'chrX'` for HG19. Defaults to `'X'`." 
} }, "fa_icon": "fas fa-radiation-alt" }, "metagenomic_screening": { "title": "Metagenomic Screening", "type": "object", "description": "Options for metagenomic screening of off-target reads.", "default": "", "properties": { "metagenomic_complexity_filter": { "type": "boolean", "description": "Turn on removal of low-sequence complexity reads for metagenomic screening with bbduk", "help_text": "Turns on low-sequence complexity filtering of off-target reads using `bbduk`.\n\nThis is typically performed to reduce the number of uninformative reads or potential false-positive reads, typically for input for metagenomic screening. This thus reduces false positive species IDs and also run-time and resource requirements.\n\nSee `--metagenomic_complexity_entropy` for how complexity is calculated. **Important** There are no MultiQC output results for this module, you must check the number of reads removed with the `_bbduk.stats` output file.\n\nDefault: off\n", "fa_icon": "fas fa-filter" }, "metagenomic_complexity_entropy": { "type": "number", "default": 0.3, "description": "Specify the entropy threshold under which a sequencing read will be complexity filtered out. This should be between 0-1.", "minimum": 0, "maximum": 1, "help_text": "Specify the minimum entropy threshold under which a read will be _removed_ from the FASTQ file that goes into metagenomic screening. 
\n\nA mono-nucleotide read such as GGGGGG will have an entropy of 0, a completely random sequence has an entropy of almost 1.\n\nSee the `bbduk` [documentation](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/-filter) on entropy for more information.\n\n> Modifies `bbduk` parameter `entropy=`", "fa_icon": "fas fa-percent" }, "run_metagenomic_screening": { "type": "boolean", "description": "Turn on metagenomic screening module for reference-unmapped reads.", "fa_icon": "fas fa-power-off", "help_text": "Turn on the metagenomic screening module.\n" }, "metagenomic_tool": { "type": "string", "description": "Specify which classifier to use. Options: 'malt', 'kraken'.", "fa_icon": "fas fa-tools", "help_text": "Specify which taxonomic classifier to use. There are two options available:\n\n- `kraken` for [Kraken2](https://ccb.jhu.edu/software/kraken2)\n- `malt` for [MALT](https://software-ab.informatik.uni-tuebingen.de/download/malt/welcome.html)\n\n:warning: **Important** It is very important to run `nextflow clean -f` on your\nNextflow run directory once completed. RMA6 files are VERY large and are\n_copied_ from a `work/` directory into the results folder. You should clean the\nwork directory with the command to avoid redundancy and large HDD\nfootprints!" }, "database": { "type": "string", "description": "Specify path to classifier database directory. For Kraken2 this can also be a `.tar.gz` of the directory.", "fa_icon": "fas fa-database", "help_text": "Specify the path to the _directory_ containing your taxonomic classifier's database (malt or kraken).\n\nFor Kraken2, it can be either the path to the _directory_ or the path to the `.tar.gz` compressed directory of the Kraken2 database." }, "metagenomic_min_support_reads": { "type": "integer", "default": 1, "description": "Specify a minimum number of reads a taxon of sample total is required to have to be retained. 
Not compatible with --malt_min_support_mode 'percent'.", "fa_icon": "fas fa-sort-numeric-up-alt", "help_text": "Specify the minimum number of reads a given taxon is required to have to be retained as a positive 'hit'. \nFor malt, this only applies when `--malt_min_support_mode` is set to 'reads'. Default: 1.\n\n> Modifies MALT or kraken_parse.py parameter: `-sup` and `-c` respectively\n" }, "percent_identity": { "type": "integer", "default": 85, "description": "Percent identity value threshold for MALT.", "fa_icon": "fas fa-id-card", "help_text": "Specify the minimum percent identity (or similarity) a sequence must have to the reference for it to be retained. Default is `85`\n\nOnly used when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-id`" }, "malt_mode": { "type": "string", "default": "BlastN", "description": "Specify which alignment mode to use for MALT. Options: 'Unknown', 'BlastN', 'BlastP', 'BlastX', 'Classifier'.", "fa_icon": "fas fa-align-left", "help_text": "Use this to run the program in 'BlastN', 'BlastP', 'BlastX' modes to align DNA\nand DNA, protein and protein, or DNA reads against protein references\nrespectively. Ensure your database matches the mode. Check the\n[MALT\nmanual](http://ab.inf.uni-tuebingen.de/data/software/malt/download/manual.pdf)\nfor more details. Default: `'BlastN'`\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-m`\n", "enum": [ "BlastN", "BlastP", "BlastX" ] }, "malt_alignment_mode": { "type": "string", "default": "SemiGlobal", "description": "Specify alignment method for MALT. Options: 'Local', 'SemiGlobal'.", "fa_icon": "fas fa-align-center", "help_text": "Specify what alignment algorithm to use. Options are 'Local' or 'SemiGlobal'. Local is a BLAST like alignment, but is much slower. Semi-global alignment aligns reads end-to-end. 
Default: `'SemiGlobal'`\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-at`", "enum": [ "Local", "SemiGlobal" ] }, "malt_top_percent": { "type": "integer", "default": 1, "description": "Specify the percent for LCA algorithm for MALT (see MEGAN6 CE manual).", "fa_icon": "fas fa-percent", "help_text": "Specify the top percent value of the LCA algorithm. From the [MALT manual](http://ab.inf.uni-tuebingen.de/data/software/malt/download/manual.pdf): \"For each\nread, only those matches are used for taxonomic placement whose bit disjointScore is within\n10% of the best disjointScore for that read.\". Default: `1`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-top`" }, "malt_min_support_mode": { "type": "string", "default": "percent", "description": "Specify whether to use percent or raw number of reads for minimum support required for taxon to be retained for MALT. Options: 'percent', 'reads'.", "fa_icon": "fas fa-drumstick-bite", "help_text": "Specify whether to use a percentage, or raw number of reads as the value used to decide the minimum support a taxon requires to be retained.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-sup -supp`", "enum": [ "percent", "reads" ] }, "malt_min_support_percent": { "type": "number", "default": 0.01, "description": "Specify the minimum percentage of reads a taxon of sample total is required to have to be retained for MALT.", "fa_icon": "fas fa-percentage", "help_text": "Specify the minimum number of reads (as a percentage of all assigned reads) a given taxon is required to have to be retained as a positive 'hit' in the RMA6 file. This only applies when `--malt_min_support_mode` is set to 'percent'. 
Default 0.01.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-supp`" }, "malt_max_queries": { "type": "integer", "default": 100, "description": "Specify the maximum number of queries a read can have for MALT.", "fa_icon": "fas fa-phone", "help_text": "Specify the maximum number of alignments a read can have. All further alignments are discarded. Default: `100`\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-mq`" }, "malt_memory_mode": { "type": "string", "default": "load", "description": "Specify the memory load method. Do not use 'map' with GPFS file systems for MALT as it can be very slow. Options: 'load', 'page', 'map'.", "fa_icon": "fas fa-memory", "help_text": "\nHow to load the database into memory. Options are `'load'`, `'page'` or `'map'`.\n'load' directly loads the entire database into memory prior to seed look-up, this\nis slow but compatible with all servers/file systems. `'page'` and `'map'`\nperform a sort of 'chunked' database loading, allowing seed look-up prior to entire\ndatabase loading. Note that Page and Map modes do not work properly with\nmany remote file-systems such as GPFS. Default is `'load'`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `--memoryMode`", "enum": [ "load", "page", "map" ] }, "malt_sam_output": { "type": "boolean", "description": "Specify to also produce SAM alignment files. Note this includes both aligned and unaligned reads, and are gzipped. Note this will result in very large file sizes.", "fa_icon": "fas fa-file-alt", "help_text": "Specify to _also_ produce gzipped SAM files of all alignments and un-aligned reads in addition to RMA6 files. These are **not** soft-clipped or in 'sparse' format. Can be useful for downstream analyses due to more common file format. 
\n\n:warning: can result in very large run output directories as this is essentially duplication of the RMA6 files.\n\n> Modifies MALT parameter `-a -f`" } }, "fa_icon": "fas fa-search", "help_text": "\nAn increasingly common line of analysis in high-throughput aDNA analysis today\nis simultaneously screening off-target reads of the host for endogenous\nmicrobial signals - particularly of pathogens. Metagenomic screening is\ncurrently offered via MALT with aDNA specific verification via MaltExtract, or\nKraken2.\n\nPlease note the following:\n\n- :warning: Metagenomic screening is only performed on _unmapped_ reads from a\n mapping step.\n - You _must_ supply the `--run_bam_filtering` flag with unmapped reads in\n FASTQ format.\n - If you wish to run solely MALT (i.e. the HOPS pipeline), you must still\n supply a small decoy genome like phiX or human mtDNA `--fasta`.\n- MALT database construction functionality is _not_ included within the pipeline\n - this should be done independently, **prior** to the nf-core/eager run.\n - To use `malt-build` from the same version as `malt-run`, load either the\n Docker, Singularity or Conda environment.\n- MALT can often require very large computing resources depending on your\n database. We set an absolute minimum of 16 cores and 128GB of memory (which is\n 1/4 of the recommendation from the developer). Please leave an issue on the\n [nf-core github](https://github.com/nf-core/eager/issues) if you would like to\n see this changed.\n\n> :warning: Running MALT on a server with less than 128GB of memory should be\n> performed at your own risk.\n\nIf using TSV input, metagenomic screening is performed on all samples gathered\ntogether." 
}, "metagenomic_authentication": { "title": "Metagenomic Authentication", "type": "object", "description": "Options for authentication of metagenomic screening performed by MALT.", "default": "", "properties": { "run_maltextract": { "type": "boolean", "description": "Turn on MaltExtract for MALT aDNA characteristics authentication.", "fa_icon": "fas fa-power-off", "help_text": "Turn on MaltExtract for MALT aDNA characteristics authentication of metagenomic output from MALT.\n\nMore can be seen in the [MaltExtract documentation](https://github.com/rhuebler/)\n\nOnly when `--metagenomic_tool malt` is also supplied" }, "maltextract_taxon_list": { "type": "string", "description": "Path to a text file with taxa of interest (one taxon per row, NCBI taxonomy name format)", "fa_icon": "fas fa-list-ul", "help_text": "\nPath to a `.txt` file with taxa of interest you wish to assess for aDNA characteristics. The `.txt` file should contain one taxon per row, and each taxon should be in a valid [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) name format.\n\nOnly when `--metagenomic_tool malt` is also supplied." }, "maltextract_ncbifiles": { "type": "string", "description": "Path to directory containing NCBI resource files (ncbi.tre and ncbi.map; available: https://github.com/rhuebler/HOPS/)", "fa_icon": "fas fa-database", "help_text": "Path to directory containing the NCBI resource tree and taxonomy table files (ncbi.tre and ncbi.map; available at the [HOPS repository](https://github.com/rhuebler/HOPS/Resources)).\n\nOnly when `--metagenomic_tool malt` is also supplied." }, "maltextract_filter": { "type": "string", "default": "def_anc", "description": "Specify which MaltExtract filter to use. Options: 'def_anc', 'ancient', 'default', 'crawl', 'scan', 'srna', 'assignment'.", "fa_icon": "fas fa-filter", "help_text": "Specify which MaltExtract filter to use. This is used to specify what types of characteristics to scan for. 
The default will output statistics on all alignments, and then a second set with just reads with one C to T mismatch in the first 5 bases. Further details on other parameters can be seen in the [HOPS documentation](https://github.com/rhuebler/HOPS/#maltextract-parameters). Options: `'def_anc'`, `'ancient'`, `'default'`, `'crawl'`, `'scan'`, `'srna'`, `'assignment'`. Default: `'def_anc'`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `-f`", "enum": [ "def_anc", "default", "ancient", "scan", "crawl", "srna" ] }, "maltextract_toppercent": { "type": "number", "default": 0.01, "description": "Specify percent of top alignments to use.", "fa_icon": "fas fa-percent", "help_text": "Specify frequency of top alignments for each read to be considered for each node.\nDefault is 0.01, i.e. 1% of all reads (where 1 would correspond to 100%).\n\n> :warning: this parameter follows the same concept as `--malt_top_percent` but\n> uses a different notation i.e. integer (MALT) versus float (MaltExtract)\n\nDefault: `0.01`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `-a`" }, "maltextract_destackingoff": { "type": "boolean", "description": "Turn off destacking.", "fa_icon": "fas fa-align-center", "help_text": "Turn off destacking. If left on, a read that overlaps with another read will be\nremoved (leaving a depth coverage of 1).\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--destackingOff`" }, "maltextract_downsamplingoff": { "type": "boolean", "description": "Turn off downsampling.", "fa_icon": "fab fa-creative-commons-sampling", "help_text": "Turn off downsampling. By default, downsampling is on and will randomly select 10,000 reads if the number of reads on a node exceeds this number. 
This is to speed up processing, under the assumption that at 10,000 reads the species is a 'true positive'.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--downSampOff`" }, "maltextract_duplicateremovaloff": { "type": "boolean", "description": "Turn off duplicate removal.", "fa_icon": "fas fa-align-left", "help_text": "\nTurn off duplicate removal. By default, reads that are an exact copy (i.e. same start, stop coordinate and exact sequence match) will be removed as they are considered PCR duplicates.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--dupRemOff`" }, "maltextract_matches": { "type": "boolean", "description": "Turn on exporting alignments of hits in BLAST format.", "fa_icon": "fas fa-equals", "help_text": "\nExport alignments of hits for each node in BLAST format. By default turned off.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--matches`" }, "maltextract_megansummary": { "type": "boolean", "description": "Turn on export of MEGAN summary files.", "fa_icon": "fas fa-download", "help_text": "Export 'minimal' summary files (i.e. without alignments) that can be loaded into [MEGAN6](https://doi.org/10.1371/journal.pcbi.1004957). By default turned off.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--meganSummary`" }, "maltextract_percentidentity": { "type": "number", "description": "Minimum percent identity alignments are required to have to be reported. Recommended to set same as MALT parameter.", "default": 85, "fa_icon": "fas fa-id-card", "help_text": "Minimum percent identity alignments are required to have to be reported. Higher values allow fewer mismatches between read and reference sequence, but therefore will provide greater confidence in the hit. 
Lower values allow more mismatches, which can account for damage and divergence of a related strain/species to the reference. Recommended to set same as MALT parameter or higher. Default: `85`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--minPI`" }, "maltextract_topalignment": { "type": "boolean", "description": "Turn on using top alignments per read after filtering.", "fa_icon": "fas fa-star-half-alt", "help_text": "Use the best alignment of each read for every statistic, except for those concerning read distribution and coverage. Default: off.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--useTopAlignment`" } }, "fa_icon": "fas fa-tasks", "help_text": "Turn on MaltExtract for MALT aDNA characteristics authentication of metagenomic\noutput from MALT.\n\nMore can be seen in the [MaltExtract\ndocumentation](https://github.com/rhuebler/)\n\nOnly when `--metagenomic_tool malt` is also supplied" } }, "allOf": [ { "$ref": "#/definitions/input_output_options" }, { "$ref": "#/definitions/input_data_additional_options" }, { "$ref": "#/definitions/reference_genome_options" }, { "$ref": "#/definitions/output_options" }, { "$ref": "#/definitions/generic_options" }, { "$ref": "#/definitions/max_job_request_options" }, { "$ref": "#/definitions/institutional_config_options" }, { "$ref": "#/definitions/skip_steps" }, { "$ref": "#/definitions/complexity_filtering" }, { "$ref": "#/definitions/read_merging_and_adapter_removal" }, { "$ref": "#/definitions/mapping" }, { "$ref": "#/definitions/host_removal" }, { "$ref": "#/definitions/bam_filtering" }, { "$ref": "#/definitions/deduplication" }, { "$ref": "#/definitions/library_complexity_analysis" }, { "$ref": "#/definitions/adna_damage_analysis" }, { "$ref": "#/definitions/feature_annotation_statistics" }, { "$ref": "#/definitions/bam_trimming" }, { "$ref": "#/definitions/genotyping" }, { "$ref": 
"#/definitions/consensus_sequence_generation" }, { "$ref": "#/definitions/snp_table_generation" }, { "$ref": "#/definitions/mitochondrial_to_nuclear_ratio" }, { "$ref": "#/definitions/human_sex_determination" }, { "$ref": "#/definitions/nuclear_contamination_for_human_dna" }, { "$ref": "#/definitions/metagenomic_screening" }, { "$ref": "#/definitions/metagenomic_authentication" } ] }