Repository: nf-core/eager
Branch: master
Commit: 3f9d64ced5e2
Files: 73
Total size: 774.9 KB

Directory structure:
gitextract_rzjf5t_w/

├── .gitattributes
├── .github/
│   ├── .dockstore.yml
│   ├── CONTRIBUTING.md
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   ├── config.yml
│   │   └── feature_request.md
│   ├── PULL_REQUEST_TEMPLATE/
│   │   └── pull_request_template.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── markdownlint.yml
│   ├── workflows/
│   │   ├── awsfulltest.yml
│   │   ├── awstest.yml
│   │   ├── branch.yml
│   │   ├── ci.yml
│   │   ├── linting.yml
│   │   ├── linting_comment.yml
│   │   ├── push_dockerhub_dev.yml
│   │   └── push_dockerhub_release.yml
│   └── yamllint.yml
├── .gitignore
├── .gitpod.yml
├── .nf-core-lint.yml
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── LICENSE
├── README.md
├── assets/
│   ├── angsd_resources/
│   │   ├── README
│   │   └── getALL.txt
│   ├── email_template.html
│   ├── email_template.txt
│   ├── multiqc_config.yaml
│   ├── nf-core_eager_dummy.txt
│   ├── nf-core_eager_dummy2.txt
│   ├── sendmail_template.txt
│   └── where_are_my_files.txt
├── bin/
│   ├── endorS.py
│   ├── extract_map_reads.py
│   ├── filter_bam_fragment_length.py
│   ├── kraken_parse.py
│   ├── markdown_to_html.py
│   ├── merge_kraken_res.py
│   ├── parse_snp_cov.py
│   ├── print_x_contamination.py
│   └── scrape_software_versions.py
├── conf/
│   ├── base.config
│   ├── benchmarking_human.config
│   ├── benchmarking_vikingfish.config
│   ├── igenomes.config
│   ├── test.config
│   ├── test_direct.config
│   ├── test_full.config
│   ├── test_resources.config
│   ├── test_stresstest_human.config
│   ├── test_tsv_bam.config
│   ├── test_tsv_complex.config
│   ├── test_tsv_fna.config
│   ├── test_tsv_humanbam.config
│   ├── test_tsv_kraken.config
│   └── test_tsv_pretrim.config
├── docs/
│   ├── README.md
│   ├── images/
│   │   ├── README.md
│   │   └── usage/
│   │       └── nfcore-eager_tsv_template.tsv
│   ├── output.md
│   └── usage.md
├── environment.yml
├── lib/
│   ├── Checks.groovy
│   ├── Completion.groovy
│   ├── Headers.groovy
│   ├── NfcoreSchema.groovy
│   └── nfcore_external_java_deps.jar
├── main.nf
├── nextflow.config
└── nextflow_schema.json

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitattributes
================================================
*.config linguist-language=nextflow


================================================
FILE: .github/.dockstore.yml
================================================
# Dockstore config version, not pipeline version
version: 1.2
workflows:
  - subclass: nfl
    primaryDescriptorPath: /nextflow.config
    publish: True


================================================
FILE: .github/CONTRIBUTING.md
================================================
# nf-core/eager: Contributing Guidelines

Hi there!
Many thanks for taking an interest in improving nf-core/eager.

We try to manage the required tasks for nf-core/eager using GitHub issues; you probably came to this page when creating one.
Please use the pre-filled template to save time.

However, don't be put off by this template - other more general issues and suggestions are welcome!
Contributions to the code are even more welcome ;)

> If you need help using or modifying nf-core/eager then the best place to ask is on the nf-core Slack [#eager](https://nfcore.slack.com/channels/eager) channel ([join our Slack here](https://nf-co.re/join/slack)).

## Contribution workflow

If you'd like to write some code for nf-core/eager, the standard workflow is as follows:

1. Check that there isn't already an issue about your idea in the [nf-core/eager issues](https://github.com/nf-core/eager/issues) to avoid duplicating work
    * If there isn't one already, please create one so that others know you're working on this
2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [nf-core/eager repository](https://github.com/nf-core/eager) to your GitHub account
3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions)
4. Use `nf-core schema build .` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10).
5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged

If you're not used to this workflow with git, you can start with some [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests) or even their [excellent `git` resources](https://try.github.io/).

## Tests

When you create a pull request with changes, [GitHub Actions](https://github.com/features/actions) will run automatic tests.
Typically, pull-requests are only fully reviewed when these tests are passing, though of course we can help out before then.

There are typically two types of tests that run:

### Lint tests

`nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to.
To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command.

If any failures or warnings are encountered, please follow the listed URL for more documentation.

### Pipeline tests

Each `nf-core` pipeline should be set up with a minimal set of test-data.
`GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully.
If there are any failures then the automated tests fail.
These tests are run both with the latest available version of `Nextflow` and also the minimum required version that is stated in the pipeline code.

## Patch

:warning: Only in the unlikely and regrettable event of a release happening with a bug.

* On your own fork, make a new branch `patch` based on `upstream/master`.
* Fix the bug, and bump version (X.Y.Z+1).
* A PR should be made on `master` from `patch` to directly address this particular bug.

## Getting help

For further information/help, please consult the [nf-core/eager documentation](https://nf-co.re/eager/usage) and don't hesitate to get in touch on the nf-core Slack [#eager](https://nfcore.slack.com/channels/eager) channel ([join our Slack here](https://nf-co.re/join/slack)).

## Pipeline contribution conventions

To make the nf-core/eager code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written.

### Adding a new step

If you wish to contribute a new step, please use the following coding standards:

1. Define the corresponding input channel into your new process from the expected previous process channel
2. Write the process block (see below).
3. Define the output channel if needed (see below).
4. Add any new flags/options to `nextflow.config` with a default (see below).
5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build .`).
6. Add sanity checks for all relevant parameters.
7. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`.
8. Do local tests that the new code works properly and as expected.
9. Add a new test command in `.github/workflows/ci.yml`.
10. If applicable add a [MultiQC](https://multiqc.info/) module.
11. Update MultiQC config `assets/multiqc_config.yaml` so relevant suffixes, name clean up, General Statistics Table column order, and module figures are in the right order.
12. Optional: Add any descriptions of MultiQC report sections and output files to `docs/output.md`.

### Default values

Parameters should be initialised / defined with default values in `nextflow.config` under the `params` scope.

Once there, use `nf-core schema build .` to add to `nextflow_schema.json`.

### Default processes resource requirements

Sensible defaults for process resource requirements (CPUs / memory / time) should be defined in `conf/base.config`. These should generally be specified generically with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. An nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single-core process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.

:warning: Note that in nf-core/eager we currently have our own custom process labels, so please check `base.config`!

The process resources can be passed on to the tool dynamically within the process with the `${task.cpus}` and `${task.memory}` variables in the `script:` block.

### Naming schemes

Please use the following naming schemes, to make it easy to understand what is going where.

* initial process channel: `ch_output_from_<process>`
* intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`
* skipped process output: `ch_<previousstage>_for_<skipprocess>` (this comes out of the bypass statement described below)
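
As a rough illustration of the scheme above, the following Python sketch builds channel names from process names. The helper functions (`output_channel`, `handoff_channel`, `skip_channel`) are hypothetical and exist only for this example; they are not part of the pipeline.

```python
def output_channel(process):
    """Channel emitted directly by a process: ch_output_from_<process>."""
    return f"ch_output_from_{process}"


def handoff_channel(previous, following):
    """Channel passed between steps: ch_<previousprocess>_for_<nextprocess>."""
    return f"ch_{previous}_for_{following}"


def skip_channel(previous_stage, skipped):
    """Channel that carries data past a skipped step: ch_<previousstage>_for_skip<process>."""
    return f"ch_{previous_stage}_for_skip{skipped}"


print(output_channel("fastp"))                 # ch_output_from_fastp
print(handoff_channel("convertbam", "fastp"))  # ch_convertbam_for_fastp
print(skip_channel("convertbam", "fastp"))     # ch_convertbam_for_skipfastp
```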

### Nextflow version bumping

If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`

### Software version reporting

If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process.

Add to the script block of the process, something like the following:

```bash
<YOUR_TOOL> --version &> v_<YOUR_TOOL>.txt 2>&1 || true
```

or

```bash
<YOUR_TOOL> --help | head -n 1 &> v_<YOUR_TOOL>.txt 2>&1 || true
```

You then need to edit the script `bin/scrape_software_versions.py` to:

1. Add a Python regex for your tool's `--version` output (as stored in the `v_<YOUR_TOOL>.txt` file), to ensure the version is reported as a `v` followed by the version number, e.g. `v2.1.1`
2. Add an HTML entry to the `OrderedDict` for formatting in MultiQC.
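
To make the regex step more concrete, here is a minimal sketch of how a version string might be pulled out of captured `--version` output. It does not reproduce `bin/scrape_software_versions.py`; the tool name `MyTool`, its output format, the `VERSION_PATTERNS` dictionary, and the `extract_version` helper are all assumptions made for illustration.

```python
import re
from collections import OrderedDict

# Hypothetical entry: tool name -> regex with one capture group for the version.
VERSION_PATTERNS = OrderedDict([
    ("MyTool", r"mytool version (\S+)"),
])


def extract_version(raw_output, pattern):
    """Return the captured version prefixed with 'v', or 'N/A' if nothing matches."""
    match = re.search(pattern, raw_output)
    if match is None:
        return "N/A"
    return "v{}".format(match.group(1))


# Made-up `--version` output, as it might appear in v_mytool.txt.
sample = "mytool version 2.1.1\n"
print(extract_version(sample, VERSION_PATTERNS["MyTool"]))  # v2.1.1
```

Returning a placeholder when the regex does not match keeps the report usable even if a tool changes its version output.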

### Images and figures

For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).

For all internal nf-core/eager documentation images we are using the 'Kalam' font by the Indian Type Foundry, licensed under the Open Font License. It can be downloaded [here](https://fonts.google.com/specimen/Kalam).

## Process Concept

We are providing a highly configurable pipeline, with many options to turn on and off different processes in different combinations. This can make a very complex graph structure that can cause a large amount of duplicated channels coming out of every process to account for each possible combination.

The EAGER pipeline can currently be broken down into the following 'stages', where a stage is a collection of non-terminal, mutually exclusive processes whose output is used by another file-processing module (but not reporting!).

* Input
* Convert BAM
* PolyG Clipping
* AdapterRemoval
* Mapping (either `bwa`, `bwamem`, or `circularmapper`)
* BAM Filtering
* Deduplication (either `dedup` or `markduplicates`)
* BAM Trimming
* PMDtools
* Genotyping

Every step can potentially be skipped, therefore the output of a previous stage must be able to be passed to the next stage, if the given stage is not run.

To somewhat simplify this logic, we have implemented the following structure.

The concept is as follows:

* Every 'stage' of the pipeline (i.e. collection of mutually exclusive processes) must always have an if-else statement following it.
* This if-else 'bypass' statement collects and standardises all possible input files into single channel(s) for the next stage.
* Importantly - within the bypass statement, a channel from the previous stage's bypass mixes into these output channels. This additional channel is named `ch_previousstage_for_skipcurrentstage`. This contains the output from the previous stage, i.e. not the modified version from the current stage.
* The bypass statement works as follows:
  * If the current stage is turned on: will mix the previous stage and current stage output and filter for file suffixes unique to the current stage output
  * If the current stage is turned off or skipped: will mix the previous stage and current stage output. However, as there are no files in the output channel from the current stage, no filtering is required and the files in the `ch_XXX_for_skipXXX` channel will be used.

This ensures that the channel inputs to the next stage are 'homogeneous', i.e. they all come from the same source (the bypass statement).

An example schematic can be given as follows:

```nextflow
// PREVIOUS STAGE OUTPUT
// `run_convertbam` is a schematic flag name, like `run_fastp` below
if (params.run_convertbam) {
    ch_input_for_skipconvertbam.mix(ch_output_from_convertbam)
        .filter { it =~/.*converted.fq/ }
        .into { ch_convertbam_for_fastp; ch_convertbam_for_skipfastp }
} else {
    ch_input_for_skipconvertbam
        .into { ch_convertbam_for_fastp; ch_convertbam_for_skipfastp }
}

// SKIPPABLE CURRENT STAGE PROCESS
process fastp {
    publishDir "${params.outdir}/fastp", mode: 'copy'

    when:
    params.run_fastp

    input:
    file fq from ch_convertbam_for_fastp

    output:
    file "*pG.fq" into ch_output_from_fastp

    script:
    """
    echo "I have been fastp'd" > ${fq}  
    mv ${fq} ${fq}.pG.fq
    """
}

// NEXT STAGE INPUT PREPARATION
if (params.run_fastp) {
    ch_convertbam_for_skipfastp.mix(ch_output_from_fastp)
        .filter { it =~/.*pG.fq/ }
        .into { ch_fastp_for_adapterremoval; ch_fastp_for_skipadapterremoval }
} else {
    ch_convertbam_for_skipfastp
        .into { ch_fastp_for_adapterremoval; ch_fastp_for_skipadapterremoval }
}

 ```
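
The bypass logic can also be modelled outside Nextflow. The sketch below uses plain Python lists in place of channels and a hypothetical `bypass` function; it only illustrates the mix-then-filter idea and makes no claim about how Nextflow channels behave internally. File names are made up.

```python
import re


def bypass(previous_files, current_files, stage_enabled, suffix_pattern):
    """Model of the if/else bypass: return the files handed to the next stage.

    previous_files: files from the previous stage's bypass (the 'skip' channel).
    current_files:  files emitted by the current stage (empty when it is off).
    """
    if stage_enabled:
        # Mix both sources, then keep only files with the current stage's suffix.
        mixed = previous_files + current_files
        return [f for f in mixed if re.search(suffix_pattern, f)]
    # Stage skipped: pass the previous stage's files through untouched.
    return list(previous_files)


converted = ["sample1.converted.fq"]
print(bypass(converted, ["sample1.converted.fq.pG.fq"], True, r"pG\.fq$"))
print(bypass(converted, [], False, r"pG\.fq$"))
```

In both cases the next stage receives a single list from the same place, which mirrors the 'homogeneous' input described above.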


================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Report something that is broken or incorrect
labels: bug
---

<!--
# nf-core/eager bug report

Hi there!

Thanks for telling us about a problem with the pipeline.
Please delete this text and anything that's not relevant from the template below:
-->

## Check Documentation

I have checked the following places for my error:

- [ ] [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting)
- [ ] [nf-core/eager pipeline documentation](https://nf-co.re/eager/usage)
      - nf-core/eager FAQ/troubleshooting can be found [here](https://nf-co.re/eager/usage#troubleshooting-and-faqs)

## Description of the bug

<!-- A clear and concise description of what the bug is. -->

## Steps to reproduce

Steps to reproduce the behaviour:

1. Command line: `nextflow run ...`
2. See error: _Please provide your error message_

## Expected behaviour

<!-- A clear and concise description of what you expected to happen. -->

## Log files

Have you provided the following extra information/files:

- [ ] The command used to run the pipeline
- [ ] The `.nextflow.log` file <!-- this is a hidden file in the directory where you launched the pipeline -->
- [ ] The exact error: <!-- [Please provide your error message] -->

## System

- Hardware: <!-- [e.g. HPC, Desktop, Cloud...] -->
- Executor: <!-- [e.g. slurm, local, awsbatch...] -->
- OS: <!-- [e.g. CentOS Linux, macOS, Linux Mint...] -->
- Version: <!-- [e.g. 7, 10.13.6, 18.3...] -->

## Nextflow Installation

- Version: <!-- [e.g. 19.10.0] -->

## Container engine

- Engine: <!-- [e.g. Conda, Docker, Singularity, Podman, Shifter or Charliecloud] -->
- version: <!-- [e.g. 1.0.0] -->
- Image tag: <!-- [e.g. nfcore/eager:1.0.0] -->

## Additional context

<!-- Add any other context about the problem here. -->


================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: false
contact_links:
  - name: Join nf-core
    url: https://nf-co.re/join
    about: Please join the nf-core community here
  - name: "Slack #eager channel"
    url: https://nfcore.slack.com/channels/eager
    about: Discussion about the nf-core/eager pipeline


================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: Feature request
about: Suggest an idea for the nf-core/eager pipeline
labels: enhancement
---

<!--
# nf-core/eager feature request

Hi there!

Thanks for suggesting a new feature for the pipeline!
Please delete this text and anything that's not relevant from the template below:
-->

## Is your feature request related to a problem? Please describe

<!-- A clear and concise description of what the problem is. -->

<!-- e.g. [I'm always frustrated when ...] -->

## Describe the solution you'd like

<!-- A clear and concise description of what you want to happen. -->

## Describe alternatives you've considered

<!-- A clear and concise description of any alternative solutions or features you've considered. -->

## Additional context

<!-- Add any other context about the feature request here. -->


================================================
FILE: .github/PULL_REQUEST_TEMPLATE/pull_request_template.md
================================================
Many thanks for contributing to nf-core/eager!

Please fill in the appropriate checklist below (delete whatever is not relevant). These are the most common things requested on pull requests (PRs).

## PR checklist

 - [ ] This comment contains a description of changes (with reason).
 - [ ] If you've fixed a bug or added code that should be tested, add tests!
   - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py`
   - [ ] If necessary, also make a PR on the [nf-core/eager branch on the nf-core/test-datasets repo](https://github.com/nf-core/test-datasets/pull/new/nf-core/eager).
 - [ ] Make sure your code lints (`nf-core lint .`).
 - [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`).
 - [ ] Usage Documentation in `docs/usage.md` is updated.
 - [ ] Output Documentation in `docs/output.md` is updated.
 - [ ] `CHANGELOG.md` is updated.
 - [ ] `README.md` is updated (including new tool citations and authors/contributors).

**Learn more about contributing:** https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md


================================================
FILE: .github/PULL_REQUEST_TEMPLATE.md
================================================
<!--
# nf-core/eager pull request

Many thanks for contributing to nf-core/eager!

Please fill in the appropriate checklist below (delete whatever is not relevant).
These are the most common things requested on pull requests (PRs).

Remember that PRs should be made against the dev branch, unless you're preparing a pipeline release.

Learn more about contributing: [CONTRIBUTING.md](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md)
-->
<!-- markdownlint-disable ul-indent -->

## PR checklist

- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
    - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py`
    - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md)?
    - [ ] If necessary, also make a PR on the nf-core/eager _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository.
- [ ] Make sure your code lints (`nf-core lint .`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).


================================================
FILE: .github/markdownlint.yml
================================================
# Markdownlint configuration file
default: true
line-length: false
no-duplicate-header:
    siblings_only: true
no-inline-html:
    allowed_elements:
        - img
        - p
        - kbd
        - details
        - summary


================================================
FILE: .github/workflows/awsfulltest.yml
================================================
name: nf-core AWS full size tests
# This workflow is triggered on published releases.
# It can be additionally triggered manually with GitHub actions workflow dispatch.
# It runs the -profile 'test_full' on AWS batch

on:
  workflow_run:
    workflows: ["nf-core Docker push (release)"]
    types: [completed]
  workflow_dispatch:


env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }}
  AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }}
  AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }}
  AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}


jobs:
  run-awstest:
    name: Run AWS full tests
    if: github.repository == 'nf-core/eager'
    runs-on: ubuntu-latest
    steps:
      - name: Setup Miniconda
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          python-version: 3.7
      - name: Install awscli
        run: conda install -c conda-forge awscli
      - name: Start AWS batch job
        # Add full size test data (but still relatively small datasets for few samples)
        # on the `test_full.config` test runs with only one set of parameters
        # Then specify `-profile test_full` instead of `-profile test` on the AWS batch command
        run: |
          aws batch submit-job \
            --region eu-west-1 \
            --job-name nf-core-eager \
            --job-queue $AWS_JOB_QUEUE \
            --job-definition $AWS_JOB_DEFINITION \
            --container-overrides '{"command": ["nf-core/eager", "-r '"${GITHUB_SHA}"' -profile test_full --outdir s3://'"${AWS_S3_BUCKET}"'/eager/results-'"${GITHUB_SHA}"' -w s3://'"${AWS_S3_BUCKET}"'/eager/work-'"${GITHUB_SHA}"' -with-tower"], "environment": [{"name": "TOWER_ACCESS_TOKEN", "value": "'"$TOWER_ACCESS_TOKEN"'"}]}'


================================================
FILE: .github/workflows/awstest.yml
================================================
name: nf-core AWS test
# This workflow is triggered manually with GitHub Actions workflow dispatch.
# It runs the -profile 'test' on AWS batch.

on:
  workflow_dispatch:


env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }}
  AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }}
  AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }}
  AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}


jobs:
  run-awstest:
    name: Run AWS tests
    if: github.repository == 'nf-core/eager'
    runs-on: ubuntu-latest
    steps:
      - name: Setup Miniconda
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          python-version: 3.7
      - name: Install awscli
        run: conda install -c conda-forge awscli
      - name: Start AWS batch job
        # For example: adding multiple test runs with different parameters
        # Remember that you can parallelise this by using strategy.matrix
        run: |
          aws batch submit-job \
          --region eu-west-1 \
          --job-name nf-core-eager \
          --job-queue $AWS_JOB_QUEUE \
          --job-definition $AWS_JOB_DEFINITION \
          --container-overrides '{"command": ["nf-core/eager", "-r '"${GITHUB_SHA}"' -profile test_tsv_complex --outdir s3://'"${AWS_S3_BUCKET}"'/eager/results-'"${GITHUB_SHA}"' -w s3://'"${AWS_S3_BUCKET}"'/eager/work-'"${GITHUB_SHA}"' -with-tower"], "environment": [{"name": "TOWER_ACCESS_TOKEN", "value": "'"$TOWER_ACCESS_TOKEN"'"}]}'


================================================
FILE: .github/workflows/branch.yml
================================================
name: nf-core branch protection
# This workflow is triggered on PRs to master branch on the repository
# It fails when someone tries to make a PR against the nf-core `master` branch instead of `dev`
on:
  pull_request_target:
    branches: [master]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # PRs to the nf-core repo master branch are only ok if coming from the nf-core repo `dev` or any `patch` branches
      - name: Check PRs
        if: github.repository == 'nf-core/eager'
        run: |
          { [[ ${{github.event.pull_request.head.repo.full_name }} == nf-core/eager ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]]


      # If the above check failed, post a comment on the PR explaining the failure
      # NOTE - this doesn't currently work if the PR is coming from a fork, due to limitations in GitHub actions secrets
      - name: Post PR comment
        if: failure()
        uses: mshick/add-pr-comment@v1
        with:
          message: |
            ## This PR is against the `master` branch :x:

            * Do not close this PR
            * Click _Edit_ and change the `base` to `dev`
            * This CI test will remain failed until you push a new commit

            ---

            Hi @${{ github.event.pull_request.user.login }},

            It looks like this pull-request has been made against the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `master` branch.
            The `master` branch on nf-core repositories should always contain code from the latest release.
            Because of this, PRs to `master` are only allowed if they come from the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `dev` branch.

            You do not need to close this PR, you can change the target branch to `dev` by clicking the _"Edit"_ button at the top of this page.
            Note that even after this, the test will continue to show as failing until you push a new commit.

            Thanks again for your contribution!
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          allow-repeats: false


================================================
FILE: .github/workflows/ci.yml
================================================
name: nf-core CI
# This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors
on:
  push:
    branches:
      - dev
  pull_request:
  release:
    types: [published]

# Uncomment if we need an edge release of Nextflow again
# env: NXF_EDGE: 1

jobs:
  test:
    name: Run workflow tests
    # Only run on push if this is the nf-core dev branch (merged PRs)
    if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/eager') }}
    runs-on: ubuntu-latest
    env:
      NXF_VER: ${{ matrix.nxf_ver }}
      NXF_ANSI_LOG: false
    strategy:
      matrix:
        # Nextflow versions: check pipeline minimum and current latest
        nxf_ver: ["20.07.1", "22.10.6"]
    steps:
      - name: Check out pipeline code
        uses: actions/checkout@v2
      - name: Install older Java
        uses: actions/setup-java@v4
        with:
          distribution: "temurin" # See 'Supported distributions' for available options
          java-version: "11"
      - name: Check if Dockerfile or Conda environment changed
        uses: technote-space/get-diff-action@v4
        with:
          FILES: |
            Dockerfile
            environment.yml

      - name: Build new docker image
        if: env.MATCHED_FILES
        run: docker build --no-cache . -t nfcore/eager:2.5.3

      - name: Pull docker image
        if: ${{ !env.MATCHED_FILES }}
        run: |
          docker pull nfcore/eager:dev
          docker tag nfcore/eager:dev nfcore/eager:2.5.3
      - name: Install Nextflow
        env:
          CAPSULE_LOG: none
        run: |
          wget -qO- https://github.com/nextflow-io/nextflow/releases/download/v22.10.6/nextflow | bash
          sudo mv nextflow /usr/local/bin/
      - name: HELPTEXT Run with the help flag
        run: |
          nextflow run ${GITHUB_WORKSPACE} --help
      - name: Get test data for cases where we don't use TSV input
        run: |
          git clone --single-branch --branch eager https://github.com/nf-core/test-datasets.git data
      - name: DELAY to try to address some odd behaviour with what appears to be a conflict between parallel htslib jobs leading to CI hangs
        run: |
          if [[ $NXF_VER = '' ]]; then sleep 1200; fi
      - name: BASIC Run the basic pipeline with directly supplied single-end FASTQ
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_direct,docker --input 'data/testdata/Mammoth/fastq/*_R1_*.fq.gz' --single_end
      - name: BASIC Run the basic pipeline with directly supplied paired-end FASTQ
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_direct,docker --input 'data/testdata/Mammoth/fastq/*_{R1,R2}_*tengrand.fq.gz'
      - name: BASIC Run the basic pipeline with supplied --input BAM
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_direct,docker --input 'data/testdata/Mammoth/bam/*_R1_*.bam' --bam --single_end
      - name: BASIC Run the basic pipeline with the test profile, with PE/SE and bwa aln
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --save_reference
      - name: REFERENCE Basic workflow, with supplied indices
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --bwa_index 'results/reference_genome/bwa_index/BWAIndex/' --fasta_index 'https://github.com/nf-core/test-datasets/blob/eager/reference/Mammoth/Mammoth_MT_Krause.fasta.fai'
      - name: REFERENCE Run the basic pipeline with FastA reference with `fna` extension
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker
      - name: REFERENCE Test with zipped reference input
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --fasta 'https://github.com/nf-core/test-datasets/raw/eager/reference/Mammoth/Mammoth_MT_Krause.fasta.gz'
      - name: FASTP Test fastp complexity filtering
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --complexity_filter_poly_g
      - name: ADAPTERREMOVAL Test skip paired end collapsing
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --skip_collapse
      - name: ADAPTERREMOVAL Test paired end collapsing but no trimming
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_pretrim,docker --skip_trim
      - name: ADAPTERREMOVAL Run the basic pipeline with paired end data without adapterRemoval
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --skip_adapterremoval
      - name: ADAPTERREMOVAL Run the basic pipeline with preserve5p end option
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --preserve5p
      - name: ADAPTERREMOVAL Run the basic pipeline with merged only option
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --mergedonly
      - name: ADAPTERREMOVAL Run the basic pipeline with preserve5p end and merged reads only options
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --preserve5p --mergedonly
      - name: ADAPTER LIST Run the basic pipeline using an adapter list
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --clip_adapters_list 'https://github.com/nf-core/test-datasets/raw/eager/databases/adapters/adapter-list.txt'
      - name: ADAPTER LIST Run the basic pipeline using an adapter list, skipping adapter removal
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --clip_adapters_list 'https://github.com/nf-core/test-datasets/raw/eager/databases/adapters/adapter-list.txt' --skip_adapterremoval
      - name: POST_AR_FASTQ_TRIMMING Run the basic pipeline post-adapterremoval FASTQ trimming
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_post_ar_trimming
      - name: POST_AR_FASTQ_TRIMMING Run the basic pipeline post-adapterremoval FASTQ trimming, but skip adapterremoval
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_post_ar_trimming --skip_adapterremoval
      - name: MAPPER_CIRCULARMAPPER Test running with CircularMapper
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --mapper 'circularmapper' --circulartarget 'NC_007596.2'
      - name: MAPPER_BWAMEM Test running with BWA Mem
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --mapper 'bwamem' --skip_collapse
      - name: MAPPER_BT2 Test running with BowTie2
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --mapper 'bowtie2' --bt2_alignmode 'local' --bt2_sensitivity 'sensitive' --bt2n 1 --bt2l 16 --bt2_trim5 1 --bt2_trim3 1
      - name: HOST_REMOVAL_FASTQ Run the basic pipeline with output unmapped reads as fastq
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --hostremoval_input_fastq
      - name: BAM_FILTERING Run basic mapping pipeline with mapping quality filtering, and unmapped export
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bam_filtering --bam_mapping_quality_threshold 37  --bam_unmapped_type 'fastq'
      - name: BAM_FILTERING Run basic mapping pipeline with post-mapping length filtering
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --clip_readlength 0 --run_bam_filtering --bam_filter_minreadlength 50
      - name: PRESEQ Run basic mapping pipeline with different preseq mode
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --preseq_mode 'lc_extrap' --preseq_maxextrap 10000 --preseq_bootstrap 10
      - name: DEDUPLICATION Test with dedup
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --dedupper 'dedup' --dedup_all_merged
      - name: BEDTOOLS Test bedtools feature annotation
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bedtools_coverage --anno_file 'https://github.com/nf-core/test-datasets/raw/eager/reference/Mammoth/Mammoth_MT_Krause.gff3'
      - name: MAPDAMAGE2 damage calculation
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --damage_calculation_tool 'mapdamage'
      - name: GENOTYPING_HC Test running GATK HaplotypeCaller
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker --run_genotyping --genotyping_tool 'hc' --gatk_hc_out_mode 'EMIT_ALL_ACTIVE_SITES' --gatk_hc_emitrefconf 'BP_RESOLUTION'
      - name: GENOTYPING_FB Test running FreeBayes
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_genotyping --genotyping_tool 'freebayes'
      - name: GENOTYPING_PC Test running pileupCaller
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --run_genotyping --genotyping_tool 'pileupcaller'
      - name: GENOTYPING_ANGSD Test running ANGSD genotype likelihood calculation
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --run_genotyping --genotyping_tool 'angsd'
      - name: GENOTYPING_BCFTOOLS Test running FreeBayes with bcftools stats turned on
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_genotyping --genotyping_tool 'freebayes' --run_bcftools_stats
      - name: SKIPPING Test checking all skip steps work i.e. input bam, skipping straight to genotyping
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --skip_preseq --skip_damage_calculation --run_genotyping --genotyping_tool 'freebayes'
      - name: TRIMBAM Test bamutils works alone
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_trim_bam
      - name: PMDTOOLS Test PMDtools works alone
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_pmdtools
      - name: GENOTYPING_UG AND MULTIVCFANALYZER Test running GATK UnifiedGenotyper and MultiVCFAnalyzer, additional VCFS
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_genotyping --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer --additional_vcf_files 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/vcf/JK2772_CATCAGTGAGTAGA_L008_R1_001.fastq.gz.tengrand.fq.combined.fq.mapped_rmdup.bam.unifiedgenotyper.vcf.gz' --write_allele_frequencies
      - name: COMPLEX LANE/LIBRARY MERGING Test running lane and library merging prior to GATK UnifiedGenotyper and running MultiVCFAnalyzer
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --run_genotyping --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer
      - name: GENOTYPING_UG ON TRIMMED BAM Test
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_genotyping --run_trim_bam --genotyping_source 'trimmed' --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP'
      - name: BAM_INPUT Run the basic pipeline with the bam input profile, skip AdapterRemoval as no convertBam
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --skip_adapterremoval
      - name: BAM_INPUT Run the basic pipeline with the bam input profile, convert to FASTQ for adapterremoval test and downstream
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --run_convertinputbam
      - name: METAGENOMIC Download MALT database
        run: |
          mkdir -p databases/malt
          readlink -f databases/malt/
          for i in index0.idx ref.db ref.idx ref.inf table0.db table0.idx taxonomy.idx taxonomy.map taxonomy.tre; do wget https://github.com/nf-core/test-datasets/raw/eager/databases/malt/"$i" -P databases/malt/; done
      - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into MALT
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bam_filtering  --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/" --malt_sam_output
      - name: METAGENOMIC Run the basic pipeline but low-complexity filtered reads going into MALT
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bam_filtering  --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/" --metagenomic_complexity_filter
      - name: MALTEXTRACT Download resource files
        run: |
          mkdir -p databases/maltextract
          for i in ncbi.tre ncbi.map; do wget https://github.com/rhuebler/HOPS/raw/0.33/Resources/"$i" -P databases/maltextract/; done
      - name: MALTEXTRACT Basic with MALT plus MaltExtract
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_bam_filtering  --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt" --run_maltextract --maltextract_ncbifiles "/home/runner/work/eager/eager/databases/maltextract/" --maltextract_taxon_list 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/maltextract/MaltExtract_list.txt'
      - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into Kraken
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_kraken,docker --run_bam_filtering  --bam_unmapped_type 'fastq'
      - name: SNPCAPTURE Run the basic pipeline with the bam input profile, generating statistics with a SNP capture bed
        run: |
          wget https://github.com/nf-core/test-datasets/raw/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz && gunzip 1240K.pos.list_hs37d5.0based.bed.gz
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --snpcapture_bed 1240K.pos.list_hs37d5.0based.bed
      - name: SEXDETERMINATION Run the basic pipeline with the bam input profile, but don't convert BAM, skip everything but sex determination
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --run_sexdeterrmine
      - name: NUCLEAR CONTAMINATION Run basic pipeline with bam input profile, but don't convert BAM, skip everything but nuclear contamination estimation
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --run_nuclear_contamination
      - name: MTNUCRATIO Run basic pipeline with bam input profile, but don't convert BAM, skip everything but mtnucratio
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --skip_preseq --skip_damage_calculation --run_mtnucratio
      - name: RESCALING Run basic pipeline but with mapDamage rescaling of BAM files. Note this will be slow
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --run_mapdamage_rescaling --run_genotyping --genotyping_tool hc --genotyping_source 'rescaled'


================================================
FILE: .github/workflows/linting.yml
================================================
name: nf-core linting
# This workflow is triggered on pushes and PRs to the repository.
# It runs the `nf-core lint` and markdown lint tests to ensure that the code meets the nf-core guidelines
on:
  push:
  pull_request:
  release:
    types: [published]

jobs:
  Markdown:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2

      - name: Install markdownlint
        run: npm install -g markdownlint-cli
      - name: Run Markdownlint
        run: markdownlint ${GITHUB_WORKSPACE} -c ${GITHUB_WORKSPACE}/.github/markdownlint.yml

      # If the above check failed, post a comment on the PR explaining the failure
      - name: Post PR comment
        if: failure()
        uses: mshick/add-pr-comment@v1
        with:
          message: |
            ## Markdown linting is failing

            To keep the code consistent with lots of contributors, we run automated code consistency checks.
            To fix this CI test, please run:

            * Install `markdownlint-cli`
                * On Mac: `brew install markdownlint-cli`
                * Everything else: [Install `npm`](https://www.npmjs.com/get-npm) then [install `markdownlint-cli`](https://www.npmjs.com/package/markdownlint-cli) (`npm install -g markdownlint-cli`)
            * Fix the markdown errors
                * Automatically: `markdownlint . --config .github/markdownlint.yml --fix`
                * Manually resolve anything left from `markdownlint . --config .github/markdownlint.yml`

            Once you push these changes the test should pass, and you can hide this comment :+1:

            We highly recommend setting up markdownlint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

            Thanks again for your contribution!
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          allow-repeats: false

  YAML:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - uses: actions/setup-node@v2

      - name: Install yaml-lint
        run: npm install -g yaml-lint
      - name: Run yaml-lint
        run: yamllint $(find ${GITHUB_WORKSPACE} -type f \( -name "*.yml" -o -name "*.yaml" \)) -c .github/yamllint.yml

      # If the above check failed, post a comment on the PR explaining the failure
      - name: Post PR comment
        if: failure()
        uses: mshick/add-pr-comment@v1
        with:
          message: |
            ## YAML linting is failing

            To keep the code consistent with lots of contributors, we run automated code consistency checks.
            To fix this CI test, please run:

            * Install `yaml-lint`
                * [Install `npm`](https://www.npmjs.com/get-npm) then [install `yaml-lint`](https://www.npmjs.com/package/yaml-lint) (`npm install -g yaml-lint`)
            * Fix the YAML errors
                * Run the test locally: `yamllint $(find . -type f -name "*.yml" -o -name "*.yaml")`
                * Fix any reported errors in your YAML files

            Once you push these changes the test should pass, and you can hide this comment :+1:

            We highly recommend setting up yaml-lint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

            Thanks again for your contribution!
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          allow-repeats: false

  nf-core:
    runs-on: ubuntu-latest
    steps:
      - name: Check out pipeline code
        uses: actions/checkout@v2

      - name: Install Nextflow
        env:
          CAPSULE_LOG: none
        run: |
          wget -qO- get.nextflow.io | bash
          sudo mv nextflow /usr/local/bin/

      - uses: actions/setup-python@v1
        with:
          python-version: "3.6"
          architecture: "x64"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install nf-core==1.14

      - name: Run nf-core lint
        env:
          GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }}
        run: nf-core -l lint_log.txt lint ${GITHUB_WORKSPACE} --markdown lint_results.md

      - name: Save PR number
        if: ${{ always() }}
        run: echo ${{ github.event.pull_request.number }} > PR_number.txt

      - name: Upload linting log file artifact
        if: ${{ always() }}
        uses: actions/upload-artifact@v2
        with:
          name: linting-logs
          path: |
            lint_log.txt
            lint_results.md
            PR_number.txt


================================================
FILE: .github/workflows/linting_comment.yml
================================================

name: nf-core linting comment
# This workflow is triggered after the linting action is complete
# It posts an automated comment to the PR, even if the PR is coming from a fork

on:
  workflow_run:
    workflows: ["nf-core linting"]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Download lint results
        uses: dawidd6/action-download-artifact@v2
        with:
          workflow: linting.yml

      - name: Get PR number
        id: pr_number
        run: echo "::set-output name=pr_number::$(cat linting-logs/PR_number.txt)"

      - name: Post PR comment
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          number: ${{ steps.pr_number.outputs.pr_number }}
          path: linting-logs/lint_results.md


================================================
FILE: .github/workflows/push_dockerhub_dev.yml
================================================
name: nf-core Docker push (dev)
# This builds the docker image and pushes it to DockerHub
# Runs on push events to the 'dev' branch (PR merges)
on:
  push:
    branches:
      - dev

jobs:
  push_dockerhub:
    name: Push new Docker image to Docker Hub (dev)
    runs-on: ubuntu-latest
    # Only run for the nf-core repo, for merged PRs
    if: ${{ github.repository == 'nf-core/eager' }}
    env:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_PASS: ${{ secrets.DOCKERHUB_PASS }}
    steps:
      - name: Check out pipeline code
        uses: actions/checkout@v2

      - name: Build new docker image
        run: docker build --no-cache . -t nfcore/eager:dev

      - name: Push Docker image to DockerHub (dev)
        run: |
          echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
          docker push nfcore/eager:dev


================================================
FILE: .github/workflows/push_dockerhub_release.yml
================================================
name: nf-core Docker push (release)
# This builds the docker image and pushes it to DockerHub
# Runs on published nf-core repo releases
on:
  release:
    types: [published]

jobs:
  push_dockerhub:
    name: Push new Docker image to Docker Hub (release)
    runs-on: ubuntu-latest
    # Only run for the nf-core repo, for releases
    if: ${{ github.repository == 'nf-core/eager' }}
    env:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_PASS: ${{ secrets.DOCKERHUB_PASS }}
    steps:
      - name: Check out pipeline code
        uses: actions/checkout@v2

      - name: Build new docker image
        run: docker build --no-cache . -t nfcore/eager:latest

      - name: Push Docker image to DockerHub (release)
        run: |
          echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
          docker push nfcore/eager:latest
          docker tag nfcore/eager:latest nfcore/eager:${{ github.event.release.tag_name }}
          docker push nfcore/eager:${{ github.event.release.tag_name }}


================================================
FILE: .github/yamllint.yml
================================================
rules:
  document-start: disable
  comments: disable
  truthy: disable
  line-length: disable
  empty-lines: disable
  

================================================
FILE: .gitignore
================================================
.nextflow*
work/
data/
results/
.DS_Store
tests/
testing/
testing*
*.pyc
main_playground.nf
.vscode
*.code-workspace
nf-params.json

================================================
FILE: .gitpod.yml
================================================
image: nfcore/gitpod:latest

vscode:
  extensions: # based on nf-core.nf-core-extensionpack
    - codezombiech.gitignore # Language support for .gitignore files
    # - cssho.vscode-svgviewer                 # SVG viewer
    - esbenp.prettier-vscode # Markdown/CommonMark linting and style checking for Visual Studio Code
    - eamodio.gitlens # Quickly glimpse into whom, why, and when a line or code block was changed
    - EditorConfig.EditorConfig # override user/workspace settings with settings found in .editorconfig files
    - Gruntfuggly.todo-tree # Display TODO and FIXME in a tree view in the activity bar
    - mechatroner.rainbow-csv # Highlight columns in csv files in different colors
    # - nextflow.nextflow                      # Nextflow syntax highlighting
    - oderwat.indent-rainbow # Highlight indentation level
    - streetsidesoftware.code-spell-checker # Spelling checker for source code


================================================
FILE: .nf-core-lint.yml
================================================
files_unchanged:
  - assets/multiqc_config.yaml
  - .github/CONTRIBUTING.md
  - .github/ISSUE_TEMPLATE/bug_report.md
  - docs/README.md
  - .github/workflows/linting.yml


================================================
FILE: CHANGELOG.md
================================================
# nf-core/eager: Changelog

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [2.5.3] - 2025-03-18

### `Added`

### `Fixed`

- [#1119](https://github.com/nf-core/eager/issues/1119) - Fix typo in variable of IndelRealigner step of UnifiedGenotyper when generating a targetIntervals file (♥ to @Dog13Golf for reporting, fix by @jfy133).

### `Dependencies`

### `Deprecated`

## [2.5.2] - 2024-06-28

### `Added`

- [#1079](https://github.com/nf-core/eager/issues/1079) - Added the `lanemerging` output directory in the output documentation (♥ to @TessaZei for reporting, fix by @TCLamnidis).

### `Fixed`

- [#1073](https://github.com/nf-core/eager/issues/1073) - Fixed post-adapterremoval trimmed FastQC results not being displayed in MultiQC (♥ to @kieren-j-mitchell for reporting, fix by @jfy133 and @TCLamnidis)

### `Dependencies`

### `Deprecated`

## [2.5.1] - 2024-02-21

### `Added`

- [#1037](https://github.com/nf-core/eager/issues/1037) Added an option to deactivate the `-sorted` option of bedtools coverage, in case the feature file is not sorted the same way as the fasta file, albeit with the caveat this will be very slow. (♥ Thanks to @IdoBar for reporting, and contributing.)

### `Fixed`

- [#1048](https://github.com/nf-core/eager/issues/1048) `--vcf2genome_outfile` parameter now gets prefixed by the sample_name and suffixed with `.fasta` (i.e. `<sample_name>_<vcf2genome_outfile>.fasta`). This ensures we avoid overwriting the output fasta of one sample with that of another when the option is provided. (♥ Thanks to @MeriamOs for reporting.)
- [#1047](https://github.com/nf-core/eager/issues/1047) Changed the row in which some statistics were reported in the General Stats table. The file name collision fix in 2.5.0 (see #1017) caused these statistics to be reported in the wrong row due to an added suffix.
- [#1051](https://github.com/nf-core/eager/issues/1051) An error is now thrown if input BAM files end in `.unmapped.bam`, as this breaks the bam filtering process and empties the bam files in the process. (♥ Thanks to @PCQuilis for reporting.)
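
The two naming rules described above (#1048 and #1051) can be sketched as follows. This is only an illustrative Python sketch, not the pipeline's actual Nextflow code: the function names `vcf2genome_fasta_name` and `check_input_bam_name`, the error message, and the example sample name are all hypothetical.

```python
def vcf2genome_fasta_name(sample_name: str, vcf2genome_outfile: str) -> str:
    """Build the per-sample name <sample_name>_<vcf2genome_outfile>.fasta.

    Prefixing with the sample name keeps one sample's output from
    overwriting another's when the same outfile value is supplied.
    """
    return f"{sample_name}_{vcf2genome_outfile}.fasta"


def check_input_bam_name(bam_path: str) -> None:
    """Raise an error for input BAMs whose names end in .unmapped.bam.

    Hypothetical stand-in for the check described in #1051.
    """
    if bam_path.endswith(".unmapped.bam"):
        raise ValueError(
            f"Input BAM '{bam_path}' ends in '.unmapped.bam'; please rename it."
        )


# Two samples sharing one outfile value still get distinct file names.
print(vcf2genome_fasta_name("sampleA", "consensus"))
print(vcf2genome_fasta_name("sampleB", "consensus"))
```

With a shared `--vcf2genome_outfile` value, the sample prefix is what keeps the output names distinct.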

### `Dependencies`

### `Deprecated`

## [2.5.0] - Bopfingen - 2023-11-03

### `Added`

- [#1020](https://github.com/nf-core/eager/issues/1020) Added mapDamage2 as an alternative for damage calculation.

### `Fixed`

- [#1017](https://github.com/nf-core/eager/issues/1017) Fixed file name collision in niche cases with multiple libraries of multiple UDG treatments.
- [#1024](https://github.com/nf-core/eager/issues/1024) `multiqc_general_stats.txt` is now generated even if the table is a beeswarm plot in the report.
- [#655](https://github.com/nf-core/eager/issues/655) Updated RG tags for all mappers. RG-id now includes Sample as well as Library ID. Added `LB:` tag with the library ID.
- [#1031](https://github.com/nf-core/eager/issues/1031) Always index fasta regardless of mapper. This ensures that DamageProfiler and genotyping processes get submitted when using bowtie2 and not providing a fasta index.

### `Dependencies`

- `multiqc`: 1.14 -> 1.16

### `Deprecated`

## [2.4.7] - 2023-05-16

### `Added`

### `Fixed`

- [#983](https://github.com/nf-core/eager/issues/983) Bump `pygments` version due to incompatibility with MultiQC dependencies (♥ to @MinLuke for reporting)

### `Dependencies`

- `pygments`: 2.9 -> 2.14
- `multiqc`: 1.13 -> 1.14

### `Deprecated`

## [2.4.6] - 2022-11-14

### `Added`

- [#933](https://github.com/nf-core/eager/issues/933) Added support for customising --seq-length in mapDamage rescaling (♥ to @ashildv for requesting)

### `Fixed`

- Changed `endorS.py` license from GPL to MIT (♥ to @aidaanva for fixing)
- Removed erroneous R2 in single-end example in input TSV of usage docs (♥ to @aidaanva for fixing)
- [#928](https://github.com/nf-core/eager/issues/928) Fixed read group incompatibility by re-adding picard AddOrReplaceReadGroups for MultiVCFAnalyzer (♥ to @aidaanva, @meganemichel for reporting)
- Fixed edge case of DamageProfiler occasionally requiring FASTA index (♥ to @asmaa-a-abdelwahab for reporting)
- [#834](https://github.com/nf-core/eager/issues/834) Increased significance values in general stats table for Qualimap mean/median coverages (♥ to @neija2611 for reporting)
- Fixed parameter documentation for `--snpcapture_bed` regarding on-target SNP stats to state that these stats are currently not displayed in MultiQC, only in the Qualimap results (♥ to @meganemichel and @TCLamnidis for reporting)
- [#934](https://github.com/nf-core/eager/issues/934) Fixed broken parameter setting in mapDamage2 rescale length (♥ to @ashildv for reporting)

### `Dependencies`

- Updated MultiQC to official 1.13 version (rather than alpha)
- Added pinned MALT dependency to ensure working version in future versions of eager

### `Deprecated`

## [2.4.5] - 2022-08-02

### `Added`

### `Fixed`

- [#882](https://github.com/nf-core/eager/pull/882) Define DSL1 execution explicitly, as new versions of Nextflow made DSL2 the default (♥ to & fix from @Lehmann-Fabian)
- [#879](https://github.com/nf-core/eager/issues/879) Add missing threads parameter for pre-clipping FastQC for single end data that caused insufficient memory in some cases (♥ to @marcel-keller for reporting)
- [#880](https://github.com/nf-core/eager/issues/880) Fix failure of endorSpy to be cached or reexecuted on resume (♥ to @KathrinNaegele, @TCLamnidis, & @mahesh-panchal for reporting and debugging)
- [#885](https://github.com/nf-core/eager/issues/885) Specify task memory for all tools in get_software_versions to account for incompatibility of java with some SGE clusters causing hanging of the process (♥ to @maxibor for reporting)
- [#887](https://github.com/nf-core/eager/issues/887) Clarify what is considered 'ultra-short' reads in the help text of clip_readlength, for when you may wish to turn off length filtering during AdapterRemoval (♥ to @TCLamnidis for reporting)
- [#889](https://github.com/nf-core/eager/issues/889) Remove/update parameters from benchmarking test profiles (♥ to @TCLamnidis for reporting)
- [#895](https://github.com/nf-core/eager/issues/895) Output documentation typo fix and added location of output docs in pipeline summary (♥ to @RodrigoBarquera for reporting)
- [#897](https://github.com/nf-core/eager/issues/897) Fix pipeline crash if no Kraken2 results generated (♥ to @alexandregilardet for reporting)
- [#899](https://github.com/nf-core/eager/issues/899) Fix pipeline crash for circulargenerator if reference file does not end in .fasta (♥ to @scarlhoff for reporting)
- Fixed some missing default values in the nextflow parameter schema JSON
- [#789](https://github.com/nf-core/eager/issues/789) Substantial speed and memory optimisation of the `extract_map_reads.py` script (♥ to @ivelsko for reporting, @maxibor for optimisation)
- Fix staging of input bams for genotyping_pileupcaller process. This was a downstream effect of changes introduced when fixing endorSpy caching.
- Made slight correction on metro map diagram regarding input data to SexDeterrmine (only BAM trimming output files)

### `Dependencies`

- Updated MultiQC to latest stable alpha version on bioconda, correcting the previously nonsensical AdapterRemoval plots (♥ to @NiemannJ for fixing in MultiQC)

### `Deprecated`

## [2.4.4] - 2022-04-08

### `Added`

### `Fixed`

- Fixed some auxiliary files (adapter list, snpcapture/pileupcaller/sexdeterrmine BED files, and pileupCaller SNP file, PMD reference mask) in some cases only being used against one sample (❤ to @meganemichel for reporting, fix by @jfy133)

### `Dependencies`

### `Deprecated`

## [2.4.3] - 2022-03-24

### `Added`

### `Fixed`

- [#828](https://github.com/nf-core/eager/issues/828) Improved error message if required metagenomic screening parameters not set correctly
- [#836](https://github.com/nf-core/eager/issues/836) Remove deprecated parameters from test profiles
- [#838](https://github.com/nf-core/eager/issues/838) Fix --snpcapture_bed files not being picked up by Nextflow (❤ to @meganemichel for reporting)
- [#843](https://github.com/nf-core/eager/issues/843) Re-add direct piping of AdapterRemovalFixPrefix to pigz
- [#844](https://github.com/nf-core/eager/issues/844) Fixed reference masking prior to pmdtools
- [#845](https://github.com/nf-core/eager/issues/845) Updates parameter documentation to specify `-s` preseq parameter also applies to lc_extrap
- [#851](https://github.com/nf-core/eager/issues/851) Fixes a file-name clash during additional_library_merge, post-BAM trimming of different UDG treated libraries of a sample (❤ to @alexandregilardet for reporting)
- Renamed a range of MultiQC general stats table headers to improve clarity, documentation has been updated accordingly
- [#857](https://github.com/nf-core/eager/issues/857) Corrected samtools fastq flag to _retain_ read-pair information when converting off-target BAM files to fastq in paired-end mapping (❤ to @alexhbnr for reporting)
- [#866](https://github.com/nf-core/eager/issues/866) Fixed a typo in the indexing step of BWA mem when not-collapsing (❤ to @alexhbnr for reporting)
- Corrected tutorials to reflect updated BAM trimming flags (❤ to @marcel-keller for reporting and correcting)

### `Dependencies`

- [#829](https://github.com/nf-core/eager/issues/829) Bumped sequencetools: 1.4.0.5 -> 1.5.2
- Bumped MultiQC: 1.11 -> 1.12 (for run-time optimisation and tool citation information)

### `Deprecated`

## [2.4.2] - 2022-01-24

### `Added`

### `Fixed`

- [#824](https://github.com/nf-core/eager/issues/824) Fixes large memory footprint of bedtools coverage calculation.
- [#822](https://github.com/nf-core/eager/issues/822) Fixed post-adapterremoval trimmed files not being lane-merged and included in downstream analyses
- Fixed a couple of software version reporting commands

### `Dependencies`

### `Deprecated`

## [2.4.1] - 2021-11-30

### `Added`

- [#805](https://github.com/nf-core/eager/issues/805) Changes to bam_trim options to allow flexible trimming by library strandedness (in addition to UDG treatment). (@TCLamnidis)
- [#808](https://github.com/nf-core/eager/issues/808) Retain read group information across bam merges. Sample set to sample name (rather than library name) in bwa output 'RG' readgroup tag. (@TCLamnidis)
- Map and base quality filters prior to genotyping with pileupcaller can now be specified. (@TCLamnidis)
- [#774](https://github.com/nf-core/eager/issues/774) Added support for multi-threaded Bowtie2 build reference genome indexing (@jfy133)
- [#804](https://github.com/nf-core/eager/issues/804) Improved output documentation description to add how 'cluster factor' is calculated (thanks to @meganemichel)

### `Fixed`

- [#803](https://github.com/nf-core/eager/issues/803) Fixed mistake in metro-map diagram (`samtools index` is now correctly `samtools faidx`) (@jfy133)

### `Dependencies`

### `Deprecated`

## [2.4.0] - Wangen - 2021-09-14

### `Added`

- [#317](https://github.com/nf-core/eager/issues/317) Added bcftools stats for general genotyping statistics of VCF files
- [#651](https://github.com/nf-core/eager/issues/651) - Adds removal of adapters specified in an AdapterRemoval adapter list file
- [#642](https://github.com/nf-core/eager/issues/642) and [#431](https://github.com/nf-core/eager/issues/431) adds post-adapter removal barcode/fastq trimming
- [#769](https://github.com/nf-core/eager/issues/769) - Adds lc_extrap mode to preseq (suggested by @roberta-davidson)

### `Fixed`

- Fixed some missing or incorrectly reported software versions
- [#771](https://github.com/nf-core/eager/issues/771) Remove legacy code
- Improved output documentation for MultiQC general stats table (thanks to @KathrinNaegele and @esalmela)
- Improved output documentation for BowTie2 (thanks to @isinaltinkaya)
- [#612](https://github.com/nf-core/eager/issues/612) Updated BAM trimming defaults to 0 to ensure no unwanted trimming when mixing half-UDG with no-UDG (thanks to @scarlhoff)
- [#722](https://github.com/nf-core/eager/issues/722) Updated BWA mapping parameters to latest recommendations - primarily alnn back to 0.01 and alno to 2 as per Oliva et al. 2021 (10.1093/bib/bbab076)
- Updated workflow diagrams to reflect latest functionality
- [#787](https://github.com/nf-core/eager/issues/787) Adds memory specification flags for the GATK UnifiedGenotyper and HaplotypeCaller steps (thanks to @nylander)
- Fixed issue where MultiVCFAnalyzer would not pick up newly generated VCF files, when specifying additional VCF files.
- [#790](https://github.com/nf-core/eager/issues/790) Fixed kraken2 report file-name collision when sample names have `.` in them
- [#792](https://github.com/nf-core/eager/issues/792) Fixed java error messages for AdapterRemovalFixPrefix being hidden in output
- [#794](https://github.com/nf-core/eager/issues/794) Aligned default test profile with nf-core standards (`test_tsv` is now `test`)
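
The Kraken2 file-name collision fixed in #790 above can be illustrated with a small Python sketch. This is not the pipeline's code: it assumes, for illustration only, that a sample name was derived by cutting the report file name at the first `.`, and the helper names and the `.kraken2_report` suffix are hypothetical.

```python
# Hypothetical illustration only; not taken from nf-core/eager.
SUFFIX = ".kraken2_report"  # assumed report suffix


def name_by_first_dot(filename: str) -> str:
    """Naive approach: keep everything before the first '.'."""
    return filename.split(".", 1)[0]


def name_by_suffix(filename: str) -> str:
    """Safer approach: strip only the known suffix, keeping dots in the name."""
    return filename[: -len(SUFFIX)] if filename.endswith(SUFFIX) else filename


reports = ["site.A" + SUFFIX, "site.B" + SUFFIX]

# Both samples collapse to the same name, so their outputs would overwrite each other.
naive = [name_by_first_dot(r) for r in reports]

# Stripping only the suffix keeps the two samples distinct.
safe = [name_by_suffix(r) for r in reports]
```

Under these assumptions, two samples named `site.A` and `site.B` collide with the naive approach but stay distinct when only the known suffix is removed.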

### `Dependencies`

- Bumped python: 3.7.3 -> 3.9.4
- Bumped markdown: 3.2.2 -> 3.3.4
- Bumped pymdown-extensions: 7.1 -> 8.2
- Bumped pygments: 2.6.1 -> 2.9.0
- Bumped adapterremoval: 2.3.1 -> 2.3.2
- Bumped picard: 2.22.9 -> 2.26.0
- Bumped samtools 1.9 -> 1.12
- Bumped angsd: 0.933 -> 0.935
- Bumped gatk4: 4.1.7.0 -> 4.2.0.0
- Bumped multiqc: 1.10.1 -> 1.11
- Bumped bedtools 2.29.2 -> 2.30.0
- Bumped libiconv: 1.15 -> 1.16
- Bumped preseq: 2.0.3 -> 3.1.2
- Bumped bamutil: 1.0.14 -> 1.0.15
- Bumped pysam: 0.15.4 -> 0.16.0
- Bumped kraken2: 2.1.1 -> 2.1.2
- Bumped pandas: 1.0.4 -> 1.2.4
- Bumped freebayes: 1.3.2 -> 1.3.5
- Bumped biopython: 1.76 -> 1.79
- Bumped xopen: 0.9.0 -> 1.1.0
- Bumped bowtie2: 2.4.2 -> 2.4.4
- Bumped mapdamage2: 2.2.0 -> 2.2.1
- Bumped bbmap: 38.87 -> 38.92
- Added bcftools: 1.12

### `Deprecated`

## [2.3.5] - 2021-06-03

### `Added`

- [#722](https://github.com/nf-core/eager/issues/722) - Adds bwa `-o` flag for more flexibility in bwa parameters
- [#736](https://github.com/nf-core/eager/issues/736) - Add printing of multiqc run report location on successful completion
- New logo that is more visible when a user is using darkmode on GitHub or nf-core website!

### `Fixed`

- [#723](https://github.com/nf-core/eager/issues/723) - Fixes empty fields in TSV resulting in uninformative error
- Updated template to nf-core/tools 1.14
- [#688](https://github.com/nf-core/eager/issues/688) - Clarified the pipeline is not just for humans and microbes, but also plants and animals, and also for modern DNA
- [#751](https://github.com/nf-core/eager/pull/751) - Added missing label to mtnucratio
- General code cleanup and standardisation of parameters with no default setting
- [#750](https://github.com/nf-core/eager/issues/750) - Fixed piped commands requesting the same number of CPUs at each command step
- [#757](https://github.com/nf-core/eager/issues/757) - Removed confusing 'Data Type' variable from MultiQC workflow summary (not consistent with TSV input)
- [#759](https://github.com/nf-core/eager/pull/759) - Fixed malformed software scraping regex that resulted in N/A in MultiQC report
- [#761](https://github.com/nf-core/eager/pull/759) - Fixed issues related to instability of samtools filtering related CI tests

### `Dependencies`

### `Deprecated`

## [2.3.4] - 2021-05-05

### `Added`

- [#729](https://github.com/nf-core/eager/issues/729) - Added Bowtie2 flag `--maxins` for paired-end mapping in modern DNA contexts

### `Fixed`

- Corrected explanation of the "--min_adap_overlap" parameter for AdapterRemoval in the docs
- [#725](https://github.com/nf-core/eager/pull/725) - `bwa_index` doc update
- Re-adds gzip piping to AdapterRemovalFixPrefix to speed up process after reports of being very slow
- Updated DamageProfiler citation from bioRxiv to publication

### `Dependencies`

- Removed pinning of `tbb` (upstream bug in bioconda fixed)
- Bumped `pigz` to 2.6 to fix rare stall bug when compressing data after AdapterRemoval
- Bumped Bowtie2 to 2.4.2 to fix issues with `tbb` version

### `Deprecated`

## [2.3.3] - 2021-04-08

### `Added`

- [#349](https://github.com/nf-core/eager/issues/349) - Added option enabling platypus formatted output of pmdtools misincorporation frequencies.

### `Fixed`

- [#719](https://github.com/nf-core/eager/pull/719) - Fix filename for bam output of `mapdamage_rescaling`
- [#707](https://github.com/nf-core/eager/pull/707) - Fix typo in UnifiedGenotyper IndelRealigner command
- Fixed some Java tools not following process memory specifications
- Updated template to nf-core/tools 1.13.2
- [#711](https://github.com/nf-core/eager/pull/711) - Fix conditional execution preventing MultiVCFAnalyzer from running
- [#714](https://github.com/nf-core/eager/issues/714) - Fixes bug in nuc contamination by upgrading to latest MultiQC v1.10.1 bugfix release

### `Dependencies`

### `Deprecated`

## [2.3.2] - 2021-03-16

### `Added`

- [#687](https://github.com/nf-core/eager/pull/687) - Adds Kraken2 unique kmer counting report
- [#676](https://github.com/nf-core/eager/issues/676) - Refactor help message / summary message formatting to automatic versions using nf-core library
- [#682](https://github.com/nf-core/eager/issues/682) - Add AdapterRemoval `--qualitymax` flag to allow FASTQ Phred score range max more than 41

### `Fixed`

- [#666](https://github.com/nf-core/eager/issues/666) - Fixed input file staging for `print_nuclear_contamination`
- [#631](https://github.com/nf-core/eager/issues/631) - Update minimum Nextflow version to 20.07.1, due to unfortunate bug in Nextflow 20.04.1 causing eager to crash if patch pulled
- Made MultiQC crash behaviour stricter when dealing with large datasets, as reported by @ashildv
- [#652](https://github.com/nf-core/eager/issues/652) - Added note to documentation that when using `--skip_collapse` this will use _paired-end_ alignment mode with mappers when using PE data
- [#626](https://github.com/nf-core/eager/issues/626) - Add additional checks to ensure pipeline will give useful error if cells of a TSV column are empty
- [#673](https://github.com/nf-core/eager/pull/673) - Fix Kraken database loading when loading from directory instead of compressed file
- [#688](https://github.com/nf-core/eager/issues/668) - Allow pipeline to complete, even if Qualimap crashes due to an empty or corrupt BAM file for one sample/library
- [#683](https://github.com/nf-core/eager/pull/683) - Sets `--igenomes_ignore` to true by default, as rarely used by users currently and makes resolving configs less complex
- Added exit code `140` to re-tryable exit code list to account for certain scheduler wall-time limit fails
- [#672](https://github.com/nf-core/eager/issues/672) - Removed java parameter from picard tools which could cause memory issues
- [#679](https://github.com/nf-core/eager/issues/679) - Refactor within-process bash conditions to groovy/nextflow, due to incompatibility with some server environments
- [#690](https://github.com/nf-core/eager/pull/690) - Fixed ANGSD output mode for beagle by setting `-doMajorMinor 1` as default in that case
- [#693](https://github.com/nf-core/eager/issues/693) - Fixed broken TSV input validation for the Colour Chemistry column
- [#695](https://github.com/nf-core/eager/issues/695) - Fixed incorrect `-profile` order in tutorials (originally written reversed due to [nextflow bug](https://github.com/nextflow-io/nextflow/issues/1792))
- [#653](https://github.com/nf-core/eager/issues/653) - Fixed file collision errors with sexdeterrmine for two same-named libraries with different strandedness

### `Dependencies`

- Bumped MultiQC to 1.10 for improved functionality
- Bumped HOPS to 0.35 for MultiQC 1.10 compatibility

### `Deprecated`

## [2.3.1] - 2021-01-14

### `Added`

### `Fixed`

- [#654](https://github.com/nf-core/eager/issues/654) - Fixed some values in JSON schema (used in launch GUI) not passing validation checks during run
- [#655](https://github.com/nf-core/eager/issues/655) - Updated read groups for all mappers to allow proper GATK validation
- Fixed issue with Docker container not being pullable by Nextflow due to version-number inconsistencies

### `Dependencies`

### `Deprecated`

## [2.3.0] - Aalen - 2021-01-11

### `Added`

- [#640](https://github.com/nf-core/eager/issues/640) - Added a pre-metagenomic screening filtering of low-sequence complexity reads with `bbduk`
- [#583](https://github.com/nf-core/eager/issues/583) - Added `mapDamage2` rescaling of BAM files to remove damage
- Updated usage (merging files) and workflow images reflecting new functionality.

### `Fixed`

- Removed leftover old DockerHub push CI commands.
- [#627](https://github.com/nf-core/eager/issues/627) - Added de Barros Damgaard citation to README
- [#630](https://github.com/nf-core/eager/pull/630) - Better handling of Qualimap memory requirements and error strategy.
- Fixed some incomplete schema options to ensure users supply valid input values
- [#638](https://github.com/nf-core/eager/issues/638#issuecomment-748877567) Fixed inverted circularfilter filtering (previously filtering would happen by default, not when requested by user as originally recorded in documentation)
- [DeDup:](https://github.com/apeltzer/DeDup/commit/07d47868f10a6830da8c9161caa3755d9da155bf) Fixed Null Pointer Bug in DeDup by updating to 0.12.8 version
- [#650](https://github.com/nf-core/eager/pull/650) - Increased memory given to FastQC for larger files by making it multithreaded

### `Dependencies`

- Update: DeDup v0.12.7 to v0.12.8

### `Deprecated`

## [2.2.2] - 2020-12-09

### `Added`

- Added large scale 'stress-test' profile for AWS (using de Barros Damgaard et al. 2018's 137 ancient human genomes).
  - This will now be run automatically for every release. All processed data will be available on the nf-core website: <https://nf-co.re/eager/results>
    - You can run this yourself using `-profile test_full`

### `Fixed`

- Fixed AWS full test profile.
- [#587](https://github.com/nf-core/eager/issues/587) - Re-implemented AdapterRemovalFixPrefix for DeDup compatibility of including singletons
- [#602](https://github.com/nf-core/eager/issues/602) - Added the newly available GATK 3.5 conda package.
- [#610](https://github.com/nf-core/eager/issues/610) - Create bwa_index channel when specifying circularmapper as mapper
- Updated template to nf-core/tools 1.12.1
- General documentation improvements

### `Deprecated`

- Flag `--gatk_ug_jar` has now been removed as GATK 3.5 is now available within the nf-core/eager software environment.

## [2.2.1] - 2020-10-20

### `Fixed`

- [#591](https://github.com/nf-core/eager/issues/591) - Fixed offset underlines in lane merging diagram in docs
- [#592](https://github.com/nf-core/eager/issues/592) - Fixed issue where supplying Bowtie2 index reported missing bwamem_index error
- [#590](https://github.com/nf-core/eager/issues/592) - Removed redundant dockstore.yml from root
- [#596](https://github.com/nf-core/eager/issues/596) - Add workaround for issue regarding gzipped FASTAs and pre-built indices
- [#589](https://github.com/nf-core/eager/issues/582) - Updated template to nf-core/tools 1.11
- [#582](https://github.com/nf-core/eager/issues/582) - Clarify memory limit issue on FAQ

## [2.2.0] - Ulm - 2020-10-20

### `Added`

- **Major** Automated cloud tests with large-scale data on [AWS](https://aws.amazon.com/)
- **Major** Re-wrote input logic to accept a TSV 'map' file in addition to direct paths to FASTQ files
- **Major** Added JSON Schema, enabling web GUI for configuration of pipeline available [here](https://nf-co.re/launch?pipeline=eager&release=2.2.0)
- **Major** Lane and library merging implemented
  - When using TSV input, one library with multiple _lanes_ will be merged together, before mapping
  - Strip FASTQ will also produce a lane merged 'raw' but 'stripped' FASTQ file
  - When using TSV input, one sample with multiple (same treatment) libraries will be merged together
  - Important: direct FASTQ paths will not have this functionality. TSV is required.
- [#40](https://github.com/nf-core/eager/issues/40) - Added the pileupCaller genotyper from [sequenceTools](https://github.com/stschiff/sequenceTools)
- Added validation check and clearer error message when `--fasta_index` is provided and filepath does not end in `.fai`.
- Improved error messages
- Added ability for automated emails using `mailutils` to also send MultiQC reports
- General documentation additions, cleaning, and updated figures with CC-BY license
- Added large 'full size' dataset test-profiles for ancient fish and human contexts
- [#257](https://github.com/nf-core/eager/issues/257) - Added the bowtie2 aligner as option for mapping, following Poullet and Orlando 2020 doi: [10.3389/fevo.2020.00105](https://doi.org/10.3389/fevo.2020.00105)
- [#451](https://github.com/nf-core/eager/issues/451) - Adds ANGSD genotype likelihood calculations as an alternative to typical 'genotypers'
- [#566](https://github.com/nf-core/eager/issues/466) - Add tutorials on how to set up nf-core/eager for different contexts
- Nuclear contamination results are now shown in the MultiQC report
- Tutorial on how to use profiles for reproducible science (i.e. parameter sharing between different groups)
- [#522](https://github.com/nf-core/eager/issues/522) - Added post-mapping length filter to assist in more realistic endogenous DNA calculations
- [#512](https://github.com/nf-core/eager/issues/512) - Added flexible trimming of BAMs by library type. 'half' and 'none' UDG libraries can now be trimmed differentially within a single eager run.
- Added a `.dockstore.yml` config file for automatic workflow registration with [dockstore.org](https://dockstore.org/)
- Updated template to nf-core/tools 1.10.2
- [#544](https://github.com/nf-core/eager/pull/544) - Add script to perform bam filtering on fragment length
- [#456](https://github.com/nf-core/eager/pull/546) - Bumps the base (default) runtime of all processes to 4 hours, and set shorter time limits for test profiles (1 hour)
- [#552](https://github.com/nf-core/eager/issues/552) - Adds optional creation of MALT SAM files alongside RMA6 files
- Added eigenstrat snp coverage statistics to MultiQC report. Process results are published in `genotyping/*_eigenstrat_coverage.txt`.

### `Fixed`

- [#368](https://github.com/nf-core/eager/issues/368) - Fixed the profile `test` to contain a parameter for `--paired_end`
- Mini bugfix for typo in line 1260+1261
- [#374](https://github.com/nf-core/eager/issues/374) - Fixed output documentation rendering not containing images
- [#379](https://github.com/nf-core/eager/issues/378) - Fixed insufficient memory requirements for FASTQC edge case
- [#390](https://github.com/nf-core/eager/issues/390) - Renamed clipped/merged output directory to be more descriptive
- [#398](https://github.com/nf-core/eager/issues/498) - Stopped incompatible FASTA indexes being accepted
- [#400](https://github.com/nf-core/eager/issues/400) - Set correct recommended bwa mapping parameters from [Schubert et al. 2012](https://doi.org/10.1186/1471-2164-13-178)
- [#410](https://github.com/nf-core/eager/issues/410) - Fixed nf-core/configs not being loaded properly
- [#473](https://github.com/nf-core/eager/issues/473) - Fixed bug in sexdet_process on AWS
- [#444](https://github.com/nf-core/eager/issues/444) - Provide option for preserving realigned bam + index
- Fixed deduplication output logic. Will now pass along only the post-rmdup bams if duplicate removal is not skipped, instead of both the post-rmdup and pre-rmdup bams
- [#497](https://github.com/nf-core/eager/issues/497) - Simplifies number of parameters required to run bam filtering
- [#501](https://github.com/nf-core/eager/issues/501) - Adds additional validation checks for MALT/MaltExtract database input files
- [#508](https://github.com/nf-core/eager/issues/508) - Made Markduplicates default dedupper due to narrower context specificity of dedup
- [#516](https://github.com/nf-core/eager/issues/516) - Made bedtools not report out of memory exit code when warning of inconsistent FASTA/Bed entry names
- [#504](https://github.com/nf-core/eager/issues/504) - Removed uninformative sexdeterrmine-snps plot from MultiQC report.
- Nuclear contamination is now reported with the correct library names.
- [#531](https://github.com/nf-core/eager/pull/531) - Renamed 'FASTQ stripping' to 'host removal'
- Merged all tutorials and FAQs into `usage.md` for display on [nf-co.re](https://www.nf-co.re)
- Corrected header of nuclear contamination table (`nuclear_contamination.txt`).
- Fixed a bug with `nSNPs` definition in `print_x_contamination.py`. Number of SNPs now correctly reported
- `print_x_contamination.py` now correctly converts all NA values to "N/A"
- Increased the amount of memory MultiQC uses by default, to account for very large nf-core/eager runs (e.g. >1000 samples)

### `Dependencies`

- Added sequenceTools (1.4.0.6) that adds the ability to do genotyping with the 'pileupCaller'
- Latest version of DeDup (0.12.6) which now reports mapped reads after deduplication
- [#560](https://github.com/nf-core/eager/issues/560) Latest version of Dedup (0.12.7), which now correctly reports deduplication statistics based on calculations of mapped reads only (prior denominator was total reads of BAM file)
- Latest version of ANGSD (0.933) which doesn't seg fault when running contamination on BAMs with insufficient reads
- Latest version of MultiQC (1.9) with support for lots of extra tools in the pipeline (MALT, SexDetERRmine, DamageProfiler, MultiVCFAnalyzer)
- Latest versions of Pygments (2.6.1), Pymdown-Extensions (7.1) and Markdown (3.2.2) for documentation output
- Latest version of Picard (2.22.9)
- Latest version of GATK4 (4.1.7.0)
- Latest version of sequenceTools (1.4.0.6)
- Latest version of fastP (0.20.1)
- Latest version of Kraken2 (2.0.9beta)
- Latest version of FreeBayes (1.3.2)
- Latest version of xopen (0.9.0)
- Added Bowtie 2 (2.4.1)
- Latest version of Sex.DetERRmine (1.1.2)
- Latest version of endorS.py (0.4)
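
The DeDup 0.12.7 entry above notes a change of denominator from total reads to mapped reads. A minimal worked example, using made-up read counts and a hypothetical `duplication_fraction` helper (not DeDup's own code or output format), shows how much the reported value can shift:

```python
# Hypothetical illustration with invented numbers; not DeDup's implementation.
def duplication_fraction(duplicates_removed: int, denominator: int) -> float:
    """Fraction of reads removed as duplicates, relative to the chosen denominator."""
    return duplicates_removed / denominator


total_reads = 1_000_000      # all reads in the BAM file (mapped + unmapped)
mapped_reads = 200_000       # reads that aligned to the reference
duplicates_removed = 50_000  # duplicates can only be found among mapped reads

# Prior behaviour: denominator is the total number of reads in the BAM file.
old_fraction = duplication_fraction(duplicates_removed, total_reads)

# 0.12.7 behaviour as described above: denominator is mapped reads only.
new_fraction = duplication_fraction(duplicates_removed, mapped_reads)

print(f"total-read denominator:  {old_fraction:.2%}")
print(f"mapped-read denominator: {new_fraction:.2%}")
```

With a low proportion of mapped reads, as is common for ancient DNA libraries, dividing by total reads understates the duplication level among the reads that were actually deduplicated.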

## [2.1.0] - Ravensburg - 2020-03-05

### `Added`

- Added support for automated tests using [GitHub Actions](https://github.com/features/actions), replacing Travis CI
- [#40](https://github.com/nf-core/eager/issues/40), [#231](https://github.com/nf-core/eager/issues/231) - Added genotyping capability through GATK UnifiedGenotyper (v3.5), GATK HaplotypeCaller (v4.1) and FreeBayes
- Added MultiVCFAnalyzer module
- [#240](https://github.com/nf-core/eager/issues/240) - Added human sex determination module
- [#226](https://github.com/nf-core/eager/issues/226) - Added `--preserve5p` function for AdapterRemoval
- [#212](https://github.com/nf-core/eager/issues/212) - Added ability to use only merged reads downstream from AdapterRemoval
- [#265](https://github.com/nf-core/eager/issues/265) - Adjusted full markdown linting in Travis CI
- [#247](https://github.com/nf-core/eager/issues/247) - Added nuclear contamination with angsd
- [#258](https://github.com/nf-core/eager/issues/258) - Added ability to report bedtools stats to features (e.g. depth/breadth of annotated genes)
- [#249](https://github.com/nf-core/eager/issues/249) - Added metagenomic classification of unmapped reads with MALT and aDNA authentication with MaltExtract
- [#302](https://github.com/nf-core/eager/issues/302) - Added mitochondrial to nuclear ratio calculation
- [#302](https://github.com/nf-core/eager/issues/302) - Added VCF2Genome for consensus sequence generation
- Fancy new logo from [ZandraFagernas](https://github.com/ZandraFagernas)
- [#286](https://github.com/nf-core/eager/issues/286) - Adds pipeline-specific profiles (loaded from nf-core configs)
- [#310](https://github.com/nf-core/eager/issues/310) - Generalises base.config
- [#326](https://github.com/nf-core/eager/pull/326) - Add Biopython and [xopen](https://github.com/marcelm/xopen/) dependencies
- [#336](https://github.com/nf-core/eager/issues/336) - Change default Y-axis maximum value of DamageProfiler to 30% to match popular (but slower) mapDamage, and allow user to set their own value.
- [#352](https://github.com/nf-core/eager/pull/352) - Add social preview image
- [#355](https://github.com/nf-core/eager/pull/355) - Add Kraken2 metagenomics classifier
- [#90](https://github.com/nf-core/eager/issues/90) - Added endogenous DNA calculator (original repository: [https://github.com/aidaanva/endorS.py/](https://github.com/aidaanva/endorS.py/))

### `Fixed`

- [#227](https://github.com/nf-core/eager/issues/227) - Large re-write of input/output process logic to allow maximum flexibility. Originally to address [#227](https://github.com/nf-core/eager/issues/227), but further expanded
- Fixed Travis-Ci.org to Travis-Ci.com migration issues
- [#266](https://github.com/nf-core/eager/issues/266) - Added sanity checks for input filetypes (i.e. only BAM files can be supplied if `--bam`)
- [#237](https://github.com/nf-core/eager/issues/237) - Fixed and Updated script scrape_software_versions
- [#322](https://github.com/nf-core/eager/pull/322) - Move extract map reads fastq compression to pigz
- [#327](https://github.com/nf-core/eager/pull/327) - Speed up strip_input_fastq process and make it more robust
- [#342](https://github.com/nf-core/eager/pull/342) - Updated to match nf-core tools 1.8 linting guidelines
- [#339](https://github.com/nf-core/eager/issues/339) - Converted unnecessary zcat + gzip to just cat for a performance boost
- [#344](https://github.com/nf-core/eager/issues/344) - Fixed pipeline still trying to run when using old nextflow version

### `Dependencies`

- adapterremoval=2.2.2 upgraded to 2.3.1
- adapterremovalfixprefix=0.0.4 upgraded to 0.0.5
- damageprofiler=0.4.3 upgraded to 0.4.9
- angsd=0.923 upgraded to 0.931
- gatk4=4.1.2.0 upgraded to 4.1.4.1
- mtnucratio=0.5 upgraded to 0.6
- conda-forge::markdown=3.1.1 upgraded to 3.2.1
- bioconda::fastqc=0.11.8 upgraded to 0.11.9
- bioconda::picard=2.21.4 upgraded to 2.22.0
- bioconda::bedtools=2.29.0 upgraded to 2.29.2
- pysam=0.15.3 upgraded to 0.15.4
- conda-forge::pandas=1.0.0 upgraded to 1.0.1
- bioconda::freebayes=1.3.1 upgraded to 1.3.2
- conda-forge::biopython=1.75 upgraded to 1.76

## [2.0.7] - 2019-06-10

### `Added`

- [#189](https://github.com/nf-core/eager/pull/189) - Outputting unmapped reads in a FASTQ file with the `--strip_input_fastq` flag
- [#186](https://github.com/nf-core/eager/pull/186) - Make FastQC skipping [possible](https://github.com/nf-core/eager/issues/182)
- Merged in [nf-core/tools](https://github.com/nf-core/tools) release V1.6 template changes
- A lot more automated tests using Travis CI
- Don't ignore DamageProfiler errors any more
- [#220](https://github.com/nf-core/eager/pull/220) - Added post-mapping filtering statistics module and corresponding MultiQC statistics [#217](https://github.com/nf-core/eager/issues/217)

### `Fixed`

- [#152](https://github.com/nf-core/eager/pull/152) - DamageProfiler errors [won't crash entire pipeline any more](https://github.com/nf-core/eager/issues/171)
- [#176](https://github.com/nf-core/eager/pull/176) - Increase runtime for DamageProfiler on [large reference genomes](https://github.com/nf-core/eager/issues/173)
- [#172](https://github.com/nf-core/eager/pull/152) - DamageProfiler errors [won't crash entire pipeline any more](https://github.com/nf-core/eager/issues/171)
- [#174](https://github.com/nf-core/eager/pull/190) - Publish DeDup files [properly](https://github.com/nf-core/eager/issues/183)
- [#196](https://github.com/nf-core/eager/pull/196) - Fix reference [issues](https://github.com/nf-core/eager/issues/150)
- [#196](https://github.com/nf-core/eager/pull/196) - Fix issues with PE data being mapped incompletely
- [#200](https://github.com/nf-core/eager/pull/200) - Fix minor issue with some [typos](https://github.com/nf-core/eager/pull/196)
- [#210](https://github.com/nf-core/eager/pull/210) - Fix PMDTools [encoding issue](https://github.com/pontussk/PMDtools/issues/6) from `samtools calmd` generated files by running through `samtools view` first
- [#221](https://github.com/nf-core/eager/pull/221) - Fix BWA Index [not being reused by multiple samples](https://github.com/nf-core/eager/issues/219)

### `Dependencies`

- Added DeDup v0.12.5 (json support)
- Added mtnucratio v0.5 (json support)
- Updated Picard 2.18.27 -> 2.20.2
- Updated GATK 4.1.0.0 -> 4.1.2.0
- Updated damageprofiler 0.4.4 -> 0.4.5
- Updated r-rmarkdown 1.11 -> 1.12
- Updated fastp 0.19.7 -> 0.20.0
- Updated qualimap 2.2.2b -> 2.2.2c

## [2.0.6] - 2019-03-05

### `Added`

- [#152](https://github.com/nf-core/eager/pull/152) - Clarified `--complexity_filter` flag to be specifically for poly G trimming.
- [#155](https://github.com/nf-core/eager/pull/155) - Added [Dedup log to output folders](https://github.com/nf-core/eager/issues/154)
- [#159](https://github.com/nf-core/eager/pull/159) - Added possibility to skip AdapterRemoval, skip merging, and skip trimming, fixing [#64](https://github.com/nf-core/eager/issues/64), [#137](https://github.com/nf-core/eager/issues/137) - thanks to @maxibor, @jfy133

### `Fixed`

- [#151](https://github.com/nf-core/eager/pull/151) - Fixed [post-deduplication step errors](https://github.com/nf-core/eager/issues/128)
- [#147](https://github.com/nf-core/eager/pull/147) - Fix Samtools Index for [large references](https://github.com/nf-core/eager/issues/146)
- [#145](https://github.com/nf-core/eager/pull/145) - Added Picard Memory Handling [fix](https://github.com/nf-core/eager/issues/144)

### `Dependencies`

- Picard Tools 2.18.23 -> 2.18.27
- GATK 4.0.12.0 -> 4.1.0.0
- FastP 0.19.6 -> 0.19.7

## [2.0.5] - 2019-01-28

### `Added`

- [#127](https://github.com/nf-core/eager/pull/127) - Added a second test case for testing the pipeline properly
- [#129](https://github.com/nf-core/eager/pull/129) - Support BAM files as [input format](https://github.com/nf-core/eager/issues/41)
- [#131](https://github.com/nf-core/eager/pull/131) - Support different [reference genome file extensions](https://github.com/nf-core/eager/issues/130)

### `Fixed`

- [#128](https://github.com/nf-core/eager/issues/128) - Fixed reference genome handling errors

### `Dependencies`

- Picard Tools 2.18.21 -> 2.18.23
- R-Markdown 1.10 -> 1.11
- FastP 0.19.5 -> 0.19.6

## [2.0.4] - 2019-01-09

### `Added`

- [#111](https://github.com/nf-core/eager/pull/110) - Allow [Zipped FastA reference input](https://github.com/nf-core/eager/issues/91)
- [#113](https://github.com/nf-core/eager/pull/113) - All files are now staged via channels, which is considered best practice by Nextflow
- [#114](https://github.com/nf-core/eager/pull/113) - Add proper runtime defaults for multiple processes
- [#118](https://github.com/nf-core/eager/pull/118) - Add [centralized configs handling](https://github.com/nf-core/configs)
- [#115](https://github.com/nf-core/eager/pull/115) - Add DamageProfiler MultiQC support
- [#122](https://github.com/nf-core/eager/pull/122) - Add pulling from Dockerhub again

### `Fixed`

- [#110](https://github.com/nf-core/eager/pull/110) - Fix for [MultiQC Missing Second FastQC report](https://github.com/nf-core/eager/issues/107)
- [#112](https://github.com/nf-core/eager/pull/112) - Remove [redundant UDG options](https://github.com/nf-core/eager/issues/89)

## [2.0.3] - 2018-12-12

### `Added`

- [#80](https://github.com/nf-core/eager/pull/80) - BWA Index file handling
- [#77](https://github.com/nf-core/eager/pull/77) - Lots of documentation updates by [@jfy133](https://github.com/jfy133)
- [#81](https://github.com/nf-core/eager/pull/81) - Renaming of certain BAM options
- [#92](https://github.com/nf-core/eager/issues/92) - Complete restructure of BAM options

### `Fixed`

- [#84](https://github.com/nf-core/eager/pull/85) - Fix for [Samtools index issues](https://github.com/nf-core/eager/issues/84)
- [#96](https://github.com/nf-core/eager/issues/96) - Fix for [MarkDuplicates issues](https://github.com/nf-core/eager/issues/96) found by [@nilesh-tawari](https://github.com/nilesh-tawari)

### Other

- Added Slack button to repository readme

## [2.0.2] - 2018-11-03

### `Changed`

- [#70](https://github.com/nf-core/eager/issues/70) - Uninitialized `readPaths` warning removed

### `Added`

- [#73](https://github.com/nf-core/eager/pull/73) - Travis CI Testing of Conda Environment added

### `Fixed`

- [#72](https://github.com/nf-core/eager/issues/72) - iconv Issue with R in conda environment

## [2.0.1] - 2018-11-02

### `Fixed`

- [#69](https://github.com/nf-core/eager/issues/67) - FastQC issues with conda environments

## [2.0.0] "Kaufbeuren" - 2018-10-17

Initial release of nf-core/eager:

### `Added`

- FastQC read quality control
- (Optional) Read complexity filtering with FastP
- Read merging and clipping using AdapterRemoval v2
- Mapping using BWA / BWA Mem or CircularMapper
- Library Complexity Estimation with Preseq
- Conversion and Filtering of BAM files using Samtools
- Damage assessment via DamageProfiler, additional filtering using PMDTools
- Duplication removal via DeDup
- BAM Clipping with BamUtil for UDGhalf protocols
- QualiMap BAM quality control analysis

Furthermore, this already creates an interactive report using MultiQC, which will be upgraded in V2.1 "Ulm" to contain more aDNA specific metrics.


================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of Conduct at nf-core (v1.0)

## Our Pledge

In the interest of fostering an open, collaborative, and welcoming environment, we as contributors and maintainers of nf-core pledge to make participation in our projects and community a harassment-free experience for everyone, regardless of:

- Age
- Body size
- Familial status
- Gender identity and expression
- Geographical location
- Level of experience
- Nationality and national origins
- Native language
- Physical and neurological ability
- Race or ethnicity
- Religion
- Sexual identity and orientation
- Socioeconomic status

Please note that the list above is alphabetised and is therefore not ranked in any order of preference or importance.

## Preamble

> Note: This Code of Conduct (CoC) has been drafted by the nf-core Safety Officer and been edited after input from members of the nf-core team and others. "We", in this document, refers to the Safety Officer and members of the nf-core core team, both of whom are deemed to be members of the nf-core community and are therefore required to abide by this Code of Conduct. This document will be amended periodically to keep it up-to-date, and in case of any dispute, the most current version will apply.

An up-to-date list of members of the nf-core core team can be found [here](https://nf-co.re/about). Our current safety officer is Renuka Kudva.

nf-core is a young and growing community that welcomes contributions from anyone with a shared vision for [Open Science Policies](https://www.fosteropenscience.eu/taxonomy/term/8). Open science policies encompass inclusive behaviours and we strive to build and maintain a safe and inclusive environment for all individuals.

We have therefore adopted this code of conduct (CoC), which we require all members of our community and attendees in nf-core events to adhere to in all our workspaces at all times. Workspaces include but are not limited to Slack, meetings on Zoom, Jitsi, YouTube live etc.

Our CoC will be strictly enforced and the nf-core team reserve the right to exclude participants who do not comply with our guidelines from our workspaces and future nf-core activities.

We ask all members of our community to help maintain a supportive and productive workspace and to avoid behaviours that can make individuals feel unsafe or unwelcome. Please help us maintain and uphold this CoC.

Questions, concerns or ideas on what we can include? Contact safety [at] nf-co [dot] re

## Our Responsibilities

The safety officer is responsible for clarifying the standards of acceptable behaviour and is expected to take appropriate and fair corrective action in response to any instances of unacceptable behaviour.

The safety officer, in consultation with the nf-core core team, has the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviours that they deem inappropriate, threatening, offensive, or harmful.

Members of the core team or the safety officer who violate the CoC will be required to recuse themselves pending investigation. They will not have access to any reports of the violations and will be subject to the same actions as others in violation of the CoC.

## When and where does this Code of Conduct apply?

Participation in the nf-core community is contingent on following these guidelines in all our workspaces and events. This includes but is not limited to the following listed alphabetically and therefore in no order of preference:

- Communicating with an official project email address.
- Communicating with community members within the nf-core Slack channel.
- Participating in hackathons organised by nf-core (both online and in-person events).
- Participating in collaborative work on GitHub, Google Suite, community calls, mentorship meetings, email correspondence.
- Participating in workshops, training, and seminar series organised by nf-core (both online and in-person events). This applies to events hosted on web-based platforms such as Zoom, Jitsi, YouTube live etc.
- Representing nf-core on social media. This includes both official and personal accounts.

## nf-core cares 😊

nf-core's CoC and expectations of respectful behaviours for all participants (including organisers and the nf-core team) include but are not limited to the following (listed in alphabetical order):

- Ask for consent before sharing another community member’s personal information (including photographs) on social media.
- Be respectful of differing viewpoints and experiences. We are all here to learn from one another and a difference in opinion can present a good learning opportunity.
- Celebrate your accomplishments at events! (Get creative with your use of emojis 🎉 🥳 💯 🙌 !)
- Demonstrate empathy towards other community members. (We don’t all have the same amount of time to dedicate to nf-core. If tasks are pending, don’t hesitate to gently remind members of your team. If you are leading a task, ask for help if you feel overwhelmed.)
- Engage with and enquire after others. (This is especially important given the geographically remote nature of the nf-core community, so let’s do this the best we can)
- Focus on what is best for the team and the community. (When in doubt, ask)
- Graciously accept constructive criticism, yet be unafraid to question, deliberate, and learn.
- Introduce yourself to members of the community. (We’ve all been outsiders and we know that talking to strangers can be hard for some, but remember we’re interested in getting to know you and your visions for open science!)
- Show appreciation and **provide clear feedback**. (This is especially important because we don’t see each other in person and it can be harder to interpret subtleties. Also remember that not everyone understands a certain language to the same extent as you do, so **be clear in your communications to be kind.**)
- Take breaks when you feel like you need them.
- Use welcoming and inclusive language. (Participants are encouraged to display their chosen pronouns on Zoom or in communication on Slack.)

## nf-core frowns on 😕

The following behaviours from any participants within the nf-core community (including the organisers) will be considered unacceptable under this code of conduct. Engaging or advocating for any of the following could result in expulsion from nf-core workspaces.

- Deliberate intimidation, stalking or following and sustained disruption of communication among participants of the community. This includes hijacking shared screens through actions such as using the annotate tool in conferencing software such as Zoom.
- “Doxing” i.e. posting (or threatening to post) another person’s personal identifying information online.
- Spamming or trolling of individuals on social media.
- Use of sexual or discriminatory imagery, comments, or jokes and unwelcome sexual attention.
- Verbal and text comments that reinforce social structures of domination related to gender, gender identity and expression, sexual orientation, ability, physical appearance, body size, race, age, religion or work experience.

### Online Trolling

The majority of nf-core interactions and events are held online. Unfortunately, holding events online comes with the added issue of online trolling. This is unacceptable, reports of such behaviour will be taken very seriously, and perpetrators will be excluded from activities immediately.

All community members are required to ask members of the group they are working within for explicit consent prior to taking screenshots of individuals during video calls.

## Procedures for Reporting CoC violations

If someone makes you feel uncomfortable through their behaviours or actions, report it as soon as possible.

You can reach out to members of the [nf-core core team](https://nf-co.re/about) and they will forward your concerns to the safety officer(s).

Issues directly concerning members of the core team will be dealt with by other members of the core team and the safety officer, and possible conflicts of interest will be taken into account. nf-core is also in discussions about having an ombudsperson, and details will be shared in due course.

All reports will be handled with utmost discretion and confidentiality.

## Attribution and Acknowledgements

- The [Contributor Covenant, version 1.4](http://contributor-covenant.org/version/1/4)
- The [OpenCon 2017 Code of Conduct](http://www.opencon2017.org/code_of_conduct) (CC BY 4.0 OpenCon organisers, SPARC and Right to Research Coalition)
- The [eLife innovation sprint 2020 Code of Conduct](https://sprint.elifesciences.org/code-of-conduct/)
- The [Mozilla Community Participation Guidelines v3.1](https://www.mozilla.org/en-US/about/governance/policies/participation/) (version 3.1, CC BY-SA 3.0 Mozilla)

## Changelog

### v1.0 - March 12th, 2021

- Complete rewrite from original [Contributor Covenant](http://contributor-covenant.org/) CoC.


================================================
FILE: Dockerfile
================================================
FROM nfcore/base:1.14
LABEL authors="The nf-core/eager community" \
      description="Docker image containing all software requirements for the nf-core/eager pipeline"

# Install the conda environment
COPY environment.yml /
RUN conda env create --quiet -f /environment.yml && conda clean -a

# Add conda installation dir to PATH (instead of doing 'conda activate')
ENV PATH /opt/conda/envs/nf-core-eager-2.5.3/bin:$PATH

# Dump the details of the installed packages to a file for posterity
RUN conda env export --name nf-core-eager-2.5.3 > nf-core-eager-2.5.3.yml

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) The nf-core/eager community

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# ![nf-core/eager](docs/images/nf-core_eager_logo_outline_drop.png)

**A fully reproducible and state-of-the-art ancient DNA analysis pipeline**.

[![GitHub Actions CI Status](https://github.com/nf-core/eager/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/eager/actions)
[![GitHub Actions Linting Status](https://github.com/nf-core/eager/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/eager/actions)
[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A520.07.1-brightgreen.svg)](https://www.nextflow.io/)
[![nf-core](https://img.shields.io/badge/nf--core-pipeline-brightgreen.svg)](https://nf-co.re/)
[![DOI](https://zenodo.org/badge/135918251.svg)](https://zenodo.org/badge/latestdoi/135918251)
[![Published in PeerJ](https://img.shields.io/badge/peerj-published-%2300B2FF)](https://peerj.com/articles/10947/)

[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](https://bioconda.github.io/)
[![Docker](https://img.shields.io/docker/automated/nfcore/eager.svg)](https://hub.docker.com/r/nfcore/eager)
![Singularity Container available](https://img.shields.io/badge/singularity-available-7E4C74.svg)

[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23eager-4A154B?logo=slack)](https://nfcore.slack.com/channels/eager)

>[!IMPORTANT]  
> nf-core/eager versions 2.* are only compatible with Nextflow versions up to 22.10.6!

## Introduction

<!-- nf-core: Write a 1-2 sentence summary of what data the pipeline is for and what it does -->
**nf-core/eager** is a scalable and reproducible bioinformatics best-practise processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA (aDNA) data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes.

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. The pipeline pre-processes raw data from FASTQ inputs, or preprocessed BAM inputs. It can align reads and perform extensive general NGS and aDNA specific quality-control on the results. It comes with docker, singularity or conda containers making installation trivial and results highly reproducible.

<p align="center">
    <img src="docs/images/usage/eager2_workflow.png" alt="nf-core/eager schematic workflow" width="70%">
</p>

## Quick Start

1. Install [`nextflow`](https://nf-co.re/usage/installation) (`>=20.07.1` && `<=22.10.6`)

2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/), [`Podman`](https://podman.io/), [`Shifter`](https://nersc.gitlab.io/development/shifter/how-to-use/) or [`Charliecloud`](https://hpc.github.io/charliecloud/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_

3. Download the pipeline and test it on a minimal dataset with a single command:

    ```bash
    nextflow run nf-core/eager -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
    ```

    > Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.

4. Start running your own analysis!

    ```bash
    nextflow run nf-core/eager -profile <docker/singularity/podman/conda/institute> --input '*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta'
    ```

5. Once your run has completed successfully, clean up the intermediate files.

    ```bash
    nextflow clean -f -k
    ```
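
Since nf-core/eager 2.* only supports Nextflow versions from 20.07.1 up to 22.10.6, it can help to check a version string before launching. The following is a minimal sketch, not part of the pipeline: the function name `nxf_version_supported` is hypothetical, and it assumes a `sort` that supports `-V` (as in GNU coreutils). The commented line at the end relies on Nextflow's `NXF_VER` environment variable to pin the launcher to a supported release.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds (exit status 0) if the given Nextflow version
# lies within the range stated in the Quick Start (>=20.07.1 and <=22.10.6).
nxf_version_supported() {
    version="$1"
    min="20.07.1"
    max="22.10.6"
    # sort -V orders version strings; the version is in range when it does
    # not sort before the minimum and does not sort after the maximum.
    [ "$(printf '%s\n' "$min" "$version" | sort -V | head -n1)" = "$min" ] &&
    [ "$(printf '%s\n' "$version" "$max" | sort -V | head -n1)" = "$version" ]
}

# Example use (assumes Nextflow and Docker are installed):
#   if nxf_version_supported 22.10.6; then
#       NXF_VER=22.10.6 nextflow run nf-core/eager -profile test,docker
#   fi
```

Pinning with `NXF_VER` leaves any newer system-wide Nextflow installation untouched, which is convenient when other pipelines on the same machine need a later release.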

See [usage docs](https://nf-co.re/eager/usage) for all of the available options when running the pipeline.

**N.B.** You can see an overview of the run in the MultiQC report located at `./results/MultiQC/multiqc_report.html`

Modifications to the default pipeline are easily made using various options as described in the documentation.
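
The `--input` pattern in the Quick Start is wrapped in single quotes so that the shell hands it to Nextflow as one unexpanded string. A small sketch of the difference in bash, using hypothetical file names in a temporary directory (nothing here calls the pipeline itself):

```bash
#!/usr/bin/env bash

# Create a scratch directory with hypothetical paired-end file names.
demo_dir="$(mktemp -d)"
touch "$demo_dir"/sampleA_R1.fastq.gz "$demo_dir"/sampleA_R2.fastq.gz \
      "$demo_dir"/sampleB_R1.fastq.gz "$demo_dir"/sampleB_R2.fastq.gz

# Unquoted: bash expands the braces and the wildcard itself, so a command
# would receive one argument per matching file.
unquoted=( "$demo_dir"/*_R{1,2}.fastq.gz )

# Quoted: the pattern stays a single literal string, leaving the expansion
# to the program that receives it.
quoted=( "$demo_dir/*_R{1,2}.fastq.gz" )

echo "unquoted: ${#unquoted[@]} arguments"
echo "quoted:   ${#quoted[@]} argument"

rm -r "$demo_dir"
```

With the unquoted form, only the first expanded path would be taken as the value of `--input`, so keeping the quotes is the safer habit.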

## Pipeline Summary

### Default Steps

By default the pipeline currently performs the following:

* Create reference genome indices for mapping (`bwa`, `samtools`, and `picard`)
* Sequencing quality control (`FastQC`)
* Sequencing adapter removal, paired-end data merging (`AdapterRemoval`)
* Read mapping to reference (`bwa aln`, `bwa mem`, `CircularMapper`, or `bowtie2`)
* Post-mapping processing, statistics and conversion to bam (`samtools`)
* Ancient DNA C-to-T damage pattern visualisation (`DamageProfiler` or `mapDamage`)
* PCR duplicate removal (`DeDup` or `MarkDuplicates`)
* Post-mapping statistics and BAM quality control (`Qualimap`)
* Library Complexity Estimation (`preseq`)
* Overall pipeline statistics summaries (`MultiQC`)

### Additional Steps

Additional functionality contained by the pipeline currently includes:

#### Input

* Automatic merging of complex sequencing setups (e.g. multiple lanes, sequencing configurations, library types)

#### Preprocessing

* Illumina two-coloured sequencer poly-G tail removal (`fastp`)
* Post-AdapterRemoval trimming of FASTQ files prior to mapping (`fastp`)
* Automatic conversion of unmapped reads to FASTQ (`samtools`)
* Host DNA (mapped reads) stripping from input FASTQ files (for sensitive samples)

#### aDNA Damage manipulation

* Damage removal/clipping for UDG+/UDG-half treatment protocols (`BamUtil`)
* Damaged reads extraction and assessment (`PMDTools`)
* Nuclear DNA contamination estimation of human samples (`angsd`)

#### Genotyping

* Creation of VCF genotyping files (`GATK UnifiedGenotyper`, `GATK HaplotypeCaller` and `FreeBayes`)
* Creation of EIGENSTRAT genotyping files (`pileupCaller`)
* Creation of Genotype Likelihood files (`angsd`)
* Consensus sequence FASTA creation (`VCF2Genome`)
* SNP Table generation (`MultiVCFAnalyzer`)

#### Biological Information

* Mitochondrial to Nuclear read ratio calculation (`MtNucRatioCalculator`)
* Statistical sex determination of human individuals (`Sex.DetERRmine`)

#### Metagenomic Screening

* Low sequence complexity filtering (`BBduk`)
* Taxonomic binner with alignment (`MALT`)
* Taxonomic binner without alignment (`Kraken2`)
* aDNA characteristic screening of taxonomically binned data from MALT (`MaltExtract`)

#### Functionality Overview

A graphical overview of suggested routes through the pipeline depending on context can be seen below.

<p align="center">
    <img src="docs/images/usage/eager2_metromap_complex.png" alt="nf-core/eager metro map" width="70%">
</p>

## Documentation

The nf-core/eager pipeline comes with documentation about the pipeline: [usage](https://nf-co.re/eager/usage) and [output](https://nf-co.re/eager/output).

1. [Nextflow installation](https://nf-co.re/usage/installation)
2. Pipeline configuration
    * [Pipeline installation](https://nf-co.re/usage/local_installation)
    * [Adding your own system config](https://nf-co.re/usage/adding_own_config)
    * [Reference genomes](https://nf-co.re/usage/reference_genomes)
3. [Running the pipeline](https://nf-co.re/eager/usage)
   * This includes tutorials, FAQs, and troubleshooting instructions
4. [Output and how to interpret the results](https://nf-co.re/eager/output)

## Credits

This pipeline was mostly written by Alexander Peltzer ([apeltzer](https://github.com/apeltzer)) and [James A. Fellows Yates](https://github.com/jfy133), with contributions from [Stephen Clayton](https://github.com/sc13-bioinf), [Thiseas C. Lamnidis](https://github.com/TCLamnidis), [Maxime Borry](https://github.com/maxibor), [Zandra Fagernäs](https://github.com/ZandraFagernas), [Aida Andrades Valtueña](https://github.com/aidaanva) and [Maxime Garcia](https://github.com/MaxUlysse) and the nf-core community.

We thank the following people for their extensive assistance in the development
of this pipeline:

## Authors (alphabetical)

* [Aida Andrades Valtueña](https://github.com/aidaanva)
* [Alexander Peltzer](https://github.com/apeltzer)
* [James A. Fellows Yates](https://github.com/jfy133)
* [Judith Neukamm](https://github.com/JudithNeukamm)
* [Maxime Borry](https://github.com/maxibor)
* [Maxime Garcia](https://github.com/MaxUlysse)
* [Stephen Clayton](https://github.com/sc13-bioinf)
* [Thiseas C. Lamnidis](https://github.com/TCLamnidis)
* [Zandra Fagernäs](https://github.com/ZandraFagernas)

## Additional Contributors (alphabetical)

Those who have provided conceptual guidance, suggestions, bug reports etc.

* [Alex Hübner](https://github.com/alexhbnr)
* [Alexandre Gilardet](https://github.com/alexandregilardet)
* Arielle Munters
* [Åshild Vågene](https://github.com/ashildv)
* [Asmaa Ali](https://github.com/asmaa-a-abdelwahab)
* [Charles Plessy](https://github.com/charles-plessy)
* [Elina Salmela](https://github.com/esalmela)
* [Fabian Lehmann](https://github.com/Lehmann-Fabian)
* [He Yu](https://github.com/paulayu)
* [Hester van Schalkwyk](https://github.com/hesterjvs)
* [Ido Bar](https://github.com/IdoBar)
* [Irina Velsko](https://github.com/ivelsko)
* [Işın Altınkaya](https://github.com/isinaltinkaya)
* [Johan Nylander](https://github.com/nylander)
* [Jonas Niemann](https://github.com/NiemannJ)
* [Katherine Eaton](https://github.com/ktmeaton)
* [Kathrin Nägele](https://github.com/KathrinNaegele)
* [Kevin Lord](https://github.com/lordkev)
* [Laura Lacher](https://github.com/neija2611)
* [Luc Venturini](https://github.com/lucventurini)
* [Mahesh Binzer-Panchal](https://github.com/mahesh-panchal)
* [Marcel Keller](https://github.com/marcel-keller)
* [Megan Michel](https://github.com/meganemichel)
* [Pierre Lindenbaum](https://github.com/lindenb)
* [Pontus Skoglund](https://github.com/pontussk)
* [Raphael Eisenhofer](https://github.com/EisenRa)
* [Roberta Davidson](https://github.com/roberta-davidson)
* [Rodrigo Barquera](https://github.com/RodrigoBarquera)
* [Selina Carlhoff](https://github.com/scarlhoff)
* [Torsten Günther](https://bitbucket.org/tguenther)

If you've contributed and you're missing in here, please let us know and we will add you in of course!

## Contributions and Support

If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).

For further information or help, don't hesitate to get in touch on the [Slack `#eager` channel](https://nfcore.slack.com/channels/eager) (you can join with [this invite](https://nf-co.re/join/slack)).

## Citations

If you use `nf-core/eager` for your analysis, please cite the `eager` publication as follows:

> Fellows Yates JA, Lamnidis TC, Borry M, Valtueña Andrades A, Fagernäs Z, Clayton S, Garcia MU, Neukamm J, Peltzer A. 2021. Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager. PeerJ 9:e10947. DOI: [10.7717/peerj.10947](https://doi.org/10.7717/peerj.10947).

You can cite the eager zenodo record for a specific version using the following [doi: 10.5281/zenodo.3698082](https://zenodo.org/badge/latestdoi/135918251)

You can cite the `nf-core` publication as follows:

> **The nf-core framework for community-curated bioinformatics pipelines.**
>
> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
>
> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).

In addition, references of tools and data used in this pipeline are as follows:

* **EAGER v1, CircularMapper, DeDup** Peltzer, A., Jäger, G., Herbig, A., Seitz, A., Kniep, C., Krause, J., & Nieselt, K. (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17(1), 1–14. [https://doi.org/10.1186/s13059-016-0918-z](https://doi.org/10.1186/s13059-016-0918-z). Download: [https://github.com/apeltzer/EAGER-GUI](https://github.com/apeltzer/EAGER-GUI) and [https://github.com/apeltzer/EAGER-CLI](https://github.com/apeltzer/EAGER-CLI)
* **FastQC** Download: [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
* **AdapterRemoval v2** Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9, 88. [https://doi.org/10.1186/s13104-016-1900-2](https://doi.org/10.1186/s13104-016-1900-2). Download: [https://github.com/MikkelSchubert/adapterremoval](https://github.com/MikkelSchubert/adapterremoval)
* **bwa** Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics , 25(14), 1754–1760. [https://doi.org/10.1093/bioinformatics/btp324](https://doi.org/10.1093/bioinformatics/btp324). Download: [http://bio-bwa.sourceforge.net/bwa.shtml](http://bio-bwa.sourceforge.net/bwa.shtml)
* **SAMtools** Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics , 25(16), 2078–2079. [https://doi.org/10.1093/bioinformatics/btp352](https://doi.org/10.1093/bioinformatics/btp352). Download: [http://www.htslib.org/](http://www.htslib.org/)
* **DamageProfiler** Neukamm, J., Peltzer, A., & Nieselt, K. (2020). DamageProfiler: Fast damage pattern calculation for ancient DNA. In Bioinformatics (btab190). [https://doi.org/10.1093/bioinformatics/btab190](https://doi.org/10.1093/bioinformatics/btab190). Download: [https://github.com/Integrative-Transcriptomics/DamageProfiler](https://github.com/Integrative-Transcriptomics/DamageProfiler)
* **QualiMap** Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics , 32(2), 292–294. [https://doi.org/10.1093/bioinformatics/btv566](https://doi.org/10.1093/bioinformatics/btv566). Download: [http://qualimap.bioinfo.cipf.es/](http://qualimap.bioinfo.cipf.es/)
* **preseq** Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4), 325–327. [https://doi.org/10.1038/nmeth.2375](https://doi.org/10.1038/nmeth.2375). Download: [http://smithlabresearch.org/software/preseq/](http://smithlabresearch.org/software/preseq/)
* **PMDTools** Skoglund, P., Northoff, B. H., Shunkov, M. V., Derevianko, A. P., Pääbo, S., Krause, J., & Jakobsson, M. (2014). Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal. Proceedings of the National Academy of Sciences of the United States of America, 111(6), 2229–2234. [https://doi.org/10.1073/pnas.1318934111](https://doi.org/10.1073/pnas.1318934111). Download: [https://github.com/pontussk/PMDtools](https://github.com/pontussk/PMDtools)
* **MultiQC** Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. [https://doi.org/10.1093/bioinformatics/btw354](https://doi.org/10.1093/bioinformatics/btw354). Download: [https://multiqc.info/](https://multiqc.info/)
* **BamUtils** Jun, G., Wing, M. K., Abecasis, G. R., & Kang, H. M. (2015). An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data. Genome Research, 25(6), 918–925. [https://doi.org/10.1101/gr.176552.114](https://doi.org/10.1101/gr.176552.114). Download: [https://genome.sph.umich.edu/wiki/BamUtil](https://genome.sph.umich.edu/wiki/BamUtil)
* **FastP** Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics , 34(17), i884–i890. [https://doi.org/10.1093/bioinformatics/bty560](https://doi.org/10.1093/bioinformatics/bty560). Download: [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)
* **GATK 3.5** DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., … Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. [https://doi.org/10.1038/ng.806](https://doi.org/10.1038/ng.806). Download: [https://console.cloud.google.com/storage/browser/gatk](https://console.cloud.google.com/storage/browser/gatk)
* **GATK 4.X** - no citation available yet. Download: [https://github.com/broadinstitute/gatk/releases](https://github.com/broadinstitute/gatk/releases)
* **VCF2Genome** - Alexander Herbig and Alex Peltzer (unpublished). Download: [https://github.com/apeltzer/VCF2Genome](https://github.com/apeltzer/VCF2Genome)
* **MultiVCFAnalyzer** Bos, K.I. et al., 2014. Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis. Nature, 514(7523), pp.494–497. Available at: [http://dx.doi.org/10.1038/nature13591](http://dx.doi.org/10.1038/nature13591). Download: [https://github.com/alexherbig/MultiVCFAnalyzer](https://github.com/alexherbig/MultiVCFAnalyzer)
* **MTNucRatioCalculator** Alex Peltzer (Unpublished). Download: [https://github.com/apeltzer/MTNucRatioCalculator](https://github.com/apeltzer/MTNucRatioCalculator)
* **Sex.DetERRmine.py** Lamnidis, T.C. et al., 2018. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nature communications, 9(1), p.5018. Available at: [http://dx.doi.org/10.1038/s41467-018-07483-5](http://dx.doi.org/10.1038/s41467-018-07483-5). Download: [https://github.com/TCLamnidis/Sex.DetERRmine.git](https://github.com/TCLamnidis/Sex.DetERRmine.git)
* **ANGSD** Korneliussen, T.S., Albrechtsen, A. & Nielsen, R., 2014. ANGSD: Analysis of Next Generation Sequencing Data. BMC bioinformatics, 15, p.356. Available at: [http://dx.doi.org/10.1186/s12859-014-0356-4](http://dx.doi.org/10.1186/s12859-014-0356-4). Download: [https://github.com/ANGSD/angsd](https://github.com/ANGSD/angsd)
* **bedtools** Quinlan, A.R. & Hall, I.M., 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics , 26(6), pp.841–842. Available at: [http://dx.doi.org/10.1093/bioinformatics/btq033](http://dx.doi.org/10.1093/bioinformatics/btq033). Download: [https://github.com/arq5x/bedtools2/releases](https://github.com/arq5x/bedtools2/releases)
* **MALT**. Download: [https://software-ab.informatik.uni-tuebingen.de/download/malt/welcome.html](https://software-ab.informatik.uni-tuebingen.de/download/malt/welcome.html)
  * Vågene, Å.J. et al., 2018. Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature ecology & evolution, 2(3), pp.520–528. Available at: [http://dx.doi.org/10.1038/s41559-017-0446-6](http://dx.doi.org/10.1038/s41559-017-0446-6).
  * Herbig, A. et al., 2016. MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, p.050559. Available at: [http://biorxiv.org/content/early/2016/04/27/050559](http://biorxiv.org/content/early/2016/04/27/050559).
* **MaltExtract** Huebler, R. et al., 2019. HOPS: Automated detection and authentication of pathogen DNA in archaeological remains. bioRxiv, p.534198. Available at: [https://www.biorxiv.org/content/10.1101/534198v1?rss=1](https://www.biorxiv.org/content/10.1101/534198v1?rss=1). Download: [https://github.com/rhuebler/MaltExtract](https://github.com/rhuebler/MaltExtract)
* **Kraken2** Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. Available at: [https://doi.org/10.1186/s13059-019-1891-0](https://doi.org/10.1186/s13059-019-1891-0). Download: [https://ccb.jhu.edu/software/kraken2/](https://ccb.jhu.edu/software/kraken2/)
* **endorS.py** Aida Andrades Valtueña (Unpublished). Download: [https://github.com/aidaanva/endorS.py](https://github.com/aidaanva/endorS.py)
* **Bowtie2** Langmead, B. and Salzberg, S. L. 2012 Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), p. 357–359. doi: [10.1038/nmeth.1923](https://dx.doi.org/10.1038/nmeth.1923).
* **sequenceTools** Stephan Schiffels (Unpublished). Download: [https://github.com/stschiff/sequenceTools](https://github.com/stschiff/sequenceTools)
* **EigenstratDatabaseTools** Thiseas C. Lamnidis (Unpublished). Download: [https://github.com/TCLamnidis/EigenStratDatabaseTools.git](https://github.com/TCLamnidis/EigenStratDatabaseTools.git)
* **mapDamage** Jónsson, H., et al 2013. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics , 29(13), 1682–1684. [https://doi.org/10.1093/bioinformatics/btt193](https://doi.org/10.1093/bioinformatics/btt193)
* **BBduk** Brian Bushnell (Unpublished). Download: [https://sourceforge.net/projects/bbmap/](https://sourceforge.net/projects/bbmap/)

## Data References

This repository uses test data from the following studies:

* Fellows Yates, J. A. et al. (2017) ‘Central European Woolly Mammoth Population Dynamics: Insights from Late Pleistocene Mitochondrial Genomes’, Scientific reports, 7(1), p. 17714. [doi: 10.1038/s41598-017-17723-1](https://doi.org/10.1038/s41598-017-17723-1).
* Gamba, C. et al. (2014) ‘Genome flux and stasis in a five millennium transect of European prehistory’, Nature communications, 5, p. 5257. [doi: 10.1038/ncomms6257](https://doi.org/10.1038/ncomms6257).
* Star, B. et al. (2017) ‘Ancient DNA reveals the Arctic origin of Viking Age cod from Haithabu, Germany’, Proceedings of the National Academy of Sciences of the United States of America, 114(34), pp. 9152–9157. [doi: 10.1073/pnas.1710186114](https://doi.org/10.1073/pnas.1710186114).
* de Barros Damgaard, P. et al. (2018). '137 ancient human genomes from across the Eurasian steppes.', Nature, 557(7705), 369–374. [doi: 10.1038/s41586-018-0094-2](https://doi.org/10.1038/s41586-018-0094-2)


================================================
FILE: assets/angsd_resources/README
================================================
**These files are originally part of angsd (release 0.931). They have been added here for convenience.**

This file describes how the 'hapmap' and mappability files used by angsd are generated.

##download
wget http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/2010-08_phaseII+III/allele_freqs_chrX_CEU_r28_nr.b36_fwd.txt.gz
wget http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/2010-08_phaseII+III/allele_freqs_chr21_CEU_r28_nr.b36_fwd.txt.gz

#with the md5sum
a105316eaa2ebbdb3f8d62a9cb10a2d5  allele_freqs_chr21_CEU_r28_nr.b36_fwd.txt.gz
5a0f920951ce2ded4afe2f10227110ac  allele_freqs_chrX_CEU_r28_nr.b36_fwd.txt.gz


##create dummy bed file to use the liftover tools
gunzip -c allele_freqs_chrX_CEU_r28_nr.b36_fwd.txt.gz| awk '{print $2" "$3-1" "$3" "$11" "$12" "$4" "$14}'|sed 1d >allele.txt

##do the liftover
liftOver allele.txt /opt/liftover/hg18ToHg19.over.chain.gz hit nohit

##now remove invariable sites, and redundant columns
cut -f1,3 --complement hit |grep -v -P "\t1.0"|grep -v -P "\t0\t"|gzip -c  >HapMapchrX.gz


##create dummy bed file to use the liftover tools
gunzip -c allele_freqs_chr21_CEU_r28_nr.b36_fwd.txt.gz| awk '{print $2" "$3-1" "$3" "$11" "$12" "$4" "$14}'|sed 1d >allele.txt

##do the liftover
liftOver allele.txt /opt/liftover/hg18ToHg19.over.chain.gz hit nohit

##now remove invariable sites, and redundant columns
cut -f1,3 --complement hit |grep -v -P "\t1.0"|grep -v -P "\t0\t"|gzip -c  >HapMapchr21.gz
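
As an illustration only, the final filter above can also be expressed in Python. The helper name `filter_lifted_line` is hypothetical and not part of angsd. The sketch assumes the `hit` file written by liftOver is tab-separated with seven columns, and it reproduces the `cut` and `grep` steps literally (including the regex `.` in `1.0`).

```python
import re

# Patterns copied from the shell command:
#   grep -v -P "\t1.0"   and   grep -v -P "\t0\t"
_FIXED = re.compile(r"\t1.0")
_ZERO = re.compile(r"\t0\t")


def filter_lifted_line(line):
    """Return the reduced line, or None if the site is dropped."""
    fields = line.rstrip("\n").split("\t")
    # cut -f1,3 --complement: drop the 1st and 3rd columns
    kept = [f for i, f in enumerate(fields, start=1) if i not in (1, 3)]
    reduced = "\t".join(kept)
    # grep -v: discard lines matching either pattern
    if _FIXED.search(reduced) or _ZERO.search(reduced):
        return None
    return reduced


# Made-up example lines (not real HapMap data)
variable_site = filter_lifted_line("chrX\t99\t100\tA\t0.5\tT\t0.5\n")
fixed_site = filter_lifted_line("chrX\t199\t200\tC\t1.0\tG\t0\n")
```

In this sketch a dropped site is signalled with `None` so that the caller decides how to write the surviving lines (for example through `gzip.open`).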


#######
##download 100kmer mappability
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign100mer.bigWig

#md5sum
a1b1a8c99431fedf6a3b4baef028cca4  wgEncodeCrgMapabilityAlign100mer.bigWig

##download convert program
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph

##convert
./bigWigToBedGraph wgEncodeCrgMapabilityAlign100mer.bigWig chrX -chrom=chrX
./bigWigToBedGraph wgEncodeCrgMapabilityAlign100mer.bigWig chr21 -chrom=chr21

##only keep unique regions and discard the chr* column
grep -P "\t1$" chr21 |cut -f2-3 |gzip -c >chr21.unique.gz
grep -P "\t1$" chrX |cut -f2-3 |gzip -c >chrX.unique.gz
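
The two filtering commands above can be sketched in Python as follows. This is a minimal illustration and not part of angsd: the function name `unique_regions` is hypothetical, and the sketch assumes bedGraph lines of the form chrom, start, end, score separated by tabs, as written by bigWigToBedGraph.

```python
def unique_regions(bedgraph_lines):
    """Yield 'start<TAB>end' for regions whose mappability score is exactly 1."""
    # Mirrors: grep -P "\t1$" chrN | cut -f2-3
    for line in bedgraph_lines:
        fields = line.rstrip("\n").split("\t")
        # Expected columns: chrom, start, end, score
        if len(fields) == 4 and fields[3] == "1":
            yield f"{fields[1]}\t{fields[2]}"


# Made-up example lines (not real mappability data)
example = [
    "chr21\t100\t200\t1\n",
    "chr21\t200\t300\t0.5\n",
    "chr21\t300\t400\t1\n",
]
kept = list(unique_regions(example))
```

A generator is used so that a whole-chromosome bedGraph file can be streamed line by line instead of being loaded into memory.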


================================================
FILE: assets/angsd_resources/getALL.txt
================================================
F="ASW CEU CHB CHD GIH JPT LWK MEX MKK TSI YRI"
for f in $F
do 
    echo $f
    wget http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/2010-08_phaseII+III/allele_freqs_chrX_${f}_r28_nr.b36_fwd.txt.gz
done

cat allele*.gz >allele_freqs_chrX_ALL_r28_nr.b36_fwd.txt.gz

gunzip -c allele_freqs_chrX_ALL_r28_nr.b36_fwd.txt.gz| awk '{print $2" "$3-1" "$3" "$11" "$12" "$4" "$14}'|grep -v pos >allele.txt


/opt/liftover/liftOver allele.txt /opt/liftover/hg18ToHg19.over.chain.gz hit nohit
cut -f1,3 --complement hit |grep -v -P "\t1.0"|grep -v -P "\t0\t"|gzip -c  >HapMapALL.gz


================================================
FILE: assets/email_template.html
================================================
<html>
<head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <meta name="description" content="nf-core/eager: A fully reproducible and state-of-the-art ancient DNA analysis pipeline">
  <title>nf-core/eager Pipeline Report</title>
</head>
<body>
<div style="font-family: Helvetica, Arial, sans-serif; padding: 30px; max-width: 800px; margin: 0 auto;">

<img src="cid:nfcorepipelinelogo">

<h1>nf-core/eager v${version}</h1>
<h2>Run Name: $runName</h2>

<% if (!success){
    out << """
    <div style="color: #a94442; background-color: #f2dede; border-color: #ebccd1; padding: 15px; margin-bottom: 20px; border: 1px solid transparent; border-radius: 4px;">
        <h4 style="margin-top:0; color: inherit;">nf-core/eager execution completed unsuccessfully!</h4>
        <p>The exit status of the task that caused the workflow execution to fail was: <code>$exitStatus</code>.</p>
        <p>The full error message was:</p>
        <pre style="white-space: pre-wrap; overflow: visible; margin-bottom: 0;">${errorReport}</pre>
    </div>
    """
} else {
    out << """
    <div style="color: #3c763d; background-color: #dff0d8; border-color: #d6e9c6; padding: 15px; margin-bottom: 20px; border: 1px solid transparent; border-radius: 4px;">
        nf-core/eager execution completed successfully!
    </div>
    """
}
%>

<p>The workflow was completed at <strong>$dateComplete</strong> (duration: <strong>$duration</strong>)</p>
<p>The command used to launch the workflow was as follows:</p>
<pre style="white-space: pre-wrap; overflow: visible; background-color: #ededed; padding: 15px; border-radius: 4px; margin-bottom:30px;">$commandLine</pre>

<h3>Pipeline Configuration:</h3>
<table style="width:100%; max-width:100%; border-spacing: 0; border-collapse: collapse; border:0; margin-bottom: 30px;">
    <tbody style="border-bottom: 1px solid #ddd;">
        <% out << summary.collect{ k,v -> "<tr><th style='text-align:left; padding: 8px 0; line-height: 1.42857143; vertical-align: top; border-top: 1px solid #ddd;'>$k</th><td style='text-align:left; padding: 8px; line-height: 1.42857143; vertical-align: top; border-top: 1px solid #ddd;'><pre style='white-space: pre-wrap; overflow: visible;'>$v</pre></td></tr>" }.join("\n") %>
    </tbody>
</table>

<p>nf-core/eager</p>
<p><a href="https://github.com/nf-core/eager">https://github.com/nf-core/eager</a></p>

</div>

</body>
</html>


================================================
FILE: assets/email_template.txt
================================================
----------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~\\
  |\\ | |__  __ /  ` /  \\ |__) |__         }  {
  | \\| |       \\__, \\__/ |  \\ |___     \\`-._,-`-,
                                        `._,._,'
  nf-core/eager v${version}
----------------------------------------------------

Run Name: $runName

<% if (success){
    out << "## nf-core/eager execution completed successfully! ##"
} else {
    out << """####################################################
## nf-core/eager execution completed unsuccessfully! ##
####################################################
The exit status of the task that caused the workflow execution to fail was: $exitStatus.
The full error message was:

${errorReport}
"""
} %>


The workflow was completed at $dateComplete (duration: $duration)

The command used to launch the workflow was as follows:

  $commandLine


Pipeline Configuration:
-----------------------
<% out << summary.collect{ k,v -> " - $k: $v" }.join("\n") %>

--
nf-core/eager
https://github.com/nf-core/eager


================================================
FILE: assets/multiqc_config.yaml
================================================
custom_logo: "nf-core_eager_logo_outline_drop.png"
custom_logo_url: https://github.com/nf-core/eager/
custom_logo_title: "nf-core/eager"

report_comment: >
  This report has been generated by the <a href="https://github.com/nf-core/eager" target="_blank">nf-core/eager</a>
  analysis pipeline. For information about how to interpret these results, please see the
  <a href="https://github.com/nf-core/eager" target="_blank">documentation</a>.
run_modules:
  - adapterRemoval
  - bowtie2
  - custom_content
  - damageprofiler
  - dedup
  - fastp
  - fastqc
  - gatk
  - kraken
  - malt
  - mapdamage
  - mtnucratio
  - multivcfanalyzer
  - picard
  - preseq
  - qualimap
  - samtools
  - sexdeterrmine
  - hops
  - bcftools

extra_fn_clean_exts:
  - "_fastp"
  - ".pe.settings"
  - ".se.settings"
  - ".settings"
  - ".pe.combined"
  - ".se.truncated"
  - ".mapped"
  - ".mapped_rmdup"
  - ".mapped_rmdup_stats"
  - "_libmerged_rg_rmdup"
  - "_libmerged_rg_rmdup_stats"
  - "_postfilterflagstat.stats"
  - "_flagstat.stat"
  - ".filtered"
  - ".filtered_rmdup"
  - ".filtered_rmdup_stats"
  - "_libmerged_rg_add"
  - "_libmerged_rg_add_stats"
  - "_rmdup"
  - ".unmapped"
  - ".fastq.gz"
  - ".fastq"
  - ".fq.gz"
  - ".fq"
  - ".bam"
  - ".kreport"
  - ".unifiedgenotyper"
  - ".trimmed_stats"
  - "_libmerged"
  - "_bt2"
  - type: "regex"
    pattern: "_udg(half|none|full)"

top_modules:
  - "fastqc":
      name: "FastQC (pre-Trimming)"
      path_filters:
        - "*_raw_fastqc.zip"
  - "fastp"
  - "adapterRemoval"
  - "fastqc":
      name: "FastQC (post-Trimming)"
      path_filters:
        - "*.truncated_fastqc.zip"
        - "*.combined*_fastqc.zip"
        - "*_postartrimmed_fastqc.zip"
  - "bowtie2":
      path_filters:
        - "*_bt2.log"
  - "malt"
  - "hops"
  - "kraken"
  - "samtools":
      name: "Samtools Flagstat (pre-samtools filter)"
      path_filters:
        - "*_flagstat.stats"
  - "samtools":
      name: "Samtools Flagstat (post-samtools filter)"
      path_filters:
        - "*_postfilterflagstat.stats"
  - "dedup"
  - "picard"
  - "preseq":
      path_filters:
        - "*.preseq"
  - "damageprofiler"
  - "mapdamage"
  - "mtnucratio"
  - "qualimap"
  - "sexdeterrmine"
  - "bcftools"
  - "multivcfanalyzer":
      path_filters:
        - "*MultiVCFAnalyzer.json"
qualimap_config:
  general_stats_coverage:
    - 1
    - 2
    - 3
    - 4
    - 5

remove_sections:
  - sexdeterrmine-snps

table_columns_visible:
  FastQC (pre-Trimming):
    percent_duplicates: False
    percent_gc: True
    avg_sequence_length: True
  fastp:
    pct_duplication: False
    after_filtering_gc_content: True
    pct_surviving: False
  Adapter Removal:
    aligned_total: False
    percent_aligned: True
  FastQC (post-Trimming):
    avg_sequence_length: True
    percent_duplicates: False
    total_sequences: True
    percent_gc: True
  bowtie2:
    overall_alignment_rate: True
  MALT:
    Taxonomic assignment success: False
    Assig. Taxonomy: False
    Mappability: True
    Total reads: False
    Num. of queries: False
  Kraken:
    "% Unclassified": True
    "% Top 5": False
  Samtools Flagstat (pre-samtools filter):
    flagstat_total: True
    mapped_passed: True
  Samtools Flagstat (post-samtools filter):
    mapped_passed: True
  DeDup:
    dup_rate: False
    clusterfactor: True
    mapped_after_dedup: True
  Picard:
    PERCENT_DUPLICATION: True
  DamageProfiler:
    5 Prime1: True
    5 Prime2: True
    3 Prime1: False
    3 Prime2: False
    mean_readlength: True
    median: True
  mapDamage:
    5 Prime1: True
    5 Prime2: True
    3 Prime1: False
    3 Prime2: False
  mtnucratio:
    mt_nuc_ratio: True
  QualiMap:
    mapped_reads: True
    mean_coverage: True
    1_x_pc: True
    5_x_pc: True
    percentage_aligned: False
    median_insert_size: False
  MultiVCFAnalyzer:
    Heterozygous SNP alleles (percent): True
  endorSpy:
    endogenous_dna: True
    endogenous_dna_post: True
  nuclear_contamination:
    Num_SNPs: True
    Method1_MOM_estimate: False
    Method1_MOM_SE: False
    Method1_ML_estimate: True
    Method1_ML_SE: True
    Method2_MOM_estimate: False
    Method2_MOM_SE: False
    Method2_ML_estimate: False
    Method2_ML_SE: False
  snp_coverage:
    Covered_Snps: True
    Total_Snps: False

table_columns_placement:
  FastQC (pre-Trimming):
    total_sequences: 100
    avg_sequence_length: 110
    percent_gc: 120
  fastp:
    after_filtering_gc_content: 200
  Adapter Removal:
    percent_aligned: 300
  FastQC (post-Trimming):
    total_sequences: 400
    avg_sequence_length: 410
    percent_gc: 420
  Bowtie 2 / HiSAT2:
    overall_alignment_rate: 450
  MALT:
    Num. of queries: 430
    Total reads: 440
    Mappability: 450
    Assig. Taxonomy: 460
    Taxonomic assignment success: 470
  Kraken:
    "% Unclassified": 480
  Samtools Flagstat (pre-samtools filter):
    flagstat_total: 551
    mapped_passed: 552
  Samtools Flagstat (post-samtools filter):
    flagstat_total: 600
    mapped_passed: 620
  endorSpy:
    endogenous_dna: 610
    endogenous_dna_post: 640
  nuclear_contamination:
    Num_SNPs: 1100
    Method1_MOM_estimate: 1110
    Method1_MOM_SE: 1120
    Method1_ML_estimate: 1130
    Method1_ML_SE: 1140
    Method2_MOM_estimate: 1150
    Method2_MOM_SE: 1160
    Method2_ML_estimate: 1170
    Method2_ML_SE: 1180
  snp_coverage:
    Covered_Snps: 1050
    Total_Snps: 1060
  DeDup:
    mapped_after_dedup: 620
    clusterfactor: 630
  Picard:
    PERCENT_DUPLICATION: 650
  DamageProfiler:
    5 Prime1: 700
    5 Prime2: 710
    3 Prime1: 720
    3 Prime2: 730
    mean_readlength: 740
    median: 750
  mapDamage:
    5 Prime1: 760
    5 Prime2: 765
    3 Prime1: 770
    3 Prime2: 775
  mtnucratio:
    mtreads: 780
    mt_cov_avg: 785
    mt_nuc_ratio: 790
  QualiMap:
    mapped_reads: 800
    mean_coverage: 805
    median_coverage: 810
    1_x_pc: 820
    2_x_pc: 830
    3_x_pc: 840
    4_x_pc: 850
    5_x_pc: 860
    avg_gc: 870
  sexdeterrmine:
    RateX: 1000
    RateY: 1010
  MultiVCFAnalyzer:
    Heterozygous SNP alleles (percent): 1200
read_count_multiplier: 1
read_count_prefix: ""
read_count_desc: ""
ancient_read_count_prefix: ""
ancient_read_count_desc: ""
ancient_read_count_multiplier: 1
decimalPoint_format: "."
thousandsSep_format: ","
report_section_order:
  software_versions:
    order: -1000
  nf-core-eager-summary:
    order: -1001
export_plots: true
table_columns_name:
  FastQC (pre-Trimming):
    total_sequences: "Nr. Input Reads"
    avg_sequence_length: "Length Input Reads"
    percent_gc: "% GC Input Reads"
    percent_duplicates: "% Dups Input Reads"
    percent_fails: "% Failed Input Reads"
  FastQC (post-Trimming):
    total_sequences: "Nr. Processed Reads"
    avg_sequence_length: "Length Processed Reads"
    percent_gc: "% GC Processed Reads"
    percent_duplicates: "% Dups Processed Reads"
    percent_fails: "% Failed Processed Reads"
  Samtools Flagstat (pre-samtools filter):
    flagstat_total: "Nr. Reads Into Mapping"
    mapped_passed: "Nr. Mapped Reads"
  Samtools Flagstat (post-samtools filter):
    flagstat_total: "Nr. Mapped Reads Post-Filter"
    mapped_passed: "Nr. Mapped Reads Passed Post-Filter"
  Endogenous DNA Post (%):
    endogenous_dna_post (%): "Endogenous DNA Post-Filter (%)"
  Picard:
    PERCENT_DUPLICATION: "% Dup. Mapped Reads"
  DamageProfiler:
    mean_readlength: "Mean Length Mapped Reads"
    median_readlength: "Median Length Mapped Reads"
  QualiMap:
    mapped_reads: "Nr. Dedup. Mapped Reads"
    total_reads: "Nr. Dedup. Total Reads"
    avg_gc: "% GC Dedup. Mapped Reads"
  Bcftools Stats:
    number_of_records: "Nr. Overall Variants"
    number_of_SNPs: "Nr. SNPs"
    number_of_indels: "Nr. InDels"
  MALT:
    Mappability: "% Metagenomic Mappability"
  SexDetErrmine:
    RateErrX: "SexDet Err X Chr"
    RateErrY: "SexDet Err Y Chr"
    RateX: "SexDet Rate X Chr"
    RateY: "SexDet Rate Y Chr"
  custom_table_header_config:
    general_stats_table:
      median_coverage:
        format: "{:,.3f}"
      mean_coverage:
        format: "{:,.3f}"


================================================
FILE: assets/nf-core_eager_dummy.txt
================================================
This is a dummy file for when we need a 'fake' file to satisfy all nextflow channel inputs being filled, even if we actually only use one.

================================================
FILE: assets/nf-core_eager_dummy2.txt
================================================
This is a second dummy file for when we need a 'fake' file to satisfy all nextflow channel inputs being filled, even if we actually only use one.

================================================
FILE: assets/sendmail_template.txt
================================================
To: $email
Subject: $subject
Mime-Version: 1.0
Content-Type: multipart/related;boundary="nfcoremimeboundary"

--nfcoremimeboundary
Content-Type: text/html; charset=utf-8

$email_html

--nfcoremimeboundary
Content-Type: image/png;name="nf-core-eager_logo.png"
Content-Transfer-Encoding: base64
Content-ID: <nfcorepipelinelogo>
Content-Disposition: inline; filename="nf-core-eager_logo.png"

<% out << new File("$projectDir/assets/nf-core-eager_logo.png").
  bytes.
  encodeBase64().
  toString().
  tokenize( '\n' )*.
  toList()*.
  collate( 76 )*.
  collect { it.join() }.
  flatten().
  join( '\n' ) %>

<%
if (mqcFile){
def mqcFileObj = new File("$mqcFile")
if (mqcFileObj.length() < mqcMaxSize){
out << """
--nfcoremimeboundary
Content-Type: text/html; name=\"multiqc_report\"
Content-Transfer-Encoding: base64
Content-ID: <mqcreport>
Content-Disposition: attachment; filename=\"${mqcFileObj.getName()}\"

${mqcFileObj.
  bytes.
  encodeBase64().
  toString().
  tokenize( '\n' )*.
  toList()*.
  collate( 76 )*.
  collect { it.join() }.
  flatten().
  join( '\n' )}
"""
}}
%>

--nfcoremimeboundary--


================================================
FILE: assets/where_are_my_files.txt
================================================
=====================
 Where are my files?
=====================

By default, the nf-core/eager pipeline does not save large intermediate files to the
results directory. This is to try to conserve disk space.

These files can be found in the pipeline `work` directory if needed.
Alternatively, re-run the pipeline using `-resume` in addition to one of
the below command-line options and they will be copied into the results directory:

`--saveReference`
Save any downloaded or generated reference genome files to your results folder.
These can then be used for future pipeline runs, reducing processing times.

-----------------------------------
 Setting defaults in a config file
-----------------------------------
If you would always like these files to be saved without having to specify this on
the command line, you can save the following to your personal configuration file
(e.g. `~/.nextflow/config`):

params.saveReference = true

For more help, see the following documentation:

https://github.com/nf-core/eager/blob/master/docs/usage.md
https://www.nextflow.io/docs/latest/getstarted.html
https://www.nextflow.io/docs/latest/config.html


================================================
FILE: bin/endorS.py
================================================
#!/usr/bin/env python3

# Written by Aida Andrades Valtueña and released under MIT license. 
# See git repository (https://github.com/aidaanva/endorS.py) for full license text.

"""Script to calculate the endogenous DNA in a sample from samtools flagstat files.
It accepts up to two files: pre-quality and post-quality filtering. We recommend
using both files, but you can also supply only the pre-quality filtering file.
"""
import re
import sys
import json
import argparse
import textwrap

parser = argparse.ArgumentParser(prog='endorS.py',
   usage='python %(prog)s [-h] [--version] <samplesfile>.stats [<samplesfile>.stats]',
   formatter_class=argparse.RawDescriptionHelpFormatter,
   description=textwrap.dedent('''\
   author:
     Aida Andrades Valtueña (aida.andrades[at]gmail.com)

   description:
     %(prog)s calculates endogenous DNA from samtools flagstat files and prints it to screen
     Use --output flag to write results to a file
   '''))
parser.add_argument('samtoolsfiles', metavar='<samplefile>.stats', type=str, nargs='+',
                    help='output of samtools flagstat in a txt file (at least one required). If two files are supplied, the mapped reads of the second file are divided by the total reads in the first, since it is assumed that both <samplefile.stats> files relate to the same sample. Useful after BAM filtering')
parser.add_argument('-v','--version', action='version', version='%(prog)s 0.4')
parser.add_argument('--output', '-o', nargs='?', help='specify a file format for an output file. Options: <json> for a MultiQC json output. Default: none')
parser.add_argument('--name', '-n', nargs='?', help='specify name for the output file. Default: extracted from the first samtools flagstat file provided')
args = parser.parse_args()

#Open the samtools flag stats pre-quality filtering:
try:
    with open(args.samtoolsfiles[0], 'r') as pre:
        contentsPre = pre.read()
    #Extract number of total reads
    totalReads = float((re.findall(r'^([0-9]+) \+ [0-9]+ in total',contentsPre))[0])
    #Extract number of mapped reads pre-quality filtering:
    mappedPre = float((re.findall(r'([0-9]+) \+ [0-9]+ mapped ',contentsPre))[0])
    #Calculation of endogenous DNA pre-quality filtering:
    if totalReads == 0.0:
        endogenousPre = 0.000000
        print("WARNING: no reads in the fastq input, Endogenous DNA raw (%) set to 0.000000")
    elif mappedPre == 0.0:
        endogenousPre = 0.000000
        print("WARNING: no mapped reads, Endogenous DNA raw (%) set to 0.000000")
    else:
        endogenousPre = float("{0:.6f}".format(round((mappedPre / totalReads * 100), 6)))
except:
    print("Incorrect input, please provide at least a samtools flag stats as input\nRun:\npython endorS.py --help \nfor more information on how to run this script")
    sys.exit()
#Check if the samtools stats post-quality filtering have been provided:
try:
    #Open the samtools flag stats post-quality filtering:
    with open(args.samtoolsfiles[1], 'r') as post:
        contentsPost = post.read()
    #Extract number of mapped reads post-quality filtering:
    mappedPost = float((re.findall(r'([0-9]+) \+ [0-9]+ mapped',contentsPost))[0])
    #Calculation of endogenous DNA post-quality filtering:
    if totalReads == 0.0:
        endogenousPost = 0.000000
        print("WARNING: no reads in the fastq input, Endogenous DNA modified (%) set to 0.000000")
    elif mappedPost == 0.0:
        endogenousPost = 0.000000
        print("WARNING: no mapped reads, Endogenous DNA modified (%) set to 0.000000")
    else:
        endogenousPost = float("{0:.6f}".format(round((mappedPost / totalReads * 100),6)))
except:
    print("Only one samtools flagstat file provided")
    #Mark the post-quality filtering value as not available ("NA") if the
    #second samtools flagstat file was not provided:
    mappedPost = "NA"

#Setting the name depending on the --name flag:
if args.name is not None:
    name = args.name
else:
    #Set up the name based on the first samtools flagstats:
    name= str(((args.samtoolsfiles[0].rsplit(".",1)[0]).rsplit("/"))[-1])
#print(name)


if mappedPost == "NA":
    #Creating the json file
    jsonOutput={
    "id": "endorSpy",
    "plot_type": "generalstats",
    "pconfig": {
        "endogenous_dna": { "max": 100, "min": 0, "title": "Endogenous DNA (%)", "format": '{:,.2f}'}
    },
    "data": {
        name : { "endogenous_dna": endogenousPre}
    }
    }
else:
    #Creating the json file
    jsonOutput={
    "id": "endorSpy",
    "plot_type": "generalstats",
    "pconfig": {
        "endogenous_dna": { "max": 100, "min": 0, "title": "Endogenous DNA (%)", "format": '{:,.2f}'},
        "endogenous_dna_post": { "max": 100, "min": 0, "title": "Endogenous DNA Post (%)", "format": '{:,.2f}'}
    },
    "data": {
        name : { "endogenous_dna": endogenousPre, "endogenous_dna_post": endogenousPost}
    },
    }
#Check whether an output file was requested (otherwise print to screen):
if args.output is not None:
   #Create a file named after the name variable
   #and write the json output to it:
   fileName = name + "_endogenous_dna_mqc.json"
   #print(fileName)
   with open(fileName, "w+") as outfile:
      json.dump(jsonOutput, outfile)
      print(fileName,"has been generated")
else:
   if mappedPost == "NA":
      print("Endogenous DNA (%):",endogenousPre)
   else:
      print("Endogenous DNA raw (%):",endogenousPre)
      print("Endogenous DNA modified (%):",endogenousPost)


================================================
FILE: bin/extract_map_reads.py
================================================
#!/usr/bin/env python3

# Written by Maxime Borry and released under the MIT license.
# See git repository (https://github.com/nf-core/eager) for full license text.

import argparse
import pysam
from xopen import xopen
import logging
import os
from pathlib import Path


def _get_args():
    """This function parses and returns arguments passed in"""
    parser = argparse.ArgumentParser(
        prog="extract_mapped_reads",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description="Remove reads mapped in bam file from fastq files",
    )
    parser.add_argument("bam_file", help="path to bam file")
    parser.add_argument("fwd", help="path to forward fastq file")
    parser.add_argument(
        "-merged",
        dest="merged",
        default=False,
        action="store_true",
        help="specify if bam file was created from merged fastq files",
    )
    parser.add_argument(
        "-rev", dest="rev", default=None, help="path to reverse fastq file"
    )
    parser.add_argument(
        "-of", dest="out_fwd", default=None, help="path to forward output fastq file"
    )
    parser.add_argument(
        "-or", dest="out_rev", default=None, help="path to reverse output fastq file"
    )
    parser.add_argument(
        "-m",
        dest="mode",
        default="remove",
        help="Read removal mode: remove reads (remove) or replace sequence by N (replace). Default = remove",
    )
    parser.add_argument(
        "-t", dest="threads", default=4, help="Number of parallel threads"
    )

    args = parser.parse_args()

    bam = args.bam_file
    in_fwd = args.fwd
    merged = args.merged
    in_rev = args.rev
    out_fwd = args.out_fwd
    out_rev = args.out_rev
    mode = args.mode
    threads = int(args.threads)

    return (bam, in_fwd, merged, in_rev, out_fwd, out_rev, mode, threads)


def extract_mapped(bamfile, merged):
    """Get the names of mapped reads from a bam file
    Args:
        bamfile(str): path to bam alignment file
        merged(bool): True if bam file was created from merged fastq files
    Returns:
        mapped_reads(set): set of mapped read names (str)
    """
    if bamfile.endswith(".bam") or bamfile.endswith(".gz"):
        read_mode = "rb"
    else:
        read_mode = "r"
    mapped_reads = set()
    bamfile = pysam.AlignmentFile(bamfile, mode=read_mode)
    for read in bamfile.fetch():
        if read.flag != 4:
            if merged:
                if read.query_name.startswith("M_"):
                    mapped_reads.add(read.query_name[2:])
                elif read.query_name.startswith("MT_"):
                    mapped_reads.add(read.query_name[3:])
                else:
                    mapped_reads.add(read.query_name)
            else:
                mapped_reads.add(read.query_name)
    return mapped_reads


def read_write_fq(fq_in, fq_out, mapped_reads, mode, write_mode, proc):
    """
    Read and write fastq file with mapped reads removed
    Args:
        fq_in(str): path to input fastq file
        fq_out(str): path to output fastq file
        mapped_reads(set): set of mapped read names (str)
        mode(str): read removal mode (remove or replace)
        write_mode(str): write mode (w or wb)
        proc(int): number of parallel processes
    """
    if write_mode == "w":
        cm = open(fq_out, write_mode)
    elif write_mode == "wb":
        cm = xopen(fq_out, mode=write_mode, threads=proc)
    with pysam.FastxFile(fq_in) as fh:
        with cm as fh_out:
            for read in fh:
                try:
                    if read.name in mapped_reads:
                        if mode == "replace":
                            read.sequence = "N" * len(read.sequence)
                            read = str(read) + "\n"
                            if write_mode == "w":
                                fh_out.write(read)
                            elif write_mode == "wb":
                                fh_out.write(read.encode())
                    else:
                        read = str(read) + "\n"
                        if write_mode == "w":
                            fh_out.write(read)
                        elif write_mode == "wb":
                            fh_out.write(read.encode())
                except Exception as e:
                    logging.error(f"Problem with {str(read)}")
                    logging.error(e)

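def check_remove_mode(mode):
    """Normalise the removal mode to lower case and log a message if it is not recognised"""
    valid_modes = ["replace", "remove"]
    if mode.lower() not in valid_modes:
        logging.info(f"Mode must be {' or '.join(valid_modes)}")
    return mode.lower()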


if __name__ == "__main__":
    BAM, IN_FWD, MERGED, IN_REV, OUT_FWD, OUT_REV, MODE, PROC = _get_args()

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    if OUT_FWD is None:
        out_fwd = os.path.join(os.getcwd(), Path(IN_FWD).stem + ".r1.fq.gz")
    else:
        out_fwd = OUT_FWD

    if out_fwd.endswith(".gz"):
        write_mode = "wb"
    else:
        write_mode = "w"

    remove_mode = check_remove_mode(MODE)

    # FORWARD OR SE FILE
    logging.info(f"- Extracting mapped reads from {BAM}")
    mapped_reads = extract_mapped(BAM, merged=MERGED)
    logging.info(f"- Checking forward fq file {IN_FWD}")
    read_write_fq(
        fq_in=IN_FWD,
        fq_out=out_fwd,
        mapped_reads=mapped_reads,
        mode=remove_mode,
        write_mode=write_mode,
        proc=PROC,
    )
    logging.info(f"- Cleaned forward FastQ file written to {out_fwd}")

    # REVERSE FILE
    if IN_REV:
        if OUT_REV is None:
            out_rev = os.path.join(os.getcwd(), Path(IN_REV).stem + ".r2.fq.gz")
        else:
            out_rev = OUT_REV
        logging.info(f"- Checking reverse fq file {IN_REV}")
        read_write_fq(
            fq_in=IN_REV,
            fq_out=out_rev,
            mapped_reads=mapped_reads,
            mode=remove_mode,
            write_mode=write_mode,
            proc=PROC,
        )
        logging.info(f"- Cleaned reverse FastQ file written to {out_rev}")


================================================
FILE: bin/filter_bam_fragment_length.py
================================================
#!/usr/bin/env python3

# Written by Maxime Borry and released under the MIT license. 
# See git repository (https://github.com/nf-core/eager) for full license text.

import argparse
import pysam


def get_args():
    """This function parses and returns arguments passed in"""
    parser = argparse.ArgumentParser(
        prog="bam_filter", description="Filter bam on fragment length"
    )
    parser.add_argument("bam", help="Bam alignment file")
    parser.add_argument(
        "-l",
        dest="fraglen",
        default=35,
        type=int,
        help="Minimum fragment length. Default = 35",
    )
    parser.add_argument(
        "-a",
        dest="all",
        default=False,
        action="store_true",
        help="Include all reads, even unmapped",
    )
    parser.add_argument(
        "-o",
        dest="output",
        default=None,
        help="Output bam basename. Default = {bam_basename}.filtered.bam",
    )

    args = parser.parse_args()

    bam = args.bam
    fraglen = args.fraglen
    allreads = args.all
    outfile = args.output

    return (bam, fraglen, allreads, outfile)


def getBasename(file_name):
    if ("/") in file_name:
        basename = file_name.split("/")[-1].split(".")[0]
    else:
        basename = file_name.split(".")[0]
    return basename


def filter_bam(infile, outfile, fraglen, allreads):
    """Write reads passing the length filter to a new bam file

    Args:
        infile (str): Path to input bam
        outfile (str): Basename of output bam (".filtered.bam" is appended)
        fraglen(int): Minimum fragment length to keep
        allreads(bool): Apply on all reads, not only mapped
    """
    bamfile = pysam.AlignmentFile(infile, "rb")
    bamwrite = pysam.AlignmentFile(outfile + ".filtered.bam", "wb", template=bamfile)

    for read in bamfile.fetch(until_eof=True):
        if allreads:
            if read.query_length >= fraglen:
                bamwrite.write(read)
        else:
            if not read.is_unmapped and read.query_length >= fraglen:
                bamwrite.write(read)


if __name__ == "__main__":
    BAM, FRAGLEN, ALLREADS, OUTFILE = get_args()

    BAMFILE = pysam.AlignmentFile(BAM, "rb")

    if OUTFILE is None:
        OUTFILE = getBasename(BAM)

    filter_bam(BAM, OUTFILE, FRAGLEN, ALLREADS)


================================================
FILE: bin/kraken_parse.py
================================================
#!/usr/bin/env python

# Written by Maxime Borry and released under the MIT license. 
# See git repository (https://github.com/nf-core/eager) for full license text.

import argparse
import csv

def _get_args():
    '''This function parses and returns arguments passed in'''
    parser = argparse.ArgumentParser(
        prog='kraken_parse',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description='Parsing kraken')
    parser.add_argument('krakenReport', help="path to kraken report file")
    parser.add_argument(
        '-c',
        dest="count",
        default=50,
        help="Minimum number of hits on clade to report it. Default = 50")
    parser.add_argument(
        '-or',
        dest="readout",
        default=None,
        help="Read count output file. Default = <basename>.read_kraken_parsed.csv")
    parser.add_argument(
        '-ok',
        dest="kmerout",
        default=None,
        help="Kmer Output file. Default = <basename>.kmer_kraken_parsed.csv")

    args = parser.parse_args()

    infile = args.krakenReport
    countlim = int(args.count)
    readout = args.readout
    kmerout = args.kmerout

    return(infile, countlim, readout, kmerout)


def _get_basename(file_name):
    if ("/") in file_name:
        basename = file_name.split("/")[-1].split(".")[0]
    else:
        basename = file_name.split(".")[0]
    return(basename)


def parse_kraken(infile, countlim):
    '''
    INPUT:
        infile (str): path to kraken report file
        countlim (int): lowest count threshold to report hit
    OUTPUT:
        read_dict (dict): key=taxid, value=readCount
        kmer_dict (dict): key=taxid, value=kmer duplicity (kmers / unique kmers)

    '''
    with open(infile, 'r') as f:
        read_dict = {}
        kmer_dict = {}
        csvreader = csv.reader(f, delimiter='\t')
        for line in csvreader:
            reads = int(line[1])
            if reads >= countlim:
                taxid = line[6]
                kmer = line[3]
                unique_kmer = line[4]
                try:
                    kmer_duplicity = float(kmer)/float(unique_kmer)
                except ZeroDivisionError:
                    kmer_duplicity = 0
                read_dict[taxid] = reads
                kmer_dict[taxid] = kmer_duplicity

        return(read_dict, kmer_dict)


def write_output(resdict, infile, outfile):
    with open(outfile, 'w') as f:
        basename = _get_basename(infile)
        f.write(f"TAXID,{basename}\n")
        for akey in resdict.keys():
            f.write(f"{akey},{resdict[akey]}\n")


if __name__ == '__main__':
    INFILE, COUNTLIM, readout, kmerout = _get_args()

    if not readout:
        read_outfile = _get_basename(INFILE)+".read_kraken_parsed.csv"
    else:
        read_outfile = readout
    if not kmerout:
        kmer_outfile = _get_basename(INFILE)+".kmer_kraken_parsed.csv"
    else:
        kmer_outfile = kmerout

    read_dict, kmer_dict = parse_kraken(infile=INFILE, countlim=COUNTLIM)
    write_output(resdict=read_dict, infile=INFILE, outfile=read_outfile)
    write_output(resdict=kmer_dict, infile=INFILE, outfile=kmer_outfile)
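
The parsing step above can be sketched on an in-memory report. This is a minimal illustration, not the pipeline's code: the two report rows, the taxon names and the helper name `parse_report` are hypothetical, and the rows are only laid out to match the column indices the parser reads (1 = reads, 3 = kmers, 4 = unique kmers, 6 = taxid).

```python
import csv
import io

# Hypothetical tab-separated rows matching the indices used by the parser:
# column 1 = reads, column 3 = kmers, column 4 = unique kmers, column 6 = taxid.
EXAMPLE_REPORT = (
    "50.0\t120\t10\t600\t300\tS\t562\tEscherichia coli\n"
    "0.1\t3\t3\t12\t12\tS\t1280\tStaphylococcus aureus\n"
)


def parse_report(handle, countlim):
    """Return (read counts, kmer duplicity) per taxid for rows passing countlim."""
    read_counts = {}
    kmer_duplicity = {}
    for row in csv.reader(handle, delimiter="\t"):
        reads = int(row[1])
        if reads < countlim:
            continue  # clade has too few reads to be reported
        taxid = row[6]
        try:
            duplicity = float(row[3]) / float(row[4])
        except ZeroDivisionError:
            duplicity = 0
        read_counts[taxid] = reads
        kmer_duplicity[taxid] = duplicity
    return read_counts, kmer_duplicity


read_counts, kmer_duplicity = parse_report(io.StringIO(EXAMPLE_REPORT), countlim=50)
print(read_counts)
print(kmer_duplicity)
```

Only the first row passes the threshold of 50 reads, so each dictionary holds a single taxid.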


================================================
FILE: bin/markdown_to_html.py
================================================
#!/usr/bin/env python
from __future__ import print_function
import argparse
import markdown
import os
import sys
import io


def convert_markdown(in_fn):
    # Use a context manager so the input file handle is closed after reading
    with io.open(in_fn, mode="r", encoding="utf-8") as md_file:
        input_md = md_file.read()
    html = markdown.markdown(
        "[TOC]\n" + input_md,
        extensions=["pymdownx.extra", "pymdownx.b64", "pymdownx.highlight", "pymdownx.emoji", "pymdownx.tilde", "toc"],
        extension_configs={
            "pymdownx.b64": {"base_path": os.path.dirname(in_fn)},
            "pymdownx.highlight": {"noclasses": True},
            "toc": {"title": "Table of Contents"},
        },
    )
    return html


def wrap_html(contents):
    header = """<!DOCTYPE html><html>
    <head>
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
        <style>
            body {
              font-family: -apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,"Helvetica Neue",Arial,"Noto Sans",sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol","Noto Color Emoji";
              padding: 3em;
              margin-right: 350px;
              max-width: 100%;
            }
            .toc {
              position: fixed;
              right: 20px;
              width: 300px;
              padding-top: 20px;
              overflow: scroll;
              height: calc(100% - 3em - 20px);
            }
            .toctitle {
              font-size: 1.8em;
              font-weight: bold;
            }
            .toc > ul {
              padding: 0;
              margin: 1rem 0;
              list-style-type: none;
            }
            .toc > ul ul { padding-left: 20px; }
            .toc > ul > li > a { display: none; }
            img { max-width: 800px; }
            pre {
              padding: 0.6em 1em;
            }
            h2 {

            }
        </style>
    </head>
    <body>
    <div class="container">
    """
    footer = """
    </div>
    </body>
    </html>
    """
    return header + contents + footer


def parse_args(args=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("mdfile", type=argparse.FileType("r"), nargs="?", help="File to convert. Defaults to stdin.")
    parser.add_argument(
        "-o", "--out", type=argparse.FileType("w"), default=sys.stdout, help="Output file name. Defaults to stdout."
    )
    return parser.parse_args(args)


def main(args=None):
    args = parse_args(args)
    converted_md = convert_markdown(args.mdfile.name)
    html = wrap_html(converted_md)
    args.out.write(html)


if __name__ == "__main__":
    sys.exit(main())


================================================
FILE: bin/merge_kraken_res.py
================================================
#!/usr/bin/env python

# Written by Maxime Borry and released under the MIT license. 
# See git repository (https://github.com/nf-core/eager) for full license text.

import argparse
import os
import pandas as pd
import numpy as np

def _get_args():
    '''This function parses and returns arguments passed in'''
    parser = argparse.ArgumentParser(
        prog='merge_kraken_res',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description='Merging csv count files in one table')
    parser.add_argument(
        '-or',
        dest="readout",
        default="kraken_read_count_table.csv",
        help="Read count output file. Default = kraken_read_count_table.csv")
    parser.add_argument(
        '-ok',
        dest="kmerout",
        default="kraken_kmer_unicity_table.csv",
        help="Kmer unicity output file. Default = kraken_kmer_unicity_table.csv")

    args = parser.parse_args()

    readout = args.readout
    kmerout = args.kmerout

    return(readout, kmerout)


def get_csv():
    tmp = [i for i in os.listdir() if ".csv" in i]
    kmer = [i for i in tmp if '.kmer_' in i]
    read = [i for i in tmp if '.read_' in i]
    return(read, kmer)


def _get_basename(file_name):
    if ("/") in file_name:
        basename = file_name.split("/")[-1].split(".")[0]
    else:
        basename = file_name.split(".")[0]
    return(basename)


def merge_csv(all_csv):
    df = pd.read_csv(all_csv[0], index_col=0)
    for i in range(1, len(all_csv)):
        df_tmp = pd.read_csv(all_csv[i], index_col=0)
        df = pd.merge(left=df, right=df_tmp, on='TAXID', how='outer')
    df.fillna(0, inplace=True)
    return(df)


def write_csv(pd_dataframe, outfile):
    pd_dataframe.to_csv(outfile)


if __name__ == "__main__":
    READOUT, KMEROUT = _get_args()
    reads, kmers = get_csv()
    read_df = merge_csv(reads)
    kmer_df = merge_csv(kmers)
    write_csv(read_df, READOUT)
    write_csv(kmer_df, KMEROUT)

================================================
FILE: bin/parse_snp_cov.py
================================================
#!/usr/bin/env python3

# Written by Thiseas C. Lamnidis and released under the MIT license. 
# See git repository (https://github.com/nf-core/eager) for full license text.

import sys, json
from collections import OrderedDict

jsonOut = OrderedDict()
data = OrderedDict()


with open(sys.argv[1], 'r') as infile:
  for line in infile:
    fields = line.strip().split()
    ## Skip blank lines and comment/header lines before reading any columns
    if not fields or fields[0][0] == "#":
      continue
    sample_id = fields[0]
    covered_snps = fields[1]
    total_snps = fields[2]

    data[sample_id] = {"Covered_Snps":covered_snps, "Total_Snps":total_snps}

jsonOut = {"plot_type": "generalstats", "id": "snp_coverage",
    "pconfig": {
        "Covered_Snps" : {"title" : "#SNPs Covered"},
        "Total_Snps" : {"title": "#SNPs Total"}
    }, 
    "data" : data
}

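# Drop a trailing '.txt' extension, if present, before appending the MultiQC suffix
out_prefix = sys.argv[1][:-4] if sys.argv[1].endswith('.txt') else sys.argv[1]
with open(out_prefix+'_mqc.json', 'w') as outfile:
    json.dump(jsonOut, outfile)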
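
When deriving an output name from an input path, note that `str.rstrip` removes a set of trailing characters rather than a literal suffix. The sketch below contrasts the two behaviours; the helper name `strip_suffix` and the file names are hypothetical and only serve the illustration.

```python
def strip_suffix(name, suffix=".txt"):
    """Remove `suffix` from the end of `name` only if it is literally present."""
    if name.endswith(suffix):
        return name[: -len(suffix)]
    return name


# rstrip treats its argument as a character set: '.', 't' and 'x' are all
# stripped from the right until a different character is reached.
stripped_by_charset = "test.txt".rstrip(".txt")
stripped_by_suffix = strip_suffix("test.txt")

print(stripped_by_charset)
print(stripped_by_suffix)
print(strip_suffix("coverage.tsv"))
```

The two approaches agree whenever the stem does not end in one of the stripped characters, which is why the difference is easy to miss.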


================================================
FILE: bin/print_x_contamination.py
================================================
#!/usr/bin/env python3

# Written by Thiseas C. Lamnidis and released under the MIT license. 
# See git repository (https://github.com/nf-core/eager) for full license text.

import sys, re, json
from collections import OrderedDict

jsonOut=OrderedDict()
data=OrderedDict()

## Function to convert a set of elements into floating point numbers, when possible, else leave them be.
def make_float(x):
    output=[None for i in range(len(x))]
    ## If value for an estimate/error is -nan or nan, replace with "N/A". JSON does not accept NaN as a valid field.
    for i in range(len(x)):
        if x[i] == "-nan" or x[i] == "nan":
            output[i]="N/A"
            continue
        try:
            output[i]=float(x[i])
        except (TypeError, ValueError):
            ## Not a number (e.g. "N/A"): keep the original value
            output[i]=x[i]

    return(tuple(output))


Input_files=sys.argv[1:]

output = open("nuclear_contamination.txt", 'w')
print ("Individual", "Num_SNPs", "Method1_MOM_estimate", "Method1_MOM_SE", "Method1_ML_estimate", "Method1_ML_SE", "Method2_MOM_estimate", "Method2_MOM_SE", "Method2_ML_estimate", "Method2_ML_SE", sep="\t", file=output)
for fn in Input_files:
    ## For each file, reset the values to "N/A" so they don't carry over from last file.
    mom1, err_mom1= "N/A","N/A"
    ml1, err_ml1="N/A","N/A"
    mom2, err_mom2= "N/A","N/A"
    ml2, err_ml2="N/A","N/A"
    nSNPs="0"
    with open(fn, 'r') as f:
        Estimates={}
        Ind=re.sub(r'\.X\.contamination\.out$', '', fn).split("/")[-1]
        for line in f:
            fields=line.strip().split()
            if line.strip()[0:19] == "We have nSNP sites:":
                nSNPs=fields[4].rstrip(",")
            elif line.strip()[0:7] == "Method1" and line.strip()[9:16] == 'new_llh':
                mom1=fields[3].split(":")[1]
                err_mom1=fields[4].split(":")[1]
                ml1=fields[5].split(":")[1]
                err_ml1=fields[6].split(":")[1]
                ## Sometimes angsd fails to run method 2, and the error is printed directly after the SE for ML. When that happens, exclude the first word in the error from the output. (Method 2 jsonOut will be shown as NA)
                if err_ml1.endswith("contamination"):
                    err_ml1 = err_ml1[:-13]
            elif line.strip()[0:7] == "Method2" and line.strip()[9:16] == 'new_llh':
                mom2=fields[3].split(":")[1]
                err_mom2=fields[4].split(":")[1]
                ml2=fields[5].split(":")[1]
                err_ml2=fields[6].split(":")[1]
        ## Convert estimates and errors to floating point numbers
        (ml1, err_ml1, mom1, err_mom1, ml2, err_ml2, mom2, err_mom2) = make_float((ml1, err_ml1, mom1, err_mom1, ml2, err_ml2, mom2, err_mom2))
        data[Ind]={ "Num_SNPs" : int(nSNPs), "Method1_MOM_estimate" : mom1, "Method1_MOM_SE" : err_mom1, "Method1_ML_estimate" : ml1, "Method1_ML_SE" : err_ml1, "Method2_MOM_estimate" : mom2, "Method2_MOM_SE" : err_mom2, "Method2_ML_estimate" : ml2, "Method2_ML_SE" : err_ml2 }
        print (Ind, nSNPs, mom1, err_mom1, ml1, err_ml1, mom2, err_mom2, ml2, err_ml2, sep="\t", file=output)

## Close the table so that all rows are flushed to disk
output.close()

jsonOut = {"plot_type": "generalstats", "id": "nuclear_contamination",
    "pconfig": {
        "Num_SNPs" : {"title" : "Number of SNPs"},
        "Method1_MOM_estimate" : {"title": "Contamination Estimate (Method1_MOM)"},
        "Method1_MOM_SE" : {"title": "Estimate Error (Method1_MOM)"},
        "Method1_ML_estimate" : {"title": "Contamination Estimate (Method1_ML)"},
        "Method1_ML_SE" : {"title": "Estimate Error (Method1_ML)"},
        "Method2_MOM_estimate" : {"title": "Contamination Estimate (Method2_MOM)"},
        "Method2_MOM_SE" : {"title": "Estimate Error (Method2_MOM)"},
        "Method2_ML_estimate" : {"title": "Contamination Estimate (Method2_ML)"},
        "Method2_ML_SE" : {"title": "Estimate Error (Method2_ML)"}
    }, 
    "data" : data
}
with open('nuclear_contamination_mqc.json', 'w') as outfile:
    json.dump(jsonOut, outfile)
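
The reason for replacing NaN values before dumping can be sketched as follows. This is an illustrative stand-alone version, not the script's function: the helper name `to_json_safe` and the input values are hypothetical.

```python
import json


def to_json_safe(values):
    """Convert numeric strings to floats, mapping nan/-nan to the string 'N/A'."""
    safe = []
    for value in values:
        if value in ("nan", "-nan"):
            safe.append("N/A")
            continue
        try:
            safe.append(float(value))
        except ValueError:
            safe.append(value)  # leave non-numeric placeholders untouched
    return tuple(safe)


converted = to_json_safe(("0.0123", "-nan", "N/A"))
print(json.dumps(converted))

# Python's json module would otherwise emit the bare token NaN,
# which is not part of the JSON standard.
print(json.dumps(float("nan")))
```

Mapping NaN to a string up front keeps the resulting file readable by strict JSON parsers.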


================================================
FILE: bin/scrape_software_versions.py
================================================
#!/usr/bin/env python
from __future__ import print_function
from collections import OrderedDict
import re

regexes = {
    "nf-core/eager": ["v_pipeline.txt", r"(\S+)"],
    "Nextflow": ["v_nextflow.txt", r"(\S+)"],
    "FastQC": ["v_fastqc.txt", r"FastQC v(\S+)"],
    "MultiQC": ["v_multiqc.txt", r"multiqc, version (\S+)"],
    'AdapterRemoval':['v_adapterremoval.txt', r"AdapterRemoval ver. (\S+)"],
    'Picard MarkDuplicates': ['v_markduplicates.txt', r"Version:(\S+)"],
    'Samtools': ['v_samtools.txt', r"samtools (\S+)"],
    'Preseq': ['v_preseq.txt', r"Version: (\S+)"],
    'BWA': ['v_bwa.txt', r"Version: (\S+)"], 
    'Bowtie2': ['v_bowtie2.txt', r"bowtie2-([0-9]+\.[0-9]+\.[0-9]+) -fdebug"],
    'Qualimap': ['v_qualimap.txt', r"QualiMap v.(\S+)"],
    'GATK HaplotypeCaller': ['v_gatk.txt', r"The Genome Analysis Toolkit \(GATK\) v(\S+)"],
    'GATK UnifiedGenotyper': ['v_gatk3.txt', r"(\S+)"],
    'bamUtil' : ['v_bamutil.txt', r"Version: (\S+);"],
    'fastP': ['v_fastp.txt', r"([\d\.]+)"],
    'DamageProfiler' : ['v_damageprofiler.txt', r"DamageProfiler v(\S+)"],
    'angsd':['v_angsd.txt',r"version: (\S+)"],
    'bedtools':['v_bedtools.txt',r"bedtools v(\S+)"],
    'circulargenerator':['v_circulargenerator.txt',r"CircularGeneratorv(\S+)"],
    'DeDup':['v_dedup.txt',r"DeDup v(\S+)"],
    'freebayes':['v_freebayes.txt',r"v([0-9]\S+)"],
    'sequenceTools':['v_sequencetools.txt',r"(\S+)"],
    'maltextract':['v_maltextract.txt', r"version(\S+)"],
    'malt':['v_malt.txt',r"version (\S+)"],
    'multivcfanalyzer':['v_multivcfanalyzer.txt', r"MultiVCFAnalyzer - (\S+)"],
    'pmdtools':['v_pmdtools.txt',r"pmdtools v(\S+)"],
    'sexdeterrmine':['v_sexdeterrmine.txt',r"(\S+)"],
    'MTNucRatioCalculator':['v_mtnucratiocalculator.txt',r"Version: (\S+)"],
    'VCF2genome':['v_vcf2genome.txt', r"VCF2Genome \(v. ([0-9].[0-9]+) "],
    'endorS.py':['v_endorSpy.txt', r"endorS.py (\S+)"],
    'kraken':['v_kraken.txt', r"Kraken version (\S+)"],
    'eigenstrat_snp_coverage':['v_eigenstrat_snp_coverage.txt',r"(\S+)"],
    'mapDamage2':['v_mapdamage.txt',r"(\S+)"],
    'bbduk':['v_bbduk.txt',r"(.*)"],
    'bcftools':['v_bcftools.txt',r"(\S+)"]
}

results = OrderedDict()
results["nf-core/eager"] = '<span style="color:#999999;">N/A</span>'
results["Nextflow"] = '<span style="color:#999999;">N/A</span>'
results["FastQC"] = '<span style="color:#999999;">N/A</span>'
results["MultiQC"] = '<span style="color:#999999;">N/A</span>'
results['AdapterRemoval'] = '<span style="color:#999999;\">N/A</span>'
results['fastP'] = '<span style="color:#999999;\">N/A</span>'
results['BWA'] = '<span style="color:#999999;\">N/A</span>'
results['Bowtie2'] = '<span style="color:#999999;\">N/A</span>'
results['circulargenerator'] = '<span style="color:#999999;\">N/A</span>'
results['Samtools'] = '<span style="color:#999999;\">N/A</span>'
results['endorS.py'] = '<span style="color:#999999;\">N/A</span>'
results['DeDup'] = '<span style="color:#999999;\">N/A</span>'
results['Picard MarkDuplicates'] = '<span style="color:#999999;\">N/A</span>'
results['Qualimap'] = '<span style="color:#999999;\">N/A</span>'
results['Preseq'] = '<span style="color:#999999;\">N/A</span>'
results['GATK HaplotypeCaller'] = '<span style="color:#999999;\">N/A</span>'
results['GATK UnifiedGenotyper'] = '<span style="color:#999999;\">N/A</span>'
results['freebayes'] = '<span style="color:#999999;\">N/A</span>'
results['sequenceTools'] = '<span style="color:#999999;\">N/A</span>'
results['VCF2genome'] = '<span style="color:#999999;\">N/A</span>'
results['MTNucRatioCalculator'] = '<span style="color:#999999;\">N/A</span>'
results['bedtools'] = '<span style="color:#999999;\">N/A</span>'
results['DamageProfiler'] = '<span style="color:#999999;\">N/A</span>'
results['bamUtil'] = '<span style="color:#999999;\">N/A</span>'
results['pmdtools'] = '<span style="color:#999999;\">N/A</span>'
results['angsd'] = '<span style="color:#999999;\">N/A</span>'
results['sexdeterrmine'] = '<span style="color:#999999;\">N/A</span>'
results['multivcfanalyzer'] = '<span style="color:#999999;\">N/A</span>'
results['malt'] = '<span style="color:#999999;\">N/A</span>'
results['kraken'] = '<span style="color:#999999;\">N/A</span>'
results['maltextract'] = '<span style="color:#999999;\">N/A</span>'
results['eigenstrat_snp_coverage'] = '<span style="color:#999999;\">N/A</span>'
results['mapDamage2'] = '<span style="color:#999999;\">N/A</span>'
results['bbduk'] = '<span style="color:#999999;\">N/A</span>'
results['bcftools'] = '<span style="color:#999999;\">N/A</span>'

# Search each file using its regex
for k, v in regexes.items():
    try:
        with open(v[0]) as x:
            versions = x.read()
            match = re.search(v[1], versions)
            if match:
                results[k] = "v{}".format(match.group(1))
    except IOError:
        results[k] = False

# Remove software set to false in results
for k in list(results):
    if not results[k]:
        del results[k]

# Dump to YAML
print(
    """
id: 'software_versions'
section_name: 'nf-core/eager Software Versions'
section_href: 'https://github.com/nf-core/eager'
plot_type: 'html'
description: 'are collected at run time from the software output.'
data: |
    <dl class="dl-horizontal">
"""
)
for k, v in results.items():
    print("        <dt>{}</dt><dd><samp>{}</samp></dd>".format(k, v))
print("    </dl>")

# Write out the collected software versions as a tab-separated file:
with open("software_versions.csv", "w") as f:
    for k, v in results.items():
        f.write("{}\t{}\n".format(k, v))
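
The per-tool regex lookup above can be tried on in-memory strings. The sketch below is a minimal illustration: the captured outputs are hypothetical (real version strings may differ), the helper name `extract_versions` is made up, and only the three patterns are copied from the table above.

```python
import re

# Hypothetical captured tool output paired with the pattern used for that tool.
EXAMPLE_OUTPUTS = {
    "FastQC": ("FastQC v0.11.9\n", r"FastQC v(\S+)"),
    "Samtools": ("samtools 1.12\nUsing htslib 1.12\n", r"samtools (\S+)"),
    "Preseq": ("no version line here\n", r"Version: (\S+)"),
}


def extract_versions(outputs):
    """Return 'v<version>' for every tool whose pattern matches, else 'N/A'."""
    versions = {}
    for tool, (text, pattern) in outputs.items():
        match = re.search(pattern, text)
        versions[tool] = "v{}".format(match.group(1)) if match else "N/A"
    return versions


versions = extract_versions(EXAMPLE_OUTPUTS)
for tool, version in versions.items():
    print("{}\t{}".format(tool, version))
```

A tool whose output does not match its pattern keeps the placeholder, mirroring the N/A default used for the report.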


================================================
FILE: conf/base.config
================================================
/*
 * -------------------------------------------------
 *  nf-core/eager Nextflow base config file
 * -------------------------------------------------
 * A 'blank slate' config file, appropriate for general
 * use on most high performance compute environments.
 * Assumes that all software is installed and available
 * on the PATH. Runs in `local` mode - all jobs will be
 * run on the logged in environment.
 */

process {
  cpus = { check_max( 1 * task.attempt, 'cpus' ) }
  memory = { check_max( 7.GB * task.attempt, 'memory' ) }
  time = { check_max( 24.h * task.attempt, 'time' ) }

  errorStrategy = { task.exitStatus in [143,137,104,134,139, 140] ? 'retry' : 'finish' }
  maxRetries = 3
  maxErrors = '-1'

  // Process-specific resource requirements
  // NOTE - Only one of the labels below is used in the fastqc process in the main script.
  //        If possible, it would be nice to keep the same label naming convention when
  //        adding in your processes.
  // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors

  // Generic resource requirements - s(ingle)c(ore)/m(ulti)c(ore)

  withLabel:'sc_tiny'{
      cpus = { check_max( 1, 'cpus' ) }
      memory = { check_max( 1.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
  }

  withLabel:'sc_small'{
      cpus = { check_max( 1, 'cpus' ) }
      memory = { check_max( 4.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
  }

  withLabel:'sc_medium'{
      cpus = { check_max( 1, 'cpus' ) }
      memory = { check_max( 8.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
  }

  withLabel:'mc_small'{
      cpus = { check_max( 2 * task.attempt, 'cpus' ) }
      memory = { check_max( 4.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
  }

  withLabel:'mc_medium' {
      cpus = { check_max( 4 * task.attempt, 'cpus' ) }
      memory = { check_max( 8.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
  }

  withLabel:'mc_large'{
      cpus = { check_max( 8 * task.attempt, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
  }

  withLabel:'mc_huge'{
      cpus = { check_max( 32 * task.attempt, 'cpus' ) }
      memory = { check_max( 256.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
  }

  // Process-specific resource requirements (others leave at default, e.g. Fastqc)
  withName:get_software_versions {
    cache = false
  }

  withName:qualimap{
    errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : task.exitStatus in [255] ? 'ignore' : 'finish' }
  }

  withName:preseq {
    errorStrategy = 'ignore'
  }

  withName:damageprofiler {
    errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : 'finish' }
  }

  // Also retry on exit code 1 for certain Java tools, as Java out-of-heap-space errors return exit code 1
  withName: dedup {
    errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : 'finish' }
  }

  withName: markduplicates {
    errorStrategy = { task.exitStatus in [143,137, 140] ? 'retry' : 'finish' }
  }

  // Also retry on exit code 1, as Java out-of-heap-space errors return exit code 1
  withName: malt {
    errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : 'finish' }
  }

  // other process specific exit statuses
  withName: nuclear_contamination {
    errorStrategy = { task.exitStatus in [143,137,104,134,139, 140] ? 'ignore' : 'retry' }
  }

}

params {
  // Defaults only, expecting to be overwritten
  max_memory = 128.GB
  max_cpus = 16
  max_time = 240.h
  igenomes_base = 's3://ngi-igenomes/igenomes/'
}
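
How the per-attempt resource requests interact with the maxima can be sketched in Python. This is an assumption-laden illustration: `check_max` is not defined in this file, so the sketch assumes it simply caps a request at the matching `max_*` parameter, and the function names below are hypothetical.

```python
# Default maxima copied from the params block; real runs usually override them.
MAX_MEMORY_GB = 128
MAX_CPUS = 16


def capped(requested, maximum):
    """Assumed behaviour of check_max: never request more than the maximum."""
    return min(requested, maximum)


def mc_large_request(attempt):
    """(cpus, memory in GB) requested by the 'mc_large' label on a given attempt."""
    return capped(8 * attempt, MAX_CPUS), capped(16 * attempt, MAX_MEMORY_GB)


def mc_huge_request(attempt):
    """(cpus, memory in GB) requested by the 'mc_huge' label on a given attempt."""
    return capped(32 * attempt, MAX_CPUS), capped(256 * attempt, MAX_MEMORY_GB)


for attempt in (1, 2, 3):
    print(attempt, mc_large_request(attempt), mc_huge_request(attempt))
```

Under these assumptions, the `mc_huge` label is already limited by the default maxima on the first attempt, so retries would not increase its request unless the maxima are raised.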


================================================
FILE: conf/benchmarking_human.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

params {
   config_profile_name = 'nf-core/eager benchmarking - human profile'
   config_profile_description = "A 'fullsized' benchmarking profile for deepish Human sequencing aDNA data" 

   //Input data
   input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_human.tsv'
   // Genome reference
   fasta = 'https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz'

   run_bam_filtering = true
   bam_unmapped_type = 'discard'
   bam_mapping_quality_threshold = 30

   dedupper = 'markduplicates'
  
   run_trim_bam = true
   bamutils_clip_double_stranded_none_udg_left = 1
   bamutils_clip_double_stranded_none_udg_right = 1
   
   // JAR will need to be downloaded first!
   run_genotyping = true
   genotyping_tool = 'ug'
   genotyping_source = 'trimmed'
   gatk_call_conf = 20

   run_sexdeterrmine = true
   sexdeterrmine_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_HG19.0based.bed.gz'

   run_nuclear_contamination = true
   contamination_chrom_name = 'chrX'

   run_mtnucratio = true
}

process {
   withName:'makeBWAIndex'{
      time = { check_max( 4.h * task.attempt, 'time' ) }
   }
   withName:'adapter_removal'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 2.h * task.attempt, 'time' ) }
   }
   withName:'bwa'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
   }
   withName:'markDup'{
      cpus = { check_max( 16, 'cpus' ) }
      memory = { check_max( 64.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
   }
   withName:'damageprofiler'{
      cpus = 1
      memory = { check_max( 8.GB * task.attempt, 'memory' ) }
      time = { check_max( 2.h * task.attempt, 'time' ) }
   }
}


================================================
FILE: conf/benchmarking_vikingfish.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

params {
   config_profile_name = 'nf-core/eager benchmarking - Viking Fish profile'
   config_profile_description = "A 'fullsized' benchmarking profile for deepish sequencing aDNA data" 
   
   //Input data
   input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv'   
   // Genome reference
   fasta = 's3://nf-core-awsmegatests/eager/ENA_Data_Fish/GCF_902167405.1_gadMor3.0_genomic.fna.gz'
   
   bwaalnn = 0.04
   bwaalnl = 1024
   
   run_bam_filtering = true
   bam_unmapped_type = 'discard'
   bam_mapping_quality_threshold = 25
     
   run_genotyping = true
   genotyping_tool = 'hc'
   genotyping_source = 'raw'
   gatk_ploidy = 2
}

process {
   withName:'adapter_removal'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 2.h * task.attempt, 'time' ) }
   }
   withName:'bwa'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 8.h * task.attempt, 'time' ) }
   }
   withName:'dedup'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
   }
   withName:'genotyping_hc'{
     cpus = { check_max( 8, 'cpus' ) }
     memory = { check_max( 16.GB * task.attempt, 'memory' ) }
     time = { check_max( 8.h * task.attempt, 'time' ) }
   }

}


================================================
FILE: conf/igenomes.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for iGenomes paths
 * -------------------------------------------------
 * Defines reference genomes, using iGenome paths
 * Can be used by any config that customises the base
 * path using $params.igenomes_base / --igenomes_base
 */

params {
  // illumina iGenomes reference file paths
  genomes {
    'GRCh37' {
      fasta       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt"
      mito_name   = "MT"
      macs_gsize  = "2.7e9"
      blacklist   = "${projectDir}/assets/blacklists/GRCh37-blacklist.bed"
    }
    'GRCh38' {
      fasta       = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed"
      mito_name   = "chrM"
      macs_gsize  = "2.7e9"
      blacklist   = "${projectDir}/assets/blacklists/hg38-blacklist.bed"
    }
    'GRCm38' {
      fasta       = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/README.txt"
      mito_name   = "MT"
      macs_gsize  = "1.87e9"
      blacklist   = "${projectDir}/assets/blacklists/GRCm38-blacklist.bed"
    }
    'TAIR10' {
      fasta       = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/README.txt"
      mito_name   = "Mt"
    }
    'EB2' {
      fasta       = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/README.txt"
    }
    'UMD3.1' {
      fasta       = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/README.txt"
      mito_name   = "MT"
    }
    'WBcel235' {
      fasta       = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.bed"
      mito_name   = "MtDNA"
      macs_gsize  = "9e7"
    }
    'CanFam3.1' {
      fasta       = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/README.txt"
      mito_name   = "MT"
    }
    'GRCz10' {
      fasta       = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.bed"
      mito_name   = "MT"
    }
    'BDGP6' {
      fasta       = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.bed"
      mito_name   = "M"
      macs_gsize  = "1.2e8"
    }
    'EquCab2' {
      fasta       = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/README.txt"
      mito_name   = "MT"
    }
    'EB1' {
      fasta       = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/README.txt"
    }
    'Galgal4' {
      fasta       = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.bed"
      mito_name   = "MT"
    }
    'Gm01' {
      fasta       = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/README.txt"
    }
    'Mmul_1' {
      fasta       = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/README.txt"
      mito_name   = "MT"
    }
    'IRGSP-1.0' {
      fasta       = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.bed"
      mito_name   = "Mt"
    }
    'CHIMP2.1.4' {
      fasta       = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/README.txt"
      mito_name   = "MT"
    }
    'Rnor_6.0' {
      fasta       = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.bed"
      mito_name   = "MT"
    }
    'R64-1-1' {
      fasta       = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed"
      mito_name   = "MT"
      macs_gsize  = "1.2e7"
    }
    'EF2' {
      fasta       = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/README.txt"
      mito_name   = "MT"
      macs_gsize  = "1.21e7"
    }
    'Sbi1' {
      fasta       = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/README.txt"
    }
    'Sscrofa10.2' {
      fasta       = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/README.txt"
      mito_name   = "MT"
    }
    'AGPv3' {
      fasta       = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.bed"
      mito_name   = "Mt"
    }
    'hg38' {
      fasta       = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed"
      mito_name   = "chrM"
      macs_gsize  = "2.7e9"
      blacklist   = "${projectDir}/assets/blacklists/hg38-blacklist.bed"
    }
    'hg19' {
      fasta       = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/README.txt"
      mito_name   = "chrM"
      macs_gsize  = "2.7e9"
      blacklist   = "${projectDir}/assets/blacklists/hg19-blacklist.bed"
    }
    'mm10' {
      fasta       = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/README.txt"
      mito_name   = "chrM"
      macs_gsize  = "1.87e9"
      blacklist   = "${projectDir}/assets/blacklists/mm10-blacklist.bed"
    }
    'bosTau8' {
      fasta       = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.bed"
      mito_name   = "chrM"
    }
    'ce10' {
      fasta       = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/README.txt"
      mito_name   = "chrM"
      macs_gsize  = "9e7"
    }
    'canFam3' {
      fasta       = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/README.txt"
      mito_name   = "chrM"
    }
    'danRer10' {
      fasta       = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.bed"
      mito_name   = "chrM"
      macs_gsize  = "1.37e9"
    }
    'dm6' {
      fasta       = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.bed"
      mito_name   = "chrM"
      macs_gsize  = "1.2e8"
    }
    'equCab2' {
      fasta       = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/README.txt"
      mito_name   = "chrM"
    }
    'galGal4' {
      fasta       = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/README.txt"
      mito_name   = "chrM"
    }
    'panTro4' {
      fasta       = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/README.txt"
      mito_name   = "chrM"
    }
    'rn6' {
      fasta       = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.bed"
      mito_name   = "chrM"
    }
    'sacCer3' {
      fasta       = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BismarkIndex/"
      readme      = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Annotation/README.txt"
      mito_name   = "chrM"
      macs_gsize  = "1.2e7"
    }
    'susScr3' {
      fasta       = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.bed"
      readme      = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/README.txt"
      mito_name   = "chrM"
    }
  }
}
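
// Usage sketch (not part of the original file). Assumption: as in the standard
// nf-core template, the main nextflow.config or main.nf resolves entries from
// this map using the `--genome` key, roughly like the commented line below.
//
//   params.fasta = params.genome ? params.genomes[ params.genome ].fasta ?: false : false
//
// Under that assumption, `--genome hg19` would point `fasta` at
// "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa",
// while `genome = false` (as set in conf/test.config) skips the lookup and
// leaves `fasta` to be supplied directly.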


================================================
FILE: conf/test.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 *   nextflow run nf-core/eager -profile test,<docker/singularity/conda>
 */

includeConfig 'test_resources.config'

params {
  config_profile_name = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Limit resources so that this can run on GitHub Actions
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  //Input data
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq.tsv'
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
}
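
// Usage sketch (not part of the original file): parameters given on the command
// line take precedence over values set in a profile, so the limits above can be
// raised for a local run without editing this file. The values shown are
// illustrative only.
//
//   nextflow run nf-core/eager -profile test,docker --max_cpus 4 --max_memory '12.GB'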


================================================
FILE: conf/test_direct.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 *   nextflow run nf-core/eager -profile test_direct,<docker/singularity>
 */

includeConfig 'test_resources.config'


params {
  config_profile_name = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Limit resources so that this can run on GitHub Actions
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  //Input data
  single_end = false
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
  // Ignore `--input` as otherwise the parameter validation will throw an error
  schema_ignore_params = 'genomes,input_paths,input'
}


================================================
FILE: conf/test_full.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running full-size tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a full size pipeline test. Use as follows:
 *   nextflow run nf-core/eager -profile test_full,<docker/singularity>
 */

params {
  config_profile_name = 'Full test profile for nf-core/eager'
  config_profile_description = 'Full test dataset to check nf-core/eager function'

  // Input data for full size test
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv'

  // Genome reference
  fasta = 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gadus_morhua/representative/GCF_902167405.1_gadMor3.0/GCF_902167405.1_gadMor3.0_genomic.fna.gz'

  bwaalnn = 0.04
  bwaalnl = 1024

  run_bam_filtering = true
  bam_unmapped_type = 'discard'
  bam_mapping_quality_threshold = 25

  run_genotyping = true
  genotyping_tool = 'hc'
  genotyping_source = 'raw'
  gatk_ploidy = 2

  // Ignore `--input` as otherwise the parameter validation will throw an error.
  // This is a pipeline parameter, so it belongs in the params scope rather than
  // the process scope, where it would have no effect.
  schema_ignore_params = 'genomes,input_paths,input'
}

process {
   withName:'adapter_removal'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 2.h * task.attempt, 'time' ) }
   }
   withName:'bwa'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 8.h * task.attempt, 'time' ) }
   }
   withName:'dedup'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 4.h * task.attempt, 'time' ) }
   }
   withName:'genotyping_hc'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 8.h * task.attempt, 'time' ) }
   }
}


================================================
FILE: conf/test_resources.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines the base computing resources used across all CI tests (primarily the
 * time limit)
 */


process {

  withLabel:'sc_tiny'{
      cpus = { check_max( 1, 'cpus' ) }
      memory = { check_max( 1.GB * task.attempt, 'memory' ) }
      time = { check_max( 10.m * task.attempt, 'time' ) }
  }

  withLabel:'sc_small'{
      cpus = { check_max( 1, 'cpus' ) }
      memory = { check_max( 4.GB * task.attempt, 'memory' ) }
      time = { check_max( 10.m * task.attempt, 'time' ) }
  }

  withLabel:'sc_medium'{
      cpus = { check_max( 1, 'cpus' ) }
      memory = { check_max( 8.GB * task.attempt, 'memory' ) }
      time = { check_max( 10.m * task.attempt, 'time' ) }
  }

  withLabel:'mc_small'{
      cpus = { check_max( 2 * task.attempt, 'cpus' ) }
      memory = { check_max( 4.GB * task.attempt, 'memory' ) }
      time = { check_max( 10.m * task.attempt, 'time' ) }
  }

  withLabel:'mc_medium' {
      cpus = { check_max( 4 * task.attempt, 'cpus' ) }
      memory = { check_max( 8.GB * task.attempt, 'memory' ) }
      time = { check_max( 10.m * task.attempt, 'time' ) }
  }

  withLabel:'mc_large'{
      cpus = { check_max( 8 * task.attempt, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 10.m * task.attempt, 'time' ) }
  }

  withLabel:'mc_huge'{
      cpus = { check_max( 32 * task.attempt, 'cpus' ) }
      memory = { check_max( 256.GB * task.attempt, 'memory' ) }
      time = { check_max( 10.m * task.attempt, 'time' ) }
  }

  withName:'mapdamage_rescaling'{
      time = { check_max( 20.m * task.attempt, 'time' ) }
  }

}

================================================
FILE: conf/test_stresstest_human.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

params {
   config_profile_name = 'nf-core/eager stresstest - human profile'
   config_profile_description = "A large-scale benchmarking profile for AWS stress-testing of a study with a large number of samples"

   //Input data
   input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/human_stresstest.tsv'
   // Genome reference
   fasta = 'https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz'

   save_reference = true

   email = 'james@nf-co.re'

   run_mtnucratio = true
   mtnucratio_header = 'ChrM'

   run_bam_filtering = true
   bam_unmapped_type = 'discard'
   bam_mapping_quality_threshold = 30

   dedupper = 'markduplicates'
  
   run_sexdeterrmine = true
   sexdeterrmine_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_HG19.0based.bed.gz'

   run_nuclear_contamination = true
   contamination_chrom_name = 'chrX'

}

process {

   errorStrategy = 'retry'
   
   maxRetries = 5

   withName:'makeBWAIndex'{
      time = { check_max( 48.h * task.attempt, 'time' ) }
   }
   withName:'adapter_removal'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 48.h * task.attempt, 'time' ) }
   }
   withName:'bwa'{
      cpus = { check_max( 8, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 48.h * task.attempt, 'time' ) }
   }
   withName:'markduplicates'{
      errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
      cpus = { check_max( 16, 'cpus' ) }
      memory = { check_max( 16.GB * task.attempt, 'memory' ) }
      time = { check_max( 48.h * task.attempt, 'time' ) }
   }
   withName:'damageprofiler'{
      cpus = 1
      memory = { check_max( 8.GB * task.attempt, 'memory' ) }
      time = { check_max( 48.h * task.attempt, 'time' ) }
   }
}


================================================
FILE: conf/test_tsv_bam.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

includeConfig 'test_resources.config'

params {
  config_profile_name = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Limit resources so that this can run on GitHub Actions
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  //Input data
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_bam.tsv'
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
}

================================================
FILE: conf/test_tsv_complex.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

includeConfig 'test_resources.config'


params {
  config_profile_name = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Limit resources so that this can run on GitHub Actions
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  //Input data
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq_multilane_multilib.tsv'
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
}


================================================
FILE: conf/test_tsv_fna.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

includeConfig 'test_resources.config'

params {
  config_profile_name = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Limit resources so that this can run on GitHub Actions
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  //Input data
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq.tsv'
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fna'
}


================================================
FILE: conf/test_tsv_humanbam.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

includeConfig 'test_resources.config'

params {
  config_profile_name = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Limit resources so that this can run on GitHub Actions
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  //Input data
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Human/human_design_bam.tsv'
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
  sexdeterrmine_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz'
  // Genotyping
  pileupcaller_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz'
  pileupcaller_snpfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K_covered_in_JK2067_downsampled_s0.1.numeric_chromosomes.snp'
}


================================================
FILE: conf/test_tsv_kraken.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

includeConfig 'test_resources.config'

params {
  config_profile_name = 'Test profile kraken'
  config_profile_description = 'Minimal test dataset to check pipeline function with kraken metagenomic profiler'
  // Limit resources so that this can run on GitHub Actions
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  //Input data
  metagenomic_tool = 'kraken'
  run_metagenomic_screening = true
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq.tsv'
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
  database = 'https://github.com/nf-core/test-datasets/raw/eager/databases/kraken/eager_test.tar.gz'
}


================================================
FILE: conf/test_tsv_pretrim.config
================================================
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 * nextflow run nf-core/eager -profile test,docker (or singularity, or conda)
 */

includeConfig 'test_resources.config'

params {
  config_profile_name = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Limit resources so that this can run on GitHub Actions
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  //Input data
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/mammoth_design_fastq_pretrim.tsv'
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
}


================================================
FILE: docs/README.md
================================================
# nf-core/eager: Documentation

The nf-core/eager documentation is split into the following pages:

* [Usage](usage.md)
  * An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
  * Also includes: FAQ, Troubleshooting and Tutorials
* [Output](output.md)
  * An overview of the different results produced by the pipeline and how to interpret them.

You can find a lot more documentation about installing, configuring and running nf-core pipelines on the website: [https://nf-co.re](https://nf-co.re).

Additional pages are:

* [Installation](https://nf-co.re/usage/installation)
* Pipeline configuration
  * [Local installation](https://nf-co.re/usage/local_installation)
  * [Adding your own system config](https://nf-co.re/usage/adding_own_config)
  * [Reference genomes](https://nf-co.re/usage/reference_genomes)
* [Contribution Guidelines](../.github/CONTRIBUTING.md)
  * Basic contribution & behaviour guidelines
  * Checklists and guidelines for people who would like to contribute code
  

================================================
FILE: docs/images/README.md
================================================
# Documentation Images Information

The font used for all documentation images is Kalam by Indian Type Foundry, released under the [Open Font License](https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL).

Originally downloaded from [Google Fonts](https://fonts.google.com/specimen/Kalam?sidebar.open&selection.family=Kalam:wght@300;400;700)


================================================
FILE: docs/images/usage/nfcore-eager_tsv_template.tsv
================================================
Sample_Name	Library_ID	Lane	Colour_Chemistry	SeqType	Organism	Strandedness	UDG_Treatment	R1	R2	BAM


================================================
FILE: docs/output.md
================================================
# nf-core/eager: Output

## Introduction

The output of nf-core/eager consists primarily of two components: output data files (e.g. VCF, BAM or FASTQ files), and summary statistics of the whole run presented in a [`MultiQC`](https://multiqc.info) report. Intermediate files and module-specific statistics files are also retained, depending on your particular run configuration.

## Directory Structure

The default directory structure of nf-core/eager is as follows:

```bash
results/
├── MultiQC/
├── <MODULE_1>/
├── <MODULE_2>/
├── <MODULE_3>/
├── pipeline_info/
└── reference_genome/
work/
```

* `<RUN_OUTPUT_DIRECTORY>` is the parent directory of the run's output. It is placed either in the directory the pipeline was run from or at the location specified by the `--outdir` flag. Unless otherwise specified, the output directory is named `./results/`.

### Primary Output Directories

These directories are the ones you will use on a day-to-day basis and are those which you should familiarise yourself with.

* The `MultiQC` directory is the most important directory and contains the main summary report of the run in HTML format, which can be viewed in a web browser of your choice. The sub-directory contains the MultiQC collected data used to build the HTML report. The report allows you to get an overview of the sequencing and mapping quality as well as aDNA metrics (see the [MultiQC Report](#multiqc-report) section for more detail).
* A `<MODULE>` directory contains the (cleaned-up) output from a particular software module. This is the second most important set of directories. This contains output files such as FASTQ, BAM, statistics, and/or plot files of a specific module (see the [Output Files](#output-files) section for more detail). The latter two are only needed when you need finer detail about that particular part of the pipeline.
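
As a small illustration of the layout described above, the following sketch looks for HTML reports inside the `MultiQC` directory of a run. It is only a convenience example: the function name `find_multiqc_reports` is hypothetical, and it assumes the default `results/` output directory unless another path is passed in.

```python
from pathlib import Path


def find_multiqc_reports(outdir="results"):
    """Return the HTML files found directly inside <outdir>/MultiQC/, sorted by name.

    `outdir` is assumed to be the run output directory (default `./results/`,
    or whatever was passed to `--outdir`).
    """
    multiqc_dir = Path(outdir) / "MultiQC"
    if not multiqc_dir.is_dir():
        # The run has not finished, or a different output directory was used.
        return []
    return sorted(multiqc_dir.glob("*.html"))


if __name__ == "__main__":
    for report in find_multiqc_reports("results"):
        print(report)
```

An empty list simply means no report was found at that location, for example because the run used a different `--outdir`.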

### Secondary Output Directories

These are less important directories which are used less often, normally in the context of bug-reporting.

* `pipeline_info/`: [Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
  * Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
  * Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`.
  * Documentation for interpretation of results in HTML format: `results_description.html`.
* `reference_genome/` contains text files describing the location of the specified reference genomes and, if they were not already supplied when running the pipeline, auxiliary indexing files. This is often useful when re-running other samples using the same reference genome, but is otherwise often not important.
* The `work/` directory contains all the `nextflow` processing directories. This is where `nextflow` actually does all the work, but in an efficient programmatic procedure that is not intuitive to human readers. Because of this, the directory is often not important to a user, as all the useful output files are linked to the module directories (see above). Otherwise, this directory may be useful when reporting a bug.

> :warning: Note that `work/` will be created wherever you are running the `nextflow run` command from, unless you specify the location with `-w`, i.e. by default it will not be inside `outdir`!

## MultiQC Report

In this section we will run through the output of each **default** module as reported in a MultiQC output. This can be viewed by opening the HTML file in your `<RUN_OUTPUT_DIRECTORY>/MultiQC/` directory in a web browser. The section will also provide some basic tips on how to interpret the plots and values, although we highly recommend reading the READMEs or original papers of the tools used in the pipeline. A list of references can be seen on the [nf-core/eager GitHub repository](https://github.com/nf-core/eager/).

For more information about how to use MultiQC reports, see [http://multiqc.info](http://multiqc.info)

### General Stats Table

#### Background

This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than from stats produced by a specific module. It shows whatever each module considers to be the 'most important' values to display. However, the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks.

#### Table

This table will report per-file, per-library, or per-sample statistics depending on which stage of the pipeline you are looking at. MultiQC will try to collapse the rows as far as possible if the log files have the same name. However, separate libraries will be displayed separately, for example when using DamageProfiler with TSV input where merging of samples is performed (the merged result would be reported at the Qualimap level). If you're only interested in a single part of the results (e.g. Qualimap) you can use the `Configure Columns` button to remove columns, and the corresponding rows will not be displayed, resulting in a more compact table.

Each column name is supplied by the module, so you may see similar column names. When unsure, hovering over the column name will allow you to see which module it is derived from.

The possible columns displayed by default are as follows (note you may see additional columns depending on what other modules you activate):

* **Sample Name** This is the log file name without file suffix(s). This will depend on the module outputs.
* **Nr. Input Reads** This is from Pre-AdapterRemoval FastQC. Represents the number of raw reads in your untrimmed and (paired end) unmerged FASTQ file. Each row should be approximately equal to the number of reads you requested to be sequenced, divided by the number of FASTQ files you received for that library.
* **Length Input Reads** This is from Pre-AdapterRemoval FastQC. This is the average read length in your untrimmed and (paired end) unmerged FASTQ file and should represent the number of cycles of your sequencing chemistry.
* **% GC Input Reads** This is from Pre-AdapterRemoval FastQC. This is the average GC content in percent of all the reads in your untrimmed and (paired end) unmerged FASTQ file.
* **GC content** This is from FastP. This is the average GC content of all reads in your untrimmed and unmerged FASTQ file after poly-G tail trimming. If you have lots of tails, this value should drop from the pre-AdapterRemoval FastQC %GC column.
* **% Trimmed** This is from AdapterRemoval. It is the percentage of reads which had an adapter sequence removed from the end of the read.
* **Nr. Processed Reads** This is from Post-AdapterRemoval FastQC. Represents the number of preprocessed reads in your adapter trimmed (paired end) merged FASTQ file. The loss between this number and the Pre-AdapterRemoval FastQC can give you an idea of the quality of trimming and merging.
* **% GC Processed Reads** This is from Post-AdapterRemoval FastQC. Represents the average GC of all preprocessed reads in your adapter trimmed (paired end) merged FASTQ file.
* **Length Processed Reads** This is from post-AdapterRemoval FastQC. This is the average read length in your trimmed and (paired end) merged FASTQ file and should represent the 'realistic' average lengths of your DNA molecules.
* **% Aligned** This is from bowtie2. It reports the percentage of input reads that mapped to your reference genome. This number will likely be similar to Endogenous DNA % (see below).
* **% Metagenomic Mappability** This is from MALT. It reports the percentage of the off-target reads (from mapping), that could map to your MALT metagenomic database. This can often be low for aDNA due to short reads and database bias.
* **% Unclassified** This is from Kraken. It reports the percentage of reads that could not be aligned and taxonomically assigned against your Kraken metagenomic database. This can often be high for aDNA due to short reads and database bias.
* **Nr. Reads Into Mapping** This is from Samtools. This is the raw number of preprocessed reads that went into mapping.
* **Nr. Mapped Reads** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _prior to_ mapping quality filtering.
* **Endogenous DNA (%)** This is from the endorS.py tool. It displays the percentage of mapped reads over total reads that went into mapping (i.e. the percentage DNA content of the library that matches the reference). Assuming a perfect ancient sample with no modern contamination, this would be the amount of true ancient DNA in the sample. However, this value _most likely_ includes contamination and will not entirely be the true 'endogenous' content.
* **Nr. Mapped Reads Post-Filter** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _after_ mapping quality filtering (note the column name does not distinguish itself from the pre-filtering column, but the post-filter column is always second).
* **Endogenous DNA Post-Filter (%)** This is from the endorS.py tool. It displays the percentage of mapped reads _after_ BAM filtering (i.e. for mapping quality and/or BAM-level length filtering) over total reads that went into mapping (i.e. the percentage DNA content of the library that matches the reference). This column will only be displayed if BAM filtering is turned on. It is based on the original mapping for total reads, and on mapped reads as calculated from the post-filtering BAM.
* **ClusterFactor** This is from **DeDup only**. This is a value representing how many duplicates in the library exist for each unique read. This ratio is calculated as `reads_before_deduplication / reads_after_deduplication`. It can be converted to %Dups by calculating `1 - (1 / CF)`. A cluster factor close to one indicates a highly complex library that could be sequenced further. Generally, with a value of more than 2 you will not gain much more information by sequencing deeper.
* **% Dup. Mapped Reads** This is from **Picard's markDuplicates only**. It represents the percentage of reads in your library that were exact duplicates of other reads in your library. The lower the better, as high duplication rate means lots of sequencing of the same information (and therefore is not time or cost effective).
* **X Prime Y>Z N base** These columns are from DamageProfiler or mapDamage. The prime numbers represent which end of the reads the damage is referring to. The Y>Z is the type of substitution (C>T is the true damage, G>A is the complementary). You should see for no- and half-UDG treatment a decrease in frequency from the 1st to 2nd base.
* **Mean Length Mapped Reads** This is from DamageProfiler. This is the mean length of all de-duplicated mapped reads. Ancient DNA normally will have a mean between 30-75, however this can vary.
* **Median Length Mapped Reads** This is from DamageProfiler. This is the median length of all de-duplicated mapped reads. Ancient DNA normally will have a median between 30-75, however this can vary.
* **Nr. Dedup. Mapped Reads** This is from Qualimap. This is the total number of _deduplicated_ reads that mapped to your reference genome. This is the **best** number to report for final mapped reads in final publications.
* **Mean/Median Coverage** This is from Qualimap. This is the mean/median number of times a base on your reference genome was covered by a read (i.e. depth coverage). This average includes bases with 0 reads covering that position.
* **>= 1X** to **>= 5X** These are from Qualimap. This is the percentage of the genome covered at that particular depth coverage.
* **% GC Dedup. Mapped Reads** This is the mean GC content in percent of all mapped reads post-deduplication. This should normally be close to the GC content of your reference genome.
* **MT to Nuclear Ratio** This is from MTtoNucRatio. This reports the ratio of the number of reads aligned to a mitochondrial entry in your reference FASTA to the number aligned to all other entries. This will typically be high but will vary depending on tissue type.
* **SexDet Rate X Chr** This is from Sex.DetERRmine. This is the relative depth of coverage on the X-chromosome.
* **SexDet Rate Y Chr** This is from Sex.DetERRmine. This is the relative depth of coverage on the Y-chromosome.
* **#SNPs Covered** This is from eigenstrat\_snp\_coverage. The number of called SNPs after genotyping with pileupcaller.
* **#SNPs Total** This is from eigenstrat\_snp\_coverage. The maximum number of covered SNPs, i.e. the number of SNPs in the .snp file provided to pileupcaller with `--pileupcaller_snpfile`.
* **Number of SNPs** This is from ANGSD. The number of SNPs left after removing sites with no data in a 5 base pair surrounding region.
* **Contamination Estimate (Method1_ML)** This is from the nuclear contamination function of ANGSD. The Maximum Likelihood contamination estimate according to Method 1. The estimates using Method of Moments and/or those based on Method 2 can be unhidden through the "Configure Columns" button.
* **Estimate Error (Method1_ML)** This is from ANGSD. The standard error of the Method1 Maximum likelihood estimate. The errors associated with Method of Moments and/or Method2 estimates can be unhidden through the "Configure Columns" button.
* **% Hets** This is from MultiVCFAnalyzer. This reports the percentage of SNPs on an assumed haploid organism that have two possible alleles. A high percentage may indicate cross-mapping from a related species.

For other non-default columns (activated under 'Configure Columns'), hover over the column name for further descriptions.
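
Several of the columns above are simple ratios of read counts. The sketch below works through the formulas as they are described in this list (cluster factor, its conversion to a duplication percentage, and endogenous DNA percentage). The function names and the example counts are made up for illustration, and this is not the implementation used by DeDup or endorS.py.

```python
def cluster_factor(reads_before_dedup, reads_after_dedup):
    """ClusterFactor = reads_before_deduplication / reads_after_deduplication."""
    return reads_before_dedup / reads_after_dedup


def percent_duplicates(cf):
    """Convert a cluster factor to a duplication percentage: (1 - 1/CF) * 100."""
    return (1 - 1 / cf) * 100


def endogenous_dna_percent(mapped_reads, reads_into_mapping):
    """Percentage of mapped reads over all reads that went into mapping."""
    return mapped_reads / reads_into_mapping * 100


# Hypothetical library: 1,000,000 reads into mapping, 50,000 mapped,
# 40,000 remaining after deduplication.
cf = cluster_factor(50_000, 40_000)
print(f"{cf:.2f}")                                          # 1.25
print(f"{percent_duplicates(cf):.1f}")                      # 20.0
print(f"{endogenous_dna_percent(50_000, 1_000_000):.1f}")   # 5.0
```

In this made-up example the cluster factor of 1.25 is well below 2, which by the rule of thumb above would suggest the library could still yield new information if sequenced deeper.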

### FastQC

#### Background

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your raw reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C) as sequenced. You also get information about adapter contamination and other overrepresented sequences.

You will receive output for each supplied FASTQ file.

When dealing with ancient DNA data the MultiQC plots for FastQC will often show lots of 'warning' or 'failed' samples. You generally can discard this sort of information as we are dealing with very degraded and metagenomic samples which have artefacts that violate the FastQC 'quality definitions', while still being valid data for aDNA researchers. Instead you will _normally_ be looking for 'global' patterns across all samples of a sequencing run to check for library construction or sequencing failures. The decision on whether an individual sample has 'failed' or not should be made by the user after checking all the plots themselves (e.g. if the sample is consistently an outlier to all others in the run).

For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).

> **NB:** The FastQC (pre-Trimming) plots displayed in the MultiQC report show _untrimmed_ reads. They may contain adapter sequence and potentially regions of low quality. To see how your reads look after trimming, look at the FastQC reports in the FastQC (post-Trimming) section. After AdapterRemoval, you should expect most of the artefacts to be removed.
>
> :warning: If you turned on `--post_ar_fastq_trimming`, your 'post-Trimming' report will show the statistics _after_ this trimming. There is no separate report for the post-AdapterRemoval trimming.

#### Sequence Counts

This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself — unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over.

A section of the bar will also show an approximate estimation of the fraction of the total number of reads that are duplicates of another. This can derive from over-amplification of the library, or lots of single adapters. This can be checked later with the Deduplication check. A good library and sequencing run should have very low amounts of duplicate reads.

<p align="center">
  <img src="images/output/fastqc/fastqc_sequence_counts.png" width="75%" height = "75%">
</p>

#### Sequence Quality Histograms

This line plot represents the Phred scores across each base pair of all the reads. The x-axis is the base position across each read, and the y-axis is the average base-calling score (Phred-scaled) of the nucleotides across all reads. Again, this is per FASTQ file (i.e. forward/reverse and/or lanes separately). The background colours represent approximate ranges of quality, with green section being acceptable quality, orange is dubious and red is bad.

You will often see that the first 5 or so bases have slightly lower quality than the rest of the read, as this is during the calibration steps of the machine. The bulk of the read should then stay at ~35. Do not worry if the last 10-20 bases of the reads have lower quality base calls than the middle of the read; this is common, as the sequencing reagents start to deplete during these cycles (e.g. making nucleotide fluorescence weaker). Furthermore, the reverse reads of sequencing data will often be even lower at the ends than forward reads for the same reason.

<p align="center">
  <img src="images/output/fastqc/fastqc_sequence_quality_histogram.png" width="75%" height = "75%">
</p>

Things to watch out for:

* all positions having Phred scores less than 27
* a sharp drop-off of quality early in the read
* for paired-end data, if either R1 or R2 is significantly lower quality across the whole read compared to the complementary read.
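
To put the Phred thresholds mentioned above into perspective, the standard Phred definition relates a quality score Q to the probability p of an incorrect base call via Q = -10 * log10(p). The helper names below are hypothetical and exist only to show the conversion.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score `q` is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability `p`."""
    return -10 * math.log10(p)


print(f"{phred_to_error_probability(35):.4f}")  # 0.0003
print(f"{phred_to_error_probability(30):.4f}")  # 0.0010
print(f"{phred_to_error_probability(27):.4f}")  # 0.0020
```

A score of 27 therefore corresponds to roughly one wrong call in 500 bases, while the ~35 seen in the bulk of a good read corresponds to roughly one in 3,000.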
  
#### Per Sequence Quality Scores

This is a further summary of the previous plot. This is a histogram of the _overall_ read quality (compared to per-base, above). The x axis is the mean read-quality score (summarising all the bases of the read in a single value), and the y-axis is the number of reads with this Phred score. You should see a peak with the majority of your reads between 27-35.

<p align="center">
  <img src="images/output/fastqc/fastqc_per_sequence_quality_score.png" width="75%" height = "75%">
</p>

Things to watch out for:

* bi-modal peaks which suggests artefacts in some of the sequencing cycles
* all peaks being in orange or red sections which suggests an overall bad sequencing run (possibly due to a faulty flow-cell).
  
#### Per Base Sequencing Content

This is a heatmap which shows the average percentage of C, G, T, and A nucleotides across ~4bp bins across all reads.

You expect to see the whole heatmap as a relatively equal block of colour (normally black), representing an equal mix of A, C, T, G colours (see legend).

<p align="center">
  <img src="images/output/fastqc/fastqc_per_base_sequence_content.png" width="75%" height = "75%">
</p>

Things to watch out for:

* If you see a particular colour becoming more prominent this suggests there is an over-representation of those bases at that base-pair range across all reads (e.g. 20-24bp). This could happen if you have lots of PCR duplicates, or poly-G tails from Illumina NextSeq/NovaSeq 2-colour chemistry data (where no fluorescence can mean either G or 'no-call').

> If you see poly-G tails, we recommend turning on FastP poly-G trimming in nf-core/eager. See the 'running' documentation page for details.
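
The idea behind this plot can be sketched in a few lines: for every read position, count how often each base occurs among the reads that are long enough to cover it. This is a simplified illustration only. It does not bin positions the way the plot does, the function name is hypothetical, and it is not FastQC's own code. The `N` entry also gives the per-position N content.

```python
from collections import Counter


def per_position_base_percentages(reads):
    """Return one dict per read position with the percentage of A, C, G, T and N.

    `reads` is a list of sequence strings; shorter reads simply do not
    contribute to positions beyond their end.
    """
    max_len = max((len(r) for r in reads), default=0)
    result = []
    for pos in range(max_len):
        counts = Counter(r[pos].upper() for r in reads if len(r) > pos)
        total = sum(counts.values())
        result.append({b: 100 * counts.get(b, 0) / total for b in "ACGTN"})
    return result


# Toy example: the last position is covered only by reads ending in G,
# mimicking (in miniature) what a poly-G tail does to the end of the plot.
profile = per_position_base_percentages(["ACGG", "ACGG", "TCG"])
print(profile[3]["G"])  # 100.0
```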

#### Per Sequence GC Content

This line graph shows the percentage of reads (y-axis) with a given average percent GC content (x-axis). In 'isolate' samples (i.e. the majority of the reads should be from the host species of the sample), this should be represented by a sharp peak around the average percent GC content of the reference genome. In metagenomic contexts this should be a wide flat distribution with a mean around 50%, however this can be highly different for other types of data.

<p align="center">
  <img src="images/output/fastqc/fastqc_per_sequence_GC_content.png" width="75%" height = "75%">
</p>

Things to watch out for:

* If you see a particularly high percent GC content peak with NextSeq/NovaSeq data, you may have lots of PCR duplicates, or poly-G tails from Illumina NextSeq/NovaSeq 2-colour chemistry data (where no fluorescence can mean either G or 'no-call'). Consider re-running nf-core/eager using the poly-G trimming option from `fastp`. See the 'running' documentation page for details.
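
To make the effect of poly-G tails on this plot concrete, here is a minimal sketch of the per-read GC calculation. The function name is illustrative and ignoring `N` bases is an assumption of this sketch, not a statement about FastQC internals.

```python
def gc_percent(sequence):
    """Percent of called bases (A, C, G, T) in a read that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(base in "GC" for base in called) / len(called)


balanced_read = "ATGCATGC"
poly_g_read = "ACGT" + "G" * 12  # same kind of read with an artificial G tail

print(gc_percent(balanced_read))
print(gc_percent(poly_g_read))
```

A library where many reads carry such tails will therefore show a second, high-GC peak in the distribution.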

#### Per Base N Content

This line graph shows you the average number of Ns found across all reads of a sample. Ns can occur for a variety of reasons, such as a low-confidence base call, or the base having been masked. The lines should be very low (as close to 0 as possible) and generally be flat across the whole read. In HiSeq data, increases in Ns may reflect the last cycles running out of chemistry.

<p align="center">
  <img src="images/output/fastqc/fastqc_per_base_n_content.png" width="75%" height = "75%">
</p>

> **NB:** Publicly downloaded data may have extremely high N contents across all reads. These normally come from 'masked' reads that may have originally been, for example, from a human sample for microbial analysis where the consent for publishing of the host DNA was not given. In these cases you do not need to worry about this plot.

#### Sequence Duplication Levels

This plot is somewhat similar to looking at the duplication rate or 'cluster factor' of mapped reads. In this case, however, FastQC takes the sequences of the first 100 thousand reads of a library, and looks to see how often a read sequence is repeated in the rest of the library.

<p align="center">
  <img src="images/output/fastqc/fastqc_sequence_duplication_level.png" width="75%" height = "75%">
</p>

A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1), suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possibly a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates.

Note that good libraries may sometimes have small peaks at high duplication levels. This may be due to free-adapters (with no inserts), or mono-nucleotide reads (e.g. GGGGG in NextSeq/NovaSeq data).

Bad libraries which have extremely low input DNA (so during amplification the same molecules have been amplified repeatedly), or a good library that has been erroneously over-amplified, will show very high duplication levels, seen as a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information).

> **NB:** amplicon libraries such as for 16S rRNA analysis may appear here as having high duplication rates and these peaks can be ignored. This can be verified if no contaminants are found in the 'Overrepresented sequences' section.
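
The idea of a 'duplication level' can be sketched as follows. This is a simplification for illustration only: it counts every read in a small list, whereas FastQC restricts itself to sequences first seen early in the file, and the function name is hypothetical.

```python
from collections import Counter


def duplication_levels(reads):
    """Return {duplication level: percent of all reads found at that level}."""
    copies_per_sequence = Counter(reads)
    reads_per_level = Counter()
    for copies in copies_per_sequence.values():
        # A sequence seen 3 times contributes 3 reads to level 3.
        reads_per_level[copies] += copies
    total = len(reads)
    return {
        level: 100.0 * count / total
        for level, count in sorted(reads_per_level.items())
    }


reads = ["ACGT", "ACGT", "TTGA", "CCAG", "GGGG", "GGGG", "GGGG", "GATC"]
levels = duplication_levels(reads)
print(levels)
```

In a high-complexity library almost all of the percentage sits at level 1; the toy `GGGG` reads mimic the small peak that mono-nucleotide reads can produce at higher levels.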

#### Overrepresented sequences

After identifying duplicates (see the previous section), a table will be displayed in the 'Overrepresented sequences' section of the report. This is an attempt by FastQC to check to see if the duplicates identified match common contaminants such as free adapters or mono-nucleotide reads.

You can then use this table to help identify where the problem occurred in the construction and sequencing of this library. E.g. if you have high duplication rates but no identified contaminants, this suggests over-amplification of reads rather than left-over adapters.

#### Adapter Content

This plot shows the percentage of reads (y-axis) that have an adapter starting at a particular position along a read (x-axis). There can be multiple lines per sample as each line represents a particular adapter.

It is common in aDNA libraries to see very rapid increases in the proportion of reads with an adapter 'early on' in the read, as by nature aDNA molecules are fragmented and very short. Palaeolithic samples can have reads as short as 25bp, so sequences can already start having adapters 25bp into a read.

This can already give you an indication of the authenticity of your library: if you see very low proportions of reads with adapters, this suggests long insert molecules that are less likely to derive from a 'true' aDNA library. On the flip-side, if you are working with modern DNA, it can give an indication of over-sonication if you have artificially fragmented your molecules to lower than your target molecule length.

<p align="center">
  <img src="images/output/fastqc/fastqc_adapter_content.png" width="75%" height = "75%">
</p>

If you have downloaded public data, it has often been uploaded with adapters already removed, so you can expect a flat distribution straight away.

When comparing pre- and post-AdapterRemoval FastQC plots of fresh sequencing data (assuming your sequencing center doesn't already remove adapters), you expect to see something similar to the left panel of the example above _pre-_ adapter removal and the right hand panel _post-_ adapter removal.

### FastP

#### Background

FastP is a general pre-processing toolkit for Illumina sequencing data. In nf-core/eager we currently only use the 'poly-G' trimming function. Poly-G tails occur at the ends of reads when using two-colour chemistry kits (i.e. on NextSeq and NovaSeq). This occurs because 'no fluorescence' is interpreted by the machine as a G; if the chemistry runs out, or the read is shorter than the number of cycles in the kit, the cycles at the ends of reads contain no nucleotides and these are then recorded as Gs.

While the machine should detect a reduction in base-calling quality, this is not always the case and you will retain these tails in your FASTQ files. This can cause skews in GC content and false positive SNP calls when the reference genome has long mono-nucleotide stretches (typically in larger eukaryotic genomes).

In the case of dual-indexed paired-end sequencing, it is likely poly-G tails are less of an issue as during your AdapterRemoval step, anything past the adapter will be clipped off anyway. However you can check this under the 'Per Sequence GC Content' plot in FastQC.

> **NB:** As you are more likely to see this at the end of the run, in paired-end data you should see all 'Read 2' data having a higher GC content distribution than the 'Read 1'.

While the MultiQC report has multiple plots for FastP, we will only look at GC content as that's the functionality we use currently.

The pipeline will generate the respective output for each supplied FASTQ file.

#### GC Content

This line plot shows the average GC content (Y axis) across each nucleotide of the reads (X-axis). There are two buttons per read (i.e. 2 for single-end, and 4 for paired-end) representing before and after the poly-G tail trimming.

Before filtering, if you have poly-G tails, you should see the lines going up at the end of the right-hand side of the plot.

After filtering, you should see that the average GC content along the reads is now reduced to around the general trend of the entire read.

Things to look out for:

* If you see a distinct GC content increase at the end of the reads that is not removed after filtering, check to see where along the read the increase seems to start. If it is less than 10 base pairs from the end, consider reducing the overlap parameter `--complexity_filter_poly_g_min`, which tells FastP how far into the read the Gs need to go before removing them.
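
The role of the minimum-length setting can be illustrated with a deliberately simplified sketch. The `min_len` argument below is a hypothetical stand-in for the threshold idea behind `--complexity_filter_poly_g_min`; the real FastP algorithm is more tolerant (for example of occasional non-G bases) and is not reproduced here.

```python
def trim_poly_g(sequence, min_len=10):
    """Remove a trailing run of Gs, but only if it is at least min_len long."""
    tail_length = len(sequence) - len(sequence.rstrip("G"))
    if tail_length >= min_len:
        # Note: a genuine G directly before the tail is removed as well.
        return sequence[: len(sequence) - tail_length]
    return sequence


long_tail = "ACGTAC" + "G" * 12
short_tail = "ACGTAC" + "G" * 5

print(trim_poly_g(long_tail))               # tail is long enough, so it is trimmed
print(trim_poly_g(short_tail))              # tail is too short for the default
print(trim_poly_g(short_tail, min_len=5))   # a lower threshold catches it
```

This is why a tail that starts only a few bases from the read end survives filtering until the threshold is lowered.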

### AdapterRemoval

#### Background

AdapterRemoval is a tool that does the post-sequencing clean up of your sequencing reads. It performs the following functions:

* 'Merges' (or 'collapses') forward and reverse reads of Paired End data
* Removes remaining library indexing adapters
* Trims low quality base tails from ends of reads
* Removes too-short reads

In more detail, merging is where the same read from the forward and reverse files of a single library (based on the flowcell coordinates) are compared to find a stretch of sequence that is the same. If this overlap reaches certain quality thresholds, the two reads are 'collapsed' into a single read, with the base quality scores updated accordingly to account for the increased precision of the base calls.

Adapter removal involves finding overlaps at the 5' and 3' end of reads for the artificial NGS library adapters (which connect the DNA molecule insert, and the index), and stretches that match each other are then removed from the read itself. Note, by default AdapterRemoval does _not_ remove 'internal barcodes' (between insert and the adapter), so these statistics are not considered.

Quality trimming (or 'truncating') involves looking at ends of reads for low-confidence bases (i.e. where the FASTQ Phred score is below a certain threshold). These are then removed from the read.

Length filtering involves removing any read that does not reach the number of bases specified by a particular value.

You will receive output for each FASTQ file supplied for single end data, or for each pair of merged FASTQ files for paired end data.
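
The merging, quality trimming and length filtering steps described above can be sketched in a few lines. This is a toy model under stated assumptions: it only accepts exact overlaps, does not update base qualities, and the thresholds (`min_overlap`, `min_quality`, `min_length`) are arbitrary illustrative values rather than AdapterRemoval's settings.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=11):
    """Collapse a pair if the end of read1 exactly matches the start of the
    reverse-complemented read2; return None if no such overlap exists."""
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def passes_length_filter(sequence, min_length=30):
    return len(sequence) >= min_length


# A 24 bp molecule sequenced with 18 cycles from each end.
insert = "ACGTTGCAAGGCTTAACCGGATCC"
read1 = insert[:18]
read2 = reverse_complement(insert[6:])
merged = merge_pair(read1, read2)
print(merged == insert)

trimmed, kept_quals = trim_low_quality_tail("ACGTACGT", [30, 30, 30, 30, 30, 10, 5, 2])
print(trimmed, passes_length_filter(trimmed))
```

Pairs for which `merge_pair` returns `None` correspond to reads that could not be collapsed, and reads failing `passes_length_filter` correspond to the discarded category.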

#### Retained and Discarded Reads Plot

These stacked bar plots are unfortunately a little confusing when displayed in MultiQC. However, they are relatively straightforward once you understand each category. They can be displayed as counts of reads per AdapterRemoval read-category, or as percentages of the same values. Each forward(/reverse) file combination is displayed once.

The most important value is the **Retained Read Pairs** which gives you the final number of reads output into the file that goes into mapping. Note, however, this section of the stack bar _includes_ the other categories displayed (see below) in the calculation.

Other Categories:

* If paired-end, the **Singleton [mate] R1(/R2)** categories represent reads which were unable to be collapsed, possibly due to the reads being too long to overlap.
* If paired-end, **Full-length collapsed pairs** are reads which were collapsed and did not require low-quality bases at end of reads to be removed.
* If paired-end, **Truncated collapsed pairs** are paired-end reads that were collapsed but did require the removal of low quality bases at the end of reads.
* **Discarded [mate] R1/R2** represent reads which were a part of a pair, but one member of the pair did not reach other quality criteria and was discarded. However the other member of the pair is still retained in the output file as it still reached other quality criteria.

<p align="center">
  <img src="images/output/adapter_removal/adapter_removal_discarded_reads.png" width="75%" height = "75%">
</p>
  
For ancient DNA, assuming a good quality run, you expect to see the vast majority of your reads overlapping because we have such fragmented molecules. Large numbers of singletons suggest your molecules are too long and may not represent true ancient DNA.

If you see high numbers of discarded or truncated reads, you should check your FastQC results for low sequencing quality of that particular run.

#### Length Distribution Plot

The length distribution plots show the number of reads at each read-length. You can change the plot to display different categories.

* **All** represents the overall distribution of reads. In the case of paired-end sequencing you may see a peak at the turn around from forward to reverse cycles.
* **Mate 1** and **Mate 2** represent the length of the forward and reverse read respectively, prior to collapsing.
* **Singleton** represents those reads that had one member of a pair discarded.
* **Collapsed** and **Collapsed Truncated** represent reads that overlapped and could be merged into a single read, with the latter including base-quality trimming off ends of reads. These plots will start with a vertical rise representing where you are above the minimum-read threshold you set.
* **Discarded** here represents the number of reads that did not reach the read length filter. You will likely see a vertical drop at what your threshold was set to.

<p align="center">
  <img src="images/output/adapter_removal/adapter_removal_length_distribution.png" width="75%" height = "75%">
</p>

With paired-end ancient DNA sequencing runs you expect to see a slight increase in shorter fragments in the reverse (R2) read, as our fragments are so short we often don't reach the maximum number of cycles of that particular sequencing run.

### Bowtie2

#### Background

This module provides information on mapping when running the Bowtie2 aligner. Bowtie2, like bwa, takes raw FASTQ reads and finds the most likely place on the reference genome each read derived from. While this module is somewhat redundant with the [Samtools](#samtools) module (which reports mapping statistics for bwa) and the `endorS.py` endogenous DNA value in the general statistics table, it does provide some details that could be useful in certain contexts.

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes in one value.

#### Single/Paired-end alignments

This bar plot shows the number of different categories of reads that Bowtie2 was able to align to the reference genome. You will get slightly different plots for Paired-End (PE) and Single-End (SE) data, but they are basically the same.

Ancient DNA samples typically have low endogenous DNA values, as most of the DNA from the sample is from taphonomic sources (burial environment, modern handling etc), so it is normal to get low numbers of mapping reads.

<p align="center">
  <img src="images/output/bowtie2/bowtie2_alignment_scores.png" width="75%" height = "75%">
</p>

The main additional useful information compared to [Samtools](#samtools) is that these plots can inform you how many reads had multiple places on the reference the read could align to. This can occur with low complexity reads or reads derived from e.g. repetitive regions on the genome. If you have large amounts of multi-mapping reads, this can be a warning flag that there is an issue either with the reference genome or library itself (e.g. library construction artefacts). You should investigate cases like this more closely before using the data downstream.

### MALT

#### Background

MALT is a metagenomic aligner (equivalent to BLAST, but much faster). It produces direct alignments of sequencing reads against reference genomes. It is often used for metagenomic profiling or pathogen screening and, specifically in nf-core/eager, for screening the off-target reads from genome mapping.

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes and sequencing configurations in one value.

#### Metagenomic Mappability

This bar plot gives an approximation of how many reads in your off-target FASTQ file were able to align to your metagenomic database.

Due to low 'endogenous' content of aDNA, and the high biodiversity of modern or environmental microbes that normally exists in archaeological and museum samples, you often will get relatively low mappability percentages.

<p align="center">
  <img src="images/output/malt/malt_metagenomic_mappability.png" width="75%" height = "75%">
</p>

This can also be influenced by the type of database you supplied. Many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.

#### Taxonomic assignment success

In addition to actually being able to align to a given reference sequence, MALT can also allow sequences without a 'taxonomic' ID to be included in a database. Furthermore, it utilises a 'lowest common ancestor' algorithm (LCA), that can result in a read getting no taxonomic identification (because it can align to multiple reference sequences with equal probability). Because of this, MultiQC also produces a bar plot indicating how many of the successfully aligned reads (see Metagenomic Mappability above) could be assigned a taxon ID.

<p align="center">
  <img src="images/output/malt/malt_taxonomic_assignment_success.png" width="75%" height = "75%">
</p>

For the same reasons as above, you can often get very few reads being taxonomically assigned when working with aDNA. This can also occur when many of your reads are from conserved regions of genomes and can map onto multiple references. At this point LCA pushes the possible taxon identification so high up the tree that it cannot give a taxonomic assignment.

If you have multiple samples of a similar level of preservation, but one with unusually low numbers of taxonomically assigned reads, it may be worth investigating what the alignments look like in case there is some sequencing artefact (although it could just be badly preserved and contain little DNA).
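
To see why reads from conserved regions end up 'high up the tree', here is a minimal sketch of a lowest common ancestor lookup. The taxonomy is a hand-written toy with most ranks omitted, and the functions are illustrative only; MALT's own LCA implementation and its weighting options are not reproduced.

```python
# Toy child -> parent taxonomy (hypothetical, many ranks left out).
PARENT = {
    "Yersinia pestis": "Yersinia",
    "Yersinia pseudotuberculosis": "Yersinia",
    "Yersinia": "Enterobacterales",
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacterales",
    "Enterobacterales": "Bacteria",
    "Bacteria": "root",
    "Homo sapiens": "Eukaryota",
    "Eukaryota": "root",
}


def lineage(taxon, parent):
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all taxa a read aligned to."""
    paths = [lineage(taxon, parent) for taxon in taxa]
    shared = set(paths[0]).intersection(*paths[1:])
    return next(node for node in paths[0] if node in shared)


print(lowest_common_ancestor(["Yersinia pestis", "Yersinia pseudotuberculosis"], PARENT))
print(lowest_common_ancestor(["Yersinia pestis", "Escherichia coli"], PARENT))
print(lowest_common_ancestor(["Yersinia pestis", "Homo sapiens"], PARENT))
```

The more distantly related the references a read hits equally well, the less informative the node it is assigned to, down to `root` in the last call.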

### Kraken

#### Background

Kraken is another metagenomic classifier, but takes a different approach from the alignment used by [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not, however, do alignment, meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.

It is useful when you do not have large computing power or you want very rapid but rough approximation of the metagenomic profile of your sample.

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes and sequencing configurations in one value.
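
The k-mer idea can be sketched as a simple lookup table. This is a strongly simplified stand-in: the reference sequences and taxon names are made up, and picking the taxon with the most k-mer hits is an assumption of this sketch rather than Kraken's actual scoring, which works on a taxonomy tree.

```python
from collections import Counter


def kmers(sequence, k):
    """All distinct substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(references, k):
    """Map every reference k-mer to the set of taxa that contain it."""
    index = {}
    for taxon, sequence in references.items():
        for kmer in kmers(sequence, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon sharing the most k-mers with the read."""
    hits = Counter()
    for kmer in kmers(read, k):
        for taxon in index.get(kmer, ()):
            hits[taxon] += 1
    if not hits:
        return "unclassified"
    return hits.most_common(1)[0][0]


references = {
    "taxon_A": "ACGTACGTTAGCCGATAGGC",
    "taxon_B": "TTGACCGGTAATCGCATGCA",
}
index = build_index(references, k=5)

print(classify("CGTTAGCCGA", index, k=5))
print(classify("GGGGGGGGGG", index, k=5))
```

Note that the result is only a label: no position or base-by-base comparison to the reference is produced, which is why damage patterns cannot be assessed from this kind of output.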

#### Top Taxa

This plot gives you an approximation of the abundance of the top five taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes. A large fraction of 'unclassified' can therefore be quite normal.

<p align="center">
  <img src="images/output/kraken/kraken_top_taxa.png" width="75%" height = "75%">
</p>

However for screening for specific metagenomic profiles, such as ancient microbiomes, if the top taxa are from your specific microbiome of interest (e.g. looking at calculus for oral microbiomes, or paleofaeces for gut microbiome), this can be a good indicator that you have a well preserved sample. But of course, you must do further downstream (manual!) authentication of these taxa to ensure they are not from modern contamination.

### Samtools

#### Background

This module provides numbers in raw counts of the mapping of your DNA reads to your reference genome.

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes in one value.

#### Flagstat Plot

This dot plot shows different statistics, and the number of reads (typically as a multiple, e.g. millions or thousands) is represented by dots on the X axis.

In most cases the first two rows, 'Total Reads' and 'Total Passed QC', will be the same as EAGER (v1) does not do quality control of reads with samtools. This number should normally be the same as the number of (clipped, and if paired-end, merged) retained reads coming out of AdapterRemoval.

The third row 'Mapped' represents the number of reads that found a place that could be aligned on your reference genome. This is the raw number of mapped reads, prior to PCR duplicate removal.

The remaining rows will be 0 when running `bwa aln` as these characteristics of the data are not considered by the algorithm by default.

<p align="center">
  <img src="images/output/samtools_flagstat/samtools_flagstat.png" width="80%" height = "80%">
</p>

> **NB:** The Samtools (pre-samtools filter) plots displayed in the MultiQC report show mapped reads without mapping quality filtering. These will contain reads that can map to multiple places on your reference genome with an equal or slightly lower mapping quality score. To see how your reads look after mapping quality filtering, look at the Samtools (post-samtools filter) plots. You should expect that, after mapping quality filtering, you will have fewer reads.
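
How a mapping quality threshold changes the 'Mapped' count can be sketched directly on SAM text records. The records below are invented, `count_mapped` is a hypothetical helper, and unlike `samtools flagstat` this sketch does not distinguish secondary or supplementary alignments.

```python
def count_mapped(sam_lines, min_mapq=0):
    """Return (total records, mapped records with MAPQ >= min_mapq)."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        total += 1
        # Bit 0x4 of the SAM FLAG field marks an unmapped read.
        if not (flag & 0x4) and mapq >= min_mapq:
            mapped += 1
    return total, mapped


sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr1\t250\t0\t10M\t*\t0\t0\tTTGACCGGTA\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGGGGGGG\tIIIIIIIIII",
]

print(count_mapped(sam))               # no mapping quality filter
print(count_mapped(sam, min_mapq=25))  # with a mapping quality filter
```

The read with mapping quality 0 (typically one that fits equally well in several places) is counted as mapped before filtering but drops out afterwards.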

### DeDup

You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes of the library in one value.

#### Background

DeDup is a duplicate removal tool which searches for PCR duplicates and removes them from your BAM file. We remove these duplicates because otherwise you would be artificially increasing your coverage, and subsequently your confidence in genotyping, by considering these lab artefacts which are not biologically meaningful. DeDup looks for reads with the same start and end coordinates, and whether they have exactly the same sequence. The main difference of DeDup versus e.g. `samtools markdup` is that DeDup considers _both_ ends of a read, not just the start position, so it is more precise in removing actual duplicates without further penalising already low-yield aDNA data.
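
The difference between considering both ends of a read and only its start can be sketched as follows. The read records are invented dictionaries, both functions are illustrative simplifications (sequence comparison and quality-based choice of the retained read are left out), and neither is the code of DeDup or any other tool.

```python
def dedup_both_ends(reads):
    """Keep the first read seen for every (chromosome, start, end) combination."""
    seen, kept = set(), []
    for read in reads:
        key = (read["chrom"], read["start"], read["end"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def dedup_start_only(reads):
    """Keep the first read seen for every (chromosome, start) combination."""
    seen, kept = set(), []
    for read in reads:
        key = (read["chrom"], read["start"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "end": 150},
    {"name": "r2", "chrom": "chr1", "start": 100, "end": 150},  # true duplicate of r1
    {"name": "r3", "chrom": "chr1", "start": 100, "end": 140},  # different molecule
]

print([r["name"] for r in dedup_both_ends(reads)])
print([r["name"] for r in dedup_start_only(reads)])
```

With merged aDNA reads both ends of the original molecule are known, so the shorter molecule `r3` can be kept instead of being discarded as a presumed duplicate.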

#### DeDup Plot

This stacked bar plot shows as a whole the total number of reads in the BAM file going into DeDup. The different sections of a given bar represent the following:

* **Not Removed**: the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
* **Reverse Removed**: the number of reads that were found to be a duplicate of another and removed, and that were un-collapsed reverse reads (from the earlier read merging step).
* **Forward Removed**: the number of reads that were found to be a duplicate of another and removed, and that were un-collapsed forward reads (from the earlier read merging step).
* **Merged Removed**: the number of reads that were found to be a duplicate and removed, and that were collapsed reads (from the earlier read merging step).
  
Exceptions to the above:

* If you do not have paired end data, you will not have sections for 'Merged removed' or 'Reverse removed'.
* If you use the `--dedup_all_merged` flag, you will not have the 'Forward removed' or 'Reverse removed' sections.

<p align="center">
  <img src="images/output/dedup/dedup_deduplicated_reads.png" width="75%" height = "75%">
</p>

Things to look out for:

* The smaller the number of the duplicates removed the better. If you have a small number of duplicates, and wish to sequence deeper, you can use the preseq module (see below) to make an estimate on how much deeper to sequence.
* If you have a very large number of duplicates that were removed, this may suggest you have an over-amplified library, or a lot of left-over adapters that were able to map to your genome.

### Picard

#### Background

Picard is a toolkit for general BAM file manipulation with many different functions. nf-core/eager most visibly uses the 'markduplicates' tool, for the removal of exact PCR duplicates that can occur during library amplification and result in falsely inflated coverages (and overly-confident genotyping calls).

#### Mark Duplicates

The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well-preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.

<p align="center">
  <img src="images/output/picard/picard_deduplication_stats.png" width="75%" height = "75%">
</p>

The amount of unmapped reads will depend on whether you have filtered out unmapped reads or not (see the [usage/running the pipeline](usage.md) documentation for more information).

Things to look out for:

* The smaller the number of the duplicates removed the better. If you have a smaller number of duplicates, and wish to sequence deeper, you can use the preseq module (see below) to make an estimate on how much deeper to sequence.
* If you have a very large number of duplicates that were removed, this may suggest you have an over-amplified library, a badly preserved sample with a very low yield, or a lot of left-over adapters that were able to map to your genome.

### Preseq

#### Background

Preseq is a collection of tools that allow assessment of the complexity of the library, where complexity means the number of unique molecules in your library (i.e. not molecules with the exact same length and sequence).

There are two algorithms from the tools we use: `c_curve` and `lc_extrap`. The former gives you the expected number of unique reads if you were to repeat sequencing but with fewer reads than your first sequencing run. The latter tries to extrapolate the decay in the number of unique reads you would get with re-sequencing but with more reads than your initial sequencing run.

Due to endogenous DNA being so low when doing initial screening, the maths behind `lc_extrap` often fails as there is not enough data. Therefore nf-core/eager sticks with `c_curve` which gives a similar approximation of the library complexity, but is more robust to smaller datasets.

You will receive output for each deduplicated _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes of the library in one value.
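
The quantity that a complexity curve describes can be approximated by brute force: take the sequenced reads in random order and count how many distinct molecules have been seen at each depth. The sketch below does exactly that on made-up molecule identifiers; it is only a stand-in for the concept and is not how preseq calculates its curves.

```python
import random


def complexity_curve(molecule_ids, step, seed=0):
    """Distinct molecules observed after each `step` reads of a shuffled library."""
    shuffled = list(molecule_ids)
    random.Random(seed).shuffle(shuffled)
    curve, seen = [], set()
    for depth, molecule in enumerate(shuffled, start=1):
        seen.add(molecule)
        if depth % step == 0 or depth == len(shuffled):
            curve.append((depth, len(seen)))
    return curve


# Toy library: 50 distinct molecules, each sequenced 4 times (200 reads).
library = [read % 50 for read in range(200)]
curve = complexity_curve(library, step=50)
for depth, unique in curve:
    print(depth, unique)
```

Because the toy library contains only 50 distinct molecules, the number of unique molecules can never exceed 50 however deep you go, which is the plateau behaviour to look for in the real plot.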

#### Complexity Curve

Using the de-duplication information from DeDup, the calculated curve (a solid line) allows you to estimate: at this sequencing depth (on the X axis), how many unique molecules would you have sequenced (along the Y axis). When you start getting DNA sequences that are mostly the same as ones you've sequenced before, it is often not cost effective to continue sequencing and is a good point to stop.

The dashed line represents a 'perfect' library containing only unique molecules and no duplicates. You are looking for your library to stay as close to this line as possible. Plateauing of your curve shows that at that point you would not be getting any more unique molecules and you shouldn't sequence further than this.

<p align="center">
  <img src="images/output/preseq/preseq_complexity_curve.png" width="75%" height = "75%">
</p>

Plateauing can be caused by a number of reasons:

* You have simply sequenced your library to exhaustion
* You have an over-amplified library with many PCR duplicates. You should consider rebuilding the library to maximise data to cost ratio
* You have a low quality library made up of mappable sequencing artefacts that were able to pass filtering (e.g. adapters)

### Damage Calculation

#### Background

DamageProfiler and mapDamage are tools that calculate a variety of standard 'aDNA' metrics from a BAM file. The primary plots here are the misincorporation and length distribution plots. Ancient DNA undergoes depurination and hydrolysis, causing fragmentation of molecules into gradually shorter fragments, and cytosine to thymine deamination damage, which occurs on the subsequent single-stranded overhangs at the ends of molecules.

Therefore, three main characteristics of ancient DNA are:

* Short DNA fragments
* Elevated G and As (purines) just before strand breaks
* Increased C and Ts at ends of fragments

You will receive output for each deduplicated _library_. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes of the library in one value.
  
#### Misincorporation Plots

The MultiQC DamageProfiler and mapDamage module misincorporation plots show the percent frequency (Y axis) of C to T mismatches at 5' read ends and complementary G to A mismatches at the 3' ends. The X axis represents base pairs from the end of the molecule from the given prime end, going into the middle of the molecule, i.e. 1st base of molecule, 2nd base of molecule etc. until the 14th base pair. The mismatches are when compared to the base of the reference genome at that position.

When looking at the misincorporation plots, keep the following in mind:

* As few-base single-stranded overhangs are more likely to occur than long overhangs, we expect to see a gradual decrease in the frequency of the modifications from position 1 to the inside of the reads.
* If your library has been **partially-UDG treated**, only the first one or two bases will display the misincorporation frequency.
* If your library has been **UDG treated** you will expect to see extremely-low to no misincorporations at read ends.
* If your library is **single-stranded**, you will expect to see only C to T misincorporations at both 5' and 3' ends of the fragments.
* We generally expect that the older the sample, or the less-ideal preservational environment (hot/wet) the greater the frequency of C to T/G to A.
* The curve will not be smooth when you have few reads informing the frequency calculation. Read counts of less than 500 are likely not reliable.
* If the `mapdamage_downsample` parameter was specified and mapDamage was used for damage calculation, the damage frequency for each base is based only on the specified number of reads.

<p align="center">
  <img src="images/output/damageprofiler/damageprofiler_deaminationpatterns.png" width="75%" height = "75%">
</p>

> **NB:** An important difference to note compared to the mapDamage tool, which DamageProfiler is otherwise an exact re-implementation of, is that the percent frequency on the Y axis is not fixed between 0 and 0.3, and will 'zoom' into small values the less damage there is.
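
The frequency plotted for the 5' end can be sketched as a simple count over aligned read and reference bases. The alignments below are invented, gap-free toy pairs, and the function is an illustration of the calculation only, not code from DamageProfiler or mapDamage.

```python
def c_to_t_frequency(alignments, n_positions=5):
    """Per position from the 5' end: fraction of reference Cs read as T."""
    reference_c = [0] * n_positions
    read_t = [0] * n_positions
    for read, reference in alignments:
        for i in range(min(n_positions, len(read), len(reference))):
            if reference[i] == "C":
                reference_c[i] += 1
                if read[i] == "T":
                    read_t[i] += 1
    return [t / c if c else 0.0 for t, c in zip(read_t, reference_c)]


# (read, reference) pairs, all starting at their 5' end.
alignments = [
    ("TCGAC", "CCGAC"),
    ("TCGAC", "CCGAC"),
    ("CTGAC", "CCGAC"),
    ("CCGAC", "CCGAC"),
]

print(c_to_t_frequency(alignments))
```

With only four reads each single mismatch shifts the frequency by 0.25, which illustrates why curves built from few reads look jagged.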

#### Length Distribution

The MultiQC DamageProfiler and mapDamage module length distribution plots show the frequency of read lengths across forward and reverse reads respectively.

When looking at the length distribution plots, keep in mind the following:

* Your curves will likely not start at 0, and will start wherever your minimum read-length setting was when removing adapters.
* You should typically see the bulk of the distribution falling between 40-120bp, which is normal for aDNA
* You may see large peaks at paired-end turn-arounds, due to very long reads that could not overlap for merging being present; however, these reads are normally from modern contamination.
* If the `mapdamage_downsample` parameter was specified and mapDamage was used for damage calculation, the length distribution is based only on the specified number of reads.

### QualiMap

#### Background

Qualimap is a tool which provides statistics on the quality of the mapping of your reads to your reference genome. It allows you to assess how well covered your reference genome is by your data, both in 'fold' depth (average number of times a given base on the reference is covered by a read) and 'percentage' (the percentage of all bases on the reference genome that are covered at a given fold depth). These outputs allow you to decide whether you have enough quality data for downstream applications like genotyping, and how to adjust the parameters for those tools accordingly.

> NB: Neither fold coverage nor percent coverage on their own is sufficient to assess whether you have a high quality mapping. Abnormally high fold coverages of a smaller region, such as highly conserved genes or reference genomes containing un-removed adapters, can artificially inflate the mean coverage, yet a high percent coverage is not useful if all bases of the genome are covered at just 1x coverage.
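
A small numerical sketch shows why the two values need to be read together. The per-base depths are invented and the function names are illustrative; this is not how Qualimap is implemented.

```python
def mean_fold_coverage(depths):
    """Average number of reads covering each reference position."""
    return sum(depths) / len(depths)


def percent_covered(depths, min_depth=1):
    """Percent of reference positions covered by at least min_depth reads."""
    return 100.0 * sum(depth >= min_depth for depth in depths) / len(depths)


# Two toy 100 bp references, given as per-base depths.
stacked = [0] * 90 + [50] * 10  # reads piled up on one small region
even = [1] * 100                # every base covered exactly once

for name, depths in (("stacked", stacked), ("even", even)):
    print(name, mean_fold_coverage(depths), percent_covered(depths))
```

The stacked case has the higher mean fold coverage despite 90% of the reference having no data at all, while the evenly covered case reaches 100% breadth at a depth too low for confident genotyping.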

Note that many of the statistics from this module are displayed in the General Stats table (see above), as they represent single values that are not plottable.

You will receive output for each _sample_. This means you will get statistics of deduplicated values of all types of libraries combined in a single value (i.e. non-UDG treated, full-UDG, paired-end, single-end all together).

:warning: If your library has no reads mapping to the reference, this will result in an empty BAM file. Qualimap will therefore not produce any output even if a BAM exists!

#### Coverage Histogram

This plot shows on the X axis the range of fold coverages that the bases of the reference genome are covered by. The Y axis shows the number of bases that were covered at the fold coverage depth indicated on the X axis.

The greater the number of bases covered at as high as possible fold coverage, the better.

<p align="center">
  <img src="images/output/qualimap/qualimap_coverage_histogram.png" width="75%" height = "75%">
</p>

Things to watch out for:

* You will typically see a direct decay from the lowest coverage to higher. A large range of coverages along the X axis is potentially suspicious.
* If you have stacking of reads, i.e. a small region with an abnormally large number of reads despite the rest of the reference being quite shallowly covered, this will artificially increase your coverage. This would be represented by a small peak that is much further along the X axis, away from the main distribution of reads.
  
#### Cumulative Genome Coverage

This plot shows how much of the genome in percentage (X axis) is covered by a given fold depth coverage (Y axis).

An ideal plot for this is to see an increasing curve, representing greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage, something particularly true for large genomes such as for humans.

<p align="center">
  <img src="images/output/qualimap/qualimap_cumulative_genome_coverage.png" width="75%" height = "75%">
</p>
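As a rough illustration of where such a decreasing curve comes from, the sketch below computes, for each fold depth `k`, the percentage of reference positions covered at `k` fold or more. This assumes the curve is read as "covered by at least `k` reads"; the function name and the depth values are hypothetical and do not reproduce Qualimap's own implementation.

```python
def cumulative_coverage(depths, max_fold):
    """Percent of reference positions covered at >= k fold, for k = 0..max_fold."""
    total = len(depths)
    return [
        100.0 * sum(1 for d in depths if d >= k) / total
        for k in range(max_fold + 1)
    ]


# Hypothetical low-coverage reference: most positions have no reads at all
depths = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
print(cumulative_coverage(depths, 3))  # [100.0, 40.0, 20.0, 10.0]
```

With low-coverage data the values drop sharply after 0 fold, which matches the decreasing shape described above.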

#### GC Content Distribution

This plot shows the distribution of the frequency of reads at different GC contents. The X axis represents the GC content (i.e. the percentage of G and C nucleotides in a given read), and the Y axis represents the frequency.

<p align="center">
  <img src="images/output/qualimap/qualimap_gc_content_distribution.png" width="75%" height = "75%">
</p>
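The per-read GC content on the X axis can be sketched as follows. This is a minimal illustration with a hypothetical function name and made-up reads, not the code Qualimap uses.

```python
def gc_percent(read):
    """Percentage of G and C nucleotides in a single read (case-insensitive)."""
    read = read.upper()
    if not read:
        return 0.0
    gc_bases = sum(1 for base in read if base in "GC")
    return 100.0 * gc_bases / len(read)


print(gc_percent("ATGCATGC"))    # 50.0
print(gc_percent("GGGGGGGGAT"))  # 80.0, e.g. a read dominated by a poly-G tail
```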

Things to watch out for:

* This plot should normally show a normal distribution around the average GC content of your reference genome.
* Bimodal peaks may represent lab-based artefacts that should be further investigated.
* Skews of the peak to a higher GC content than the reference in Illumina dual-colour chemistry data (e.g. NextSeq or NovaSeq) may suggest long poly-G tails that are mapping to poly-G stretches of your genome. The nf-core/eager trimming option `--complexity_filter_poly_g` can be used to remove these tails by utilising the tool FastP for detection and trimming.

### Sex.DetERRmine

#### Background

Sex.DetERRmine calculates the coverage of your mapped reads on the X and Y chromosomes relative to the coverage on the autosomes (X-/Y-rate). This metric can be thought of as the number of copies of chromosomes X and Y that are found within each cell, relative to the autosomal copies. The number of autosomal copies is assumed to be two, meaning that an X-rate of `1.0` means there are two X chromosomes in each cell, while `0.5` means there is a single copy of the X chromosome per cell. Human females have two copies of the X chromosome and no Y chromosome (XX), while human males have one copy of each of the X and Y chromosomes (XY).
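The relationship between relative coverage and chromosome copies described above can be written out as a short sketch. The function names and coverage values are hypothetical, and this only reproduces the arithmetic of the description, not Sex.DetERRmine's actual code or its error estimation.

```python
def relative_coverage(cov_x, cov_y, cov_autosomes):
    """Return (X-rate, Y-rate): coverage on X and Y relative to the autosomes."""
    return cov_x / cov_autosomes, cov_y / cov_autosomes


def implied_copies(rate, autosomal_copies=2):
    """Chromosome copies per cell implied by a rate, assuming two autosomal copies."""
    return rate * autosomal_copies


# Hypothetical mean coverages: autosomes at 1.0x, X and Y each at 0.5x
x_rate, y_rate = relative_coverage(cov_x=0.5, cov_y=0.5, cov_autosomes=1.0)
print(implied_copies(x_rate), implied_copies(y_rate))  # 1.0 1.0 (one X, one Y)
```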

When a bedfile of specific sites is provided, Sex.DetERRmine additionally calculates error bars around each relative coverage estimate. For this estimate to be trustworthy, the sites included in the bedfile should be spaced apart enough that a single sequencing read cannot overlap multiple sites. Hence, when a bedfile has not been provided, this error should be ignored. When a suitable bedfile is provided, each observation of a covered site is independent, and the error around the coverage is equal to the binomial error estimate. This error is then propagated during the calculation of relative coverage for the X and Y chromosomes.

> Note that in nf-core/eager this will be run on single- and double-stranded variants of the same library _separately_. This can also help assess for differential contamination between libraries.

#### Relative Coverage

Theoretically, males are expected to cluster around (0.5, 0.5) in the produced scatter plot, while females are expected to cluster around (1.0, 0.0). In practice, when analysing ancient DNA, the relative coverage on both axes is slightly lower than expected, and individuals can cluster around (0.45, 0.45) and (0.85, 0.05). As the number of covered sites for an individual gets smaller, the confidence in the estimate becomes lower, because it is increasingly likely to be affected by randomness in the preservation and sequencing of ancient DNA.
Placement of individuals between the male and female clusters can be indicative of low coverage and in some cases contamination, when the contaminant and sampled individuals are of opposite biological sex.
Aneuploidy of the sex chromosomes can also be identified with this approach when the placement of an individual in the scatter plot is unexpected. For example, placement of an individual around (1.0, 0.5) despite good genomic coverage is indicative of XXY karyotype (Klinefelter syndrome), while placement around (0.5, 0) could be indicative of karyotype X (Turner's syndrome).

<p align="center">
  <img src="images/output/sexdeterrmine/sexdeterrmine_relative_coverage.png" width="75%" height = "75%">
</p>
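To illustrate how the theoretical cluster positions above relate to an observed point, the sketch below labels an (X-rate, Y-rate) pair with the nearest expected position. This is purely a reading aid under simplifying assumptions: the function name is hypothetical, it ignores error bars, coverage and contamination, and it is not how Sex.DetERRmine or nf-core/eager assigns sex.

```python
import math

# Theoretical (X-rate, Y-rate) positions taken from the text above
EXPECTED_POSITIONS = {
    "XY": (0.5, 0.5),
    "XX": (1.0, 0.0),
    "XXY": (1.0, 0.5),
    "X": (0.5, 0.0),
}


def nearest_karyotype(x_rate, y_rate):
    """Return the label whose expected position is closest (Euclidean distance)."""
    return min(
        EXPECTED_POSITIONS,
        key=lambda label: math.hypot(
            x_rate - EXPECTED_POSITIONS[label][0],
            y_rate - EXPECTED_POSITIONS[label][1],
        ),
    )


print(nearest_karyotype(0.45, 0.45))  # XY
print(nearest_karyotype(0.85, 0.05))  # XX
```

A point that sits roughly halfway between two positions should be treated as ambiguous rather than forced into either label.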

#### Read Counts

This plot gives you the number of reads mapped onto the autosomes, X or Y chromosomes. When the total number of mapped reads is low, the estimates are more likely to be dominated by random effects, and hence untrustworthy.
For well-covered data without any skews, you should see long bars that are comprised mostly of autosomal reads. The edge of the bars in female individuals should be mostly X (some small amounts of Y reads are expected and are usually caused by random mapping on the Y chromosome). In males, the number of X-reads will still be higher (since the X chromosome is longer), but the Y reads should be clearly visible on the rightmost end of the bars. The ratio between the number of sites in each bin should roughly correlate with the difference in length in base pairs of each chromosome type.
If this correlation is not observed, your data is skewed towards higher coverage on some chromosomes. This can be expected if you have enriched for a specific set of markers (e.g. Y-chromosome capture), or if the number of reads is too low.

<p align="center">
  <img src="images/output/sexdeterrmine/sexdeterrmine_read_counts.png" width="75%" height = "75%">
</p>

### Bcftools

#### Background

Bcftools is a toolkit for processing and summarising of VCF files, i.e. variant call format files. nf-core/eager currently uses bcftools for the `stats` functionality. This summarises in a text file a range of statistics about VCF files, produced by GATK and FreeBayes variant callers.

#### Variant Substitution Types

This stacked bar plot shows you the distribution of all types of point-mutation variants away from the reference nucleotide at each position (e.g. A>C, A>G etc.).

For low-coverage aDNA data that is non-UDG treated and has been neither trimmed nor re-scaled, you expect to see C>T substitutions as the largest category, because the most common form of ancient DNA damage is C to T deamination.
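As an illustration of what the plot summarises, the following sketch tallies substitution types from reference/alternative allele pairs. `bcftools stats` produces these counts itself; the function name and the example calls here are hypothetical.

```python
from collections import Counter


def substitution_counts(variants):
    """Count point substitutions (e.g. 'C>T') from (ref, alt) allele pairs.

    Pairs that are not single-nucleotide substitutions are skipped.
    """
    return Counter(
        f"{ref}>{alt}"
        for ref, alt in variants
        if len(ref) == 1 and len(alt) == 1 and ref != alt
    )


# Hypothetical variant calls; ("AT", "A") is a deletion and is ignored
calls = [("C", "T"), ("C", "T"), ("G", "A"), ("A", "G"), ("C", "T"), ("AT", "A")]
counts = substitution_counts(calls)
print(counts.most_common(1))  # [('C>T', 3)]
```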

#### Variant Quality

This gives you the distribution of variant-call _qualities_ in your VCF files. Each variant is given a 'Phred-scale'-like value that represents the confidence of the variant caller that it has made the right call. The scale is very similar to that of base-call values in FASTQ files (as assessed by FastQC). Distributions that have peaks at higher variant quality scores (>= 30) suggest more confident variant calls. However, in cases of low-coverage aDNA data, these distributions may be shifted towards lower scores.

More detailed explanation of variant quality scores can be seen in the Broad Institute's [GATK documentation](https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores).
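For orientation, the standard Phred relationship between a quality score Q and the probability that a call is wrong is P = 10^(-Q/10). The sketch below applies it, assuming the variant caller's scores follow this standard definition; the function names are hypothetical.

```python
import math


def phred_to_error_probability(quality):
    """Probability that a call is wrong for a Phred-scaled quality score."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred-scaled quality score for a given error probability."""
    return -10 * math.log10(probability)


for q in (10, 20, 30):
    # Q10 is about 1 in 10, Q20 about 1 in 100, Q30 about 1 in 1000
    print(q, phred_to_error_probability(q))
```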

#### Indel Distribution

This plot shows you the distribution of the sizes of insertions and deletions (InDels) in the variant calling (assuming you configured your variant caller parameters to do so). Low-coverage aDNA data often will not have high enough coverage to accurately assess InDels. In cases of high-coverage data of small genomes such as microbes, however, large numbers of InDels may indicate your reads are actually from a _relative_ of the reference mapped to, and should be verified downstream.

#### Variant depths

This plot shows the distribution of depth coverages of each variant called. Typically higher coverage will result in higher quality variant calls (see Variant Quality, above), however in many cases in aDNA these may be low and unequally distributed (due to uneven mapping coverage from contamination).

### MultiVCFAnalyzer

#### Background

MultiVCFAnalyzer is a SNP alignment generation tool that allows further evaluation and filtering of SNP calls made by the GATK UnifiedGenotyper. More specifically, it takes one or more VCF files as well as a reference genome, allows filtering of SNPs via a variety of metrics, and produces a FASTA file with each sample as an entry containing 'consensus calls' at each position.

#### Summary metrics

This table shows the contents of the `snpStatistics.tsv` file produced by MultiVCFAnalyzer. Descriptions of each column can be seen at the MultiVCFAnalyzer page [here](https://github.com/alexherbig/MultiVCFAnalyzer#snpstatisticstsv).

#### Call statistics barplot

You can get different variants of the call statistics bar plot, depending on how you configured the MultiVCFAnalyzer options.

If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains — particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).

<p align="center">
  <img src="images/output/multivcfanalyzer/multivcfanalyzer_call_categories.png" width="75%" height = "75%">
</p>

If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to the same value, you will have three sections to your bars: SNP Calls (hom), Reference Calls and Discard SNP Call. The overall size of the bars will generally depend on the percentage of the genome covered, meaning less well preserved samples will likely have smaller bars than well-preserved samples. Typically you wish to have a low number of discarded SNP calls, but this can be quite high when you have low coverage data (as many positions may not reach that threshold). The number of SNP calls to reference calls can vary depending on the mutation rate of your target organism.

## Output Files

This section gives a brief summary of where to look for what files for downstream analysis. This covers _all_ modules.

Each module has its own output directory, which sits alongside the `MultiQC/` directory from which you opened the report.

* `reference_genome/`: this directory contains the indexing files of your input reference genome (i.e. the various `bwa` indices, a `samtools` `.fai` file, and a picard `.dict`), if you used the `--saveReference` flag.
  * When masking of the reference is requested prior to running pmdtools, an additional directory `reference_genome/masked_genome` will be found here, containing the masked reference.
* `fastqc/`: this contains the original per-FASTQ FastQC reports that are summarised with MultiQC. These occur in both `html` (the report) and `.zip` format (raw data). The `after_clipping` folder contains the same but for after AdapterRemoval.
* `adapterremoval/`: this contains the log files (ending with `.settings`) with raw trimming (and merging) statistics after AdapterRemoval. In the `output` sub-directory, are the output trimmed (and merged) `fastq` files. These you can use for downstream applications such as taxonomic binning for metagenomic studies.
* `lanemerging/`: this contains adapter-trimmed and merged (i.e. collapsed) FASTQ files that were merged across lanes, where applicable. These files are the reads that go into mapping (when multiple lanes were specified for a library), and can be used for downstream applications such as taxonomic binning for metagenomic studies.
* `post_ar_fastq_trimmed`: this contains `fastq` files that have been additionally trimmed after AdapterRemoval (if turned on). These are usually reads that had internal barcodes, or damage, that needed to be removed before mapping.
* `mapping/`: this contains a sub-directory corresponding to the mapping tool you used, inside of which will be the initial BAM files containing the reads that mapped to your reference genome with no modification (see below). You will also find a corresponding BAM index file (ending in `.csi` or `.bai`), and if running the `bowtie2` mapper: a log ending in `_bt2.log`. You can use these for downstream applications e.g. if you wish to use a different de-duplication tool not included in nf-core/eager (although please feel free to add a new module request on the Github repository's [issue page](https://github.com/nf-core/eager/issues)!).
* `samtools/`: this contains two sub-directories. `stats/` contains the raw mapping statistics files (ending in `.stats`) from directly after mapping. `filter/` contains BAM files that have had a mapping quality filter applied (set by the `--bam_mapping_quality_threshold` flag) and a corresponding index file. Furthermore, if you selected `--bam_discard_unmapped`, you will find your separate file with only unmapped reads in the format you selected. Note unmapped read BAM files will _not_ have an index file.
* `deduplication/`: this contains a sub-directory called `dedup/`, inside of which are sample-specific directories. Each directory contains a BAM file containing mapped reads but with PCR duplicates removed, a corresponding index file and two stats files. `.hist.` contains raw data for a deduplication histogram used for tools like preseq (see below), and the `.log` contains overall summary deduplication statistics.
* `endorSpy/`: this contains all JSON files exported from the endorSpy endogenous DNA calculation tool. The JSON files are generated specifically for display in the MultiQC general statistics table and are otherwise very likely not useful for you.
* `preseq/`: this contains a `.preseq` file for every BAM file that had enough deduplication statistics to generate a complexity curve for estimating the number of unique reads that will be yielded if the library is re-sequenced. You can use this file for plotting e.g. in `R` to find your sequencing target depth.
* `qualimap/`: this contains a sub-directory for every sample, which includes a qualimap report and associated raw statistic files. You can open the `.html` file in your internet browser to see the in-depth report (this will be more detailed than in MultiQC). This includes stuff like percent coverage, depth coverage, GC content and so on of your mapped reads.
* `damageprofiler/`: this contains sample specific directories containing raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files.
* `mapdamage/`: this contains sample specific directories containing raw statistics and damage plots from mapDamage. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files. The `Runtime_log.txt` file contains runtime information.
* `pmdtools/`: this contains raw output statistics of pmdtools (estimates of frequencies of substitutions), and BAM files which have been filtered to remove reads that do not have a Post-mortem damage (PMD) score of `--pmdtools_threshold`.
* `trimmed_bam/`: this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read).
* `damage_rescaling/`: this contains rescaled BAM files from mapDamage. These BAM files have damage probabilistically removed via a bayesian model, and can be used for downstream genotyping.
* `genotyping/`: this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). If `--gatk_ug_keep_realign_bam` is supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyping for variant calling. When pileupcaller is used to create eigenstrat genotypes, this directory also contains eigenstrat SNP coverage statistics.
* `multivcfanalyzer/`: this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files.
* `sex_determination/`: this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per file. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
* `nuclear_contamination/`: this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individuals. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2.
* `bedtools/`: this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position).
* `metagenomic_complexity_filter`: this contains the output from the removal of low-sequence-complexity reads, as performed by `bbduk`, from the input reads to metagenomic classification. This will include the filtered FASTQ files (`*_lowcomplexityremoved.fq.gz`) and also the run-time log (`_bbduk.stats`) for each sample. **Note:** there are no sections in the MultiQC report for this module, therefore you must check the `_bbduk.stats` files to get summary statistics of the filtering.
* `metagenomic_classification/`: this contains the output for a given metagenomic classifier.
  * When running MALT, this will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided which gives further information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzipped SAM files if requested.
  * Running kraken will contain the Kraken output and report files, as well as a merged Taxon count table. You will also get a Kraken kmer duplication table, in a [KrakenUniq](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0) fashion. This is very useful to check for breadth of coverage and detect read stacking. A small number of aligned reads (low coverage) and a kmer duplication >1 is usually a sign of read stacking, usually indicative of a false positive hit (e.g. from over-amplified libraries). _Kmer duplication is defined as: number of kmers / number of unique kmers_. You will find two kraken reports formats available:  
    * the `*.kreport` which is the old report format, without distinct minimizer count information, used by some tools such as [Pavian](https://github.com/fbreitwieser/pavian)
    * the `*.kraken2_report` which is the new kraken report format, with the distinct minimizer count information.  
    * finally, the `*.kraken.out` files are the direct output of Kraken2
    * ⚠️ If your sample has no hits, no kraken output files will be created for that sample!
* `maltextract/`: this contains a `results` directory which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA).
* `consensus_sequence/`: this contains three FASTA files from VCF2Genome of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAS (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.
* `merged_bams/initial`: these contain the BAM files that would go into UDG-treatment specific BAM trimming. All libraries of the same sample, **and** same UDG-treatment type will be in these BAM files.
* `merged_bams/additional`: these contain the final BAM files that would go into genotyping (if genotyping is turned on). This means the files will contain all libraries of a given sample (including trimmed non-UDG or half-UDG treated libraries, if BAM trimming is turned on).
* `bcftools`: this currently contains a single directory called `stats/` that includes general statistics on variant callers producing VCF files as output by `bcftools stats`. These include things such as the number of positions, number of transitions/transversions and depth coverage of SNPs etc. These are only produced if `--run_bcftools_stats` is supplied.
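The kmer duplication value described for the Kraken output above (number of kmers / number of unique kmers) can be sketched as follows. The function name and the counts are hypothetical; the pipeline computes this table itself.

```python
def kmer_duplication(n_kmers, n_unique_kmers):
    """Kmer duplication as defined above: number of kmers / number of unique kmers."""
    if n_unique_kmers <= 0:
        raise ValueError("number of unique kmers must be positive")
    return n_kmers / n_unique_kmers


# Hypothetical taxa: the second has few unique kmers relative to its total,
# which is the pattern associated with read stacking
print(kmer_duplication(1000, 1000))  # 1.0
print(kmer_duplication(1000, 100))   # 10.0
```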


================================================
FILE: docs/usage.md
================================================
# nf-core/eager: Usage

## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/eager/usage](https://nf-co.re/eager/usage)

> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._

## Introduction

## Running the pipeline

### Quick Start

> Before you start, you should change into the output directory you wish your
> results to go in. This will guarantee that when you start the Nextflow job,
> it will place all the log files and 'working' folders in the corresponding
> output directory (and not wherever else you may have executed the run from).

The typical command for running the pipeline is as follows:

```bash
nextflow run nf-core/eager --input '*_R{1,2}.fastq.gz' --fasta 'some.fasta' -profile standard,docker
```

where the reads are from FASTQ files of the same pairing.

This will launch the pipeline with the `docker` configuration profile. See below
for more information about profiles.

Note that the pipeline will create the following files in your working
directory:

```bash
work            # Directory containing the Nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
                # Other Nextflow hidden files, eg. history of pipeline runs and old logs.
```

To see the nf-core/eager pipeline help message run: `nextflow run
nf-core/eager --help`

If you want to configure your pipeline interactively using a graphical user
interface, please visit [nf-co.re
launch](https://nf-co.re/launch?pipeline=eager). Select the `eager` pipeline and
the version you intend to run, and follow the on-screen instructions to create a
config for your pipeline run.

### Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:

```bash
nextflow pull nf-core/eager
```

### Reproducibility

It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.

First, go to the [nf-core/eager releases page](https://github.com/nf-core/eager/releases) and find the latest version number - numeric only (eg. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.3.1`.

This version number will be logged in reports when you run the pipeline, so that
you'll know what you used when you look back in the future.

Additionally, nf-core/eager pipeline releases are named after Swabian German
Cities. The first release V2.0 is named "Kaufbeuren". Future releases are named
after cities named in the [Swabian league of
Cities](https://en.wikipedia.org/wiki/Swabian_League_of_Cities).

### Automatic Resubmission

By default, if a pipeline step fails, nf-core/eager will resubmit the job with
twice the amount of CPU and memory. This will occur two times before failing.

## Core Nextflow arguments

> **NB:** These options are part of Nextflow and use a _single_ hyphen (pipeline
> parameters use a double-hyphen).

### `-profile`

Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.

Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below.

> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.

The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation).

Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important!
They are loaded in sequence, so later profiles can overwrite earlier profiles.
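The effect of profile order can be pictured with a toy model in which each profile is a set of key/value settings applied in sequence. This is only an analogy: the setting names below are hypothetical, and it does not reproduce how Nextflow resolves configuration internally.

```python
def merge_profiles(profile_names, available_profiles):
    """Apply profiles left to right; later profiles overwrite earlier ones."""
    merged = {}
    for name in profile_names:
        merged.update(available_profiles[name])
    return merged


# Hypothetical settings, for illustration only
available = {
    "test": {"max_cpus": 2, "container_engine": None},
    "docker": {"container_engine": "docker"},
}

print(merge_profiles(["test", "docker"], available))
# {'max_cpus': 2, 'container_engine': 'docker'}
print(merge_profiles(["docker", "test"], available))
# {'container_engine': None, 'max_cpus': 2}
```

In the second call the setting from `docker` is overwritten by `test`, which is why the order given to `-profile` matters.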

If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended.

* `docker`
  * A generic configuration profile to be used with [Docker](https://docker.com/)
  * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
* `singularity`
  * A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/)
  * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
* `podman`
  * A generic configuration profile to be used with [Podman](https://podman.io/)
  * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
* `shifter`
  * A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/)
  * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
* `charliecloud`
  * A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/)
  * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
* `conda`
  * Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
  * A generic configuration profile to be used with [Conda](https://conda.io/docs/)
  * Pulls most software from [Bioconda](https://bioconda.github.io/)
* `test`
  * A profile with a complete configuration for automated testing
  * Includes links to test data so needs no other parameters

> _Important_: If running nf-core/eager on a cluster - ask your system
> administrator what profile to use.

**Institution Specific Profiles** These are profiles specific to certain **HPC
clusters**, and are centrally maintained at
[nf-core/configs](https://github.com/nf-core/configs). Those listed below are
regular users of nf-core/eager, if you don't see your own institution here check
the [nf-core/configs](https://github.com/nf-core/configs) repository.

* `uzh`
  * A profile for the University of Zurich Research Cloud
  * Loads Singularity and defines appropriate resources for running the
    pipeline.
* `binac`
  * A profile for the BinAC cluster at the University of Tuebingen
  * Loads Singularity and defines appropriate resources for running the pipeline
* `shh`
  * A profile for the S/CDAG cluster at the Department of Archaeogenetics of
    the Max Planck Institute for the Science of Human History
  * Loads Singularity and defines appropriate resources for running the pipeline

**Pipeline Specific Institution Profiles** There are also pipeline-specific
institution profiles, i.e. profiles that apply special resource settings to
specific steps of this pipeline and that may not apply to other pipelines.
These can be seen at [nf-core/configs](https://github.com/nf-core/configs) under
[conf/pipeline/eager/](https://github.com/nf-core/configs/tree/master/conf/pipeline/eager).

We currently offer an nf-core/eager-specific profile for:

* `shh`
  * A profile for the S/CDAG cluster at the Department of Archaeogenetics of
    the Max Planck Institute for the Science of Human History
  * In addition to the nf-core wide profile, this also sets the MALT resources
    to match our commonly used databases

Further institutions can be added at
[nf-core/configs](https://github.com/nf-core/configs). Please ask the eager
developers to add your institution to the list above, if you add one!

If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this, please test that the config file works with your pipeline of choice using the `-c` parameter (see definition below). You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile.

If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack) on the [`#configs` channel](https://nfcore.slack.com/channels/configs).

### `-resume`

Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.

You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.

### `-c`

Specify the path to a specific config file (this is a core Nextflow command). See the [nf-core website documentation](https://nf-co.re/usage/configuration) for more information.

#### Custom resource requests

Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of `143` (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped.

Whilst these default requirements will hopefully work for most people with most
data, you may find that you want to customise the compute resources that the
pipeline requests. You can do this by creating a custom config file. For
example, to give the workflow process `bwa` 32GB of memory, you could use the
following config:

```nextflow
process {
  withName: bwa {
    memory = 32.GB
  }
}
```

To find the exact name of a process whose compute resources you wish to modify, check the live status of a Nextflow run displayed on your terminal, or check the Nextflow error for a line like so: `Error executing process > 'bwa'`. In this case the name to specify in the custom config file is `bwa`.

See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information.

### `-name`

Name for the pipeline run. If not specified, Nextflow will automatically
generate a random mnemonic.

This is used in the MultiQC report (if not default) and in the summary HTML /
e-mail (always).

**NB:** Single hyphen (core Nextflow option)

### Running in the background

Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.

The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.

Alternatively, you can use `screen` / `tmux` or similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).

To create a screen session:

```bash
screen -R nf-core/eager
```

To disconnect, press `ctrl+a` then `d`.

To reconnect, type:

```bash
screen -r nf-core/eager
```

To end the screen session while in it, type `exit`.

#### Nextflow memory requirements

In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~/.bash_profile`):

```bash
export NXF_OPTS='-Xms1g -Xmx4g'
```

## Input Specifications

There are two possible ways of supplying input sequencing data to nf-core/eager. The most efficient but more simplistic way is to supply direct paths (with wildcards) to your FASTQ or BAM files, with each file or pair being considered a single library and each one run independently. TSV input requires the user to create an extra file with extra metadata, but allows more powerful lane and library merging.

### Direct Input Method

With this method you use `--input` to specify the paths of your FASTQ (optionally gzipped) or BAM file(s). This option is mutually exclusive with the [TSV input method](#tsv-input-method), which is used for more complex input configurations such as lane and library merging.

When using the direct method of `--input` you can specify one or multiple samples in one or more directories. File names **must be unique**, even if in different directories.

By default, the pipeline _assumes_ you have paired-end data. If you want to run single-end data you must specify [`--single_end`](#single_end).

For example, for a single set of FASTQs, or multiple paired-end FASTQ files in one directory, you can specify:

```bash
--input 'path/to/data/sample_*_{1,2}.fastq.gz'
```

If you have multiple files in different directories, you can use additional wildcards (`*`) e.g.:

```bash
--input 'path/to/data/*/sample_*_{1,2}.fastq.gz'
```

> :warning: It is not possible to run a mixture of single-end and paired-end files in one run with the paths `--input` method! Please see the [TSV input method](#tsv-input-method) for possibilities.

**Please note** the following requirements:

1. Valid file extensions: `.fastq.gz`, `.fastq`, `.fq.gz`, `.fq`, `.bam`.
2. The path **must** be enclosed in quotes
3. The path must have at least one `*` wildcard character
4. When using the pipeline with **paired end data**, the path must use `{1,2}`
   notation to specify read pairs.
5. File names must be unique; having files with the same name in different directories is _not_ sufficient
   * This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior to input.
6. Due to limitations of downstream tools (e.g. FastQC), sample IDs may be truncated after the first `.` in the name. Ensure file names are unique prior to this!
7. For input BAM files you should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses.
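
The uniqueness requirement above can be checked before starting a run. The following is a minimal sketch, not part of nf-core/eager: the function name `find_duplicate_basenames` is hypothetical, and it assumes all input files sit somewhere below a single directory and that no file name contains a newline.

```bash
# Hypothetical helper: print every input file name that occurs more than once
# below the given directory, regardless of which sub-directory it is in.
find_duplicate_basenames() {
  # Regular files and symlinks with one of the valid input extensions.
  find "$1" \( -type f -o -type l \) \
    \( -name '*.fastq.gz' -o -name '*.fastq' -o -name '*.fq.gz' \
       -o -name '*.fq' -o -name '*.bam' \) |
    awk -F/ '{ print $NF }' |   # keep only the file name
    sort |
    uniq -d                     # report names seen at least twice
}

# Usage: no output means all file names are unique.
# find_duplicate_basenames path/to/data
```

Any name printed by this check would need renaming, symlinking under a unique name, or merging before being passed to `--input`.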

### TSV Input Method

Alternatively to the [direct input method](#direct-input-method), you can supply to `--input` a path to a TSV file that contains paths to FASTQ/BAM files and additional metadata. This allows for more complex procedures such as merging of sequencing data across lanes, sequencing runs, sequencing configuration types, and samples.

<p align="center">
  <img src="https://github.com/nf-core/eager/raw/master/docs/images/usage/merging_files.png" alt="Schematic diagram indicating merging points of different types of libraries, given a TSV input. Dashed boxes are optional library-specific processes" width="70%">
</p>

> Only different libraries from a single sample that have been BAM trimmed will be merged together. Rescaled or PMD filtered libraries will not be merged prior to genotyping, as each library _may_ have a different model applied to it and have its own biases (i.e. users may need to play around with settings to get optimal damage removal).

The use of the TSV `--input` method is recommended when performing more complex procedures such as lane or library merging. You do not need to specify `--single_end`, `--bam`, `--colour_chemistry`, `--udg_type` etc. when using TSV input - this is defined within the TSV file itself. You can only supply a single TSV per run (i.e. `--input '*.tsv'` will not work).

This TSV should look like the following:

| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |
|-------------|------------|------|------------------|--------|----------|--------------|---------------|----|----|-----|
| JK2782      | JK2782     | 1    | 4                | PE      | Mammoth  | double       | full          | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz) | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz) | NA  |
| JK2802      | JK2802     | 2    | 2                | SE      | Mammoth  | double       | full          | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz) | NA | NA  |

A template can be taken from
[here](https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/TSV_template.tsv).

> :warning: Cells **must not** contain spaces before or after strings, as this will make the TSV unreadable by nextflow. Strings containing spaces should be wrapped in quotes.

When using TSV input, nf-core/eager will merge FASTQ files of libraries with the same `Library_ID` but different `Lane` values after adapter clipping (and merging), assuming all other metadata columns are the same. If you have the same `Library_ID` but with different `SeqType`, these will be merged directly after mapping, prior to BAM filtering. Finally, it will also merge BAM files with the same `Sample_Name` but different `Library_ID` after duplicate removal, but prior to genotyping. Please see caveats to this below.

Column descriptions are as follows:

* **Sample_Name:** A text string containing the name of a given sample of which there can be multiple libraries. All libraries with the same sample name and same SeqType will be merged after deduplication.
* **Library_ID:** A text string containing the name of a given library, of which there can be multiple sequencing lanes (with the same SeqType).
* **Lane:** A number indicating which lane the library was sequenced on. Files from the libraries sequenced on different lanes (and different SeqType) will be concatenated after read clipping and merging.
* **Colour_Chemistry:** A number indicating whether the Illumina sequencer the library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour chemistry machine. This informs whether poly-G trimming (if turned on) should be performed.
* **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with both an R1 [or forward] and R2 [or reverse]) and single end data (only R1 [forward], or BAM). This will affect lane merging if different per library.
* **Organism:** A text string of the organism name of the sample or 'NA'. This currently has no functionality and can be set to 'NA', but will affect lane/library merging if different per library.
* **Strandedness:** A text string indicating whether the library type is 'single' or 'double'. This will affect lane/library merging if different per library.
* **UDG_Treatment:** A text string indicating whether the library was generated with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library merging if different per library.
* **R1:** A text string of a file path pointing to a forward or R1 FASTQ file. This can be used with the R2 column. File names **must be unique**, even if they are in different directories.
* **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file, or 'NA' when single end data. This can be used with the R1 column. File names **must be unique**, even if they are in different directories.
* **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot be specified at the same time as R1 or R2, both of which should be set to 'NA'.

For example, the following TSV table:

| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1                                                             | R2                                                             | BAM |
|-------------|------------|------|------------------|---------|----------|--------------|---------------|----------------------------------------------------------------|----------------------------------------------------------------|-----|
| JK2782      | JK2782     | 7    | 4                | PE      | Mammoth  | double       | full          | data/JK2782_TGGCCGATCAACGA_L007_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA  |
| JK2782      | JK2782     | 8    | 4                | PE      | Mammoth  | double       | full          | data/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz | NA  |
| JK2802      | JK2802     | 7    | 4                | PE      | Mammoth  | double       | full          | data/JK2802_AGAATAACCTACCA_L007_R1_001.fastq.gz.tengrand.fq.gz | data/JK2802_AGAATAACCTACCA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA  |
| JK2802      | JK2802     | 8    | 4                | SE      | Mammoth  | double       | full          | data/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz | NA                                                             | NA  |

will have the following effects:

* After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 _with the same `SeqType`_ (and all other _metadata_ columns) will be concatenated together for each **Library**.
* After mapping, and prior to BAM filtering, BAM files with different `SeqType` (but with all other metadata columns the same) will be merged together for each **Library**.
* After duplicate removal, BAM files with different `Library_ID`s but with the same `Sample_Name` and the same `UDG_Treatment` will be merged together.
* If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and half-UDG) will be merged with UDG-treated (untrimmed) BAMs, if they have the same `Sample_Name`.

Note the following important points and limitations for setting up:

* The TSV must use actual tabs (not spaces) between cells.
* The input FASTQ filenames are discarded after FastQC. All other downstream results files are named based on the `Sample_Name`, `Library_ID` and `Lane` columns.
* _File_ names must be unique regardless of file path, due to risk of over-writing (see: [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)).
  * At different stages of the merging process (as above), nf-core/eager will build output filenames from the information in the `Sample_Name`, `Library_ID` and/or `Lane` columns.
  * Library_IDs must be unique (other than if they are spread across multiple lanes). For example, your .tsv file must not contain rows with `Library1` in the Library_ID column for **both** `SampleA` and `SampleB` in the Sample_Name column, otherwise the two `Library1.fq.gz` files may result in a filename collision.
  * If it is 'too late' and you already have duplicated FASTQ file names before starting a run, a workaround is to concatenate the FASTQ files together and supply this to an nf-core/eager run. The only downside is that you will not get independent FastQC results for each file.
* Lane IDs must be unique for each sequencing of each library.
  * If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly.
  * This also applies to the SeqType column, i.e. with the example above, if one run is PE and one run is SE, you need to give fake lane IDs to one of the runs as well.
* All _BAM_ files must be specified as `SE` under `SeqType`.
  * You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses.
* nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration.
* Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files (unless you use `--run_convertbam`), as only FASTQ files are lane-merged together.
* nf-core/eager is able to correctly handle libraries that are sequenced multiple times on different sequencing configurations (i.e. mixtures of single- and paired-end data). These will be merged after mapping and considered 'paired-end' during downstream processes.
  * **Important** we do not recommend choosing to use DeDup (i.e. `--dedupper 'dedup'`) when mixing PE and SE data, as SE data will not necessarily have the correct end position of the read, and DeDup requires both ends of the molecule to remove a duplicate read. Therefore you may end up with inflated (false-positive) coverages due to suboptimal deduplication.
  * When you wish to run PE/SE data together, the default `--dedupper 'markduplicates'` is therefore preferred, as it only looks at the first position. While more conservative (i.e. it'll remove more reads even if not technically duplicates, because it assumes it can't see the true ends of molecules), it is more consistent.
  * An error will be thrown if you try to merge both PE and SE and also supply `--skip_merging`.
  * If you truly want to mix SE data and PE data but using mate-pair info for PE mapping, please run FASTQ preprocessing and mapping manually and supply BAM files for downstream processing by nf-core/eager.
  * If you _regularly_ want to run the situation above, please leave a feature request on GitHub.
* DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior to same-treated library merging).
* nf-core/eager functionality such as `--run_trim_bam` will be applied only to non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.
* Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.).
* Genotyping will typically be performed on each `sample` independently, as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype separately. In this case you **must** give the SS and DS libraries _distinct_ `Sample_Name`s, otherwise you will receive a `file collision` error in steps such as `sexdeterrmine`. You will then need to merge these yourself. We will consider changing this behaviour in the future if there is enough interest.
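
Some of the formatting rules above can be checked mechanically before a run. The following is a minimal sketch rather than anything shipped with nf-core/eager: the function name `check_eager_tsv` is hypothetical, and it assumes the 11-column layout and header row shown in the tables above. It only covers tab separation, stray spaces around cell values, the `SeqType` values, and the rule that BAM rows use `SE` with `NA` in `R1` and `R2`.

```bash
# Hypothetical helper: report basic formatting problems in an input TSV.
# Prints one message per problem and returns non-zero if any were found.
check_eager_tsv() {
  awk -F'\t' '
    NR == 1 { next }                        # skip the header row
    NF != 11 {                              # spaces instead of tabs end up here
      printf "line %d: expected 11 tab-separated columns, found %d\n", NR, NF
      bad = 1; next
    }
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^ / || $i ~ / $/) {
          printf "line %d: column %d has leading or trailing spaces\n", NR, i
          bad = 1
        }
      if ($5 != "PE" && $5 != "SE") {       # column 5 is SeqType
        printf "line %d: SeqType must be PE or SE\n", NR
        bad = 1
      }
      # Columns 9, 10 and 11 are R1, R2 and BAM.
      if ($11 != "NA" && ($5 != "SE" || $9 != "NA" || $10 != "NA")) {
        printf "line %d: BAM rows need SeqType SE and NA in R1 and R2\n", NR
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Usage:
# check_eager_tsv my_samples.tsv && echo "no formatting problems found"
```

A clean result here does not guarantee a valid run, as the merging-related constraints (unique lanes, library IDs and file names) are not covered by this sketch.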

## Clean up

Once a run has completed, you will have _lots_ of (some very large) intermediate
files in your output directory. These are stored within the directory named
`work`.

After you have verified your run completed correctly and everything in the
module output directories is present as you expect and need, you can perform a
clean-up.

> **Important**: Once clean-up is completed, you will _not_ be able to re-run
> the pipeline from an earlier step and you'll have to re-run from scratch.

While in your output directory, firstly verify you're only deleting files stored
in `work/` with the dry run command:

```bash
nextflow clean -n
```

> :warning: some institutional profiles already have clean-up on successful run
> completion turned on by default.

If you're ready, you can then remove the files with

```bash
nextflow clean -f -k
```

This will make your system administrator very happy as you will _halve_ the
hard drive footprint of the run, so be sure to do this!

## Troubleshooting and FAQs

### I get a file name collision error during merging

When using TSV input, nf-core/eager will attempt to merge all `Lanes` of a
`Library_ID`, or all files with the same `Library_ID` or `Sample_Name`. However,
if you have specified the same `Lane` or `Library_ID` for two sets of FASTQ
files you will likely receive an error such as

```bash
Error executing process > 'library_merge (JK2782)'
Caused by:
  Process `library_merge` input file name collision -- There are multiple input files for each of the following file names: JK2782.mapped_rmdup.bam.csi, JK2782.mapped_rmdup.bam
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Execution cancelled -- Finishing pending tasks before exit
```

In this case: for lane merging errors, you can give 'fake' lane IDs to ensure
they are unique (e.g. if one library was sequenced on Lane 8 of two HiSeq runs,
specify lanes as 8 and 16 for each FASTQ file respectively). For library merging
errors, you must modify your `Library_ID`s accordingly, to make them unique.
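
To spot such clashes before launching a run, the TSV can be scanned for repeated identifiers. This is a minimal sketch under stated assumptions: the function name `find_tsv_id_collisions` is hypothetical, the file is assumed to have a header row, and `Sample_Name`, `Library_ID` and `Lane` are assumed to be the first three tab-separated columns, as in the tables above.

```bash
# Hypothetical helper: list identifier clashes in an input TSV.
find_tsv_id_collisions() {
  awk -F'\t' '
    NR == 1 { next }                        # skip the header row
    {
      sample = $1; library = $2; lane = $3
      # The same Library_ID and Lane combination on more than one row
      # (reported once, on its second occurrence).
      if (seen_lane[library, lane]++ == 1)
        printf "Library_ID %s: Lane %s is used more than once\n", library, lane
      # The same Library_ID listed under a second Sample_Name.
      if ((library in owner) && owner[library] != sample && !reported[library]++)
        printf "Library_ID %s: used for more than one Sample_Name\n", library
      if (!(library in owner)) owner[library] = sample
    }
  ' "$1"
}

# Usage: no output means no clash of these two kinds was found.
# find_tsv_id_collisions my_samples.tsv
```

Each reported line points at rows that would need a 'fake' lane ID or a renamed `Library_ID`, as described above.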

### A library or sample is missing in my MultiQC report

In some cases a particular tool may not produce an output log for MultiQC. The affected sample will therefore not be displayed.

Known cases include:

* Qualimap: there will be no MultiQC output if the BAM file is empty. An empty BAM file is produced when no reads map to the reference and causes Qualimap to crash - this crash is ignored by nf-core/eager (to allow the rest of the pipeline to continue) and will therefore have no log file for that particular sample/library.

## Tutorials

### Tutorial - How to investigate a failed run

As with most pipelines, nf-core/eager can sometimes fail, either through a
problem with the pipeline itself or through an issue with the program being
run at the given step.

To help try and identify what has caused the error, you can perform the
following steps before reporting the issue:

#### 1a Nextflow reports an 'error executing process' with command error

Firstly, take a moment to read the terminal output that is printed by an
nf-core/eager command.

When reading the following, you can see that the actual _command_ failed. When
you get this error, this would suggest that an actual program used by the
pipeline has failed. This is identifiable when you get an `exit status` and a
`Command error:`, the latter of which is what is reported by the failed program
itself.

```bash
ERROR ~ Error executing process > 'circulargenerator (hg19_complete_500.fasta)'

Caused by:
  Process `circulargenerator (hg19_complete_500.fasta)` terminated with an error exit status (1)

Command executed:

  circulargenerator -e 500 -i hg19_complete.fasta -s MT
  bwa index hg19_complete_500.fasta

Command exit status:
  1

Command output:
  (empty)

Command error:
  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuffer.append(StringBuffer.java:270)
        at CircularGenerator.extendFastA(CircularGenerator.java:155)
        at CircularGenerator.main(CircularGenerator.java:119)

Work dir:
  /projects1/microbiome_calculus/RIII/03-preprocessing/mtCap_preprocessing/work/7f/52f33fdd50ed2593d3d62e7c74e408

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
```

If you find it is a common error try and fix it yourself by changing your
options in your nf-core/eager run - it could be a configuration error on your
part. However in some cases it could be an error in the way we've set up the
process in nf-core/eager.

To further investigate, go to step 2.

#### 1b Nextflow reports an 'error executing process' with no command error

Alternatively, you may get an error with Nextflow itself. The most common one
is a failed process, which looks like the following.

```bash
Error executing process > 'library_merge (JK2782)'
Caused by:
  Process `library_merge` input file name collision -- There are multiple input files for each of the following file names: JK2782.mapped_rmdup.bam.csi, JK2782.mapped_rmdup.bam
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Execution cancelled -- Finishing pending tasks before exit
```

However, in this case there is no `exit status` or `Command error:` message,
which indicates a Nextflow issue.

The example above is because a user has specified multiple sequencing runs of
different libraries but with the same library name. In this case Nextflow could
not identify which is the correct file to merge because they have the same name.

This again can also be a user or Nextflow error, but the errors are often more
abstract and less clear how to solve (unless you are familiar with Nextflow).

Try to investigate a bit further and see if you can understand what the error
refers to, but if you cannot - please ask on the #eager channel on the [nf-core
slack](https://nf-co.re/join/slack) or leave a [GitHub
issue](https://github.com/nf-core/eager/issues).

#### 2 Investigating a failed process's `work/` directory

If you haven't found a clear solution to the failed process from the reported
errors, you can next go into the directory where the process was working in,
and investigate the log and error messages that are produced by each command of
the process.

For example, in the error in
[1a](#1a-nextflow-reports-an-error-executing-process-with-command-error) you can
see the following line

```bash
Work dir:
  /projects1/microbiome_calculus/RIII/03-preprocessing/mtCap_preprocessing/work/7f/52f33fdd50ed2593d3d62e7c74e408
```

> A shortened version of the 'hash' directory ID can also be seen in your
> terminal while the pipeline is running in the square brackets at the beginning
> of each line.
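
If you only have the shortened hash from the terminal, shell globbing can expand it to the full directory. A minimal sketch, assuming the short form looks like `7f/52f33f` and the work directory is `./work` unless given otherwise; the function name `resolve_work_dir` is hypothetical.

```bash
# Hypothetical helper: expand a shortened task hash to the matching work
# directory (or directories). Prints nothing if there is no match.
resolve_work_dir() {
  # $1: short hash as shown in the terminal, e.g. 7f/52f33f
  # $2: work directory root (optional, defaults to ./work)
  for d in "${2:-work}/$1"*/; do
    if [ -d "$d" ]; then
      printf '%s\n' "${d%/}"   # drop the trailing slash added by the glob
    fi
  done
}

# Usage:
# cd "$(resolve_work_dir 7f/52f33f)"
```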

If you change into this with `cd` and run `ls -la` you should see a collection
of normal files, symbolic links (symlinks) and hidden files (indicated with `.`
at the beginning of the file name).

* Symbolic links: are typically input files from previous processes.
* Normal files: are typically successfully completed output files from some of
  the commands in the process.
* Hidden files: are Nextflow generated files and include the submission commands
  as well as log files.

When a run has failed, you can firstly check the contents of the output
files to see if they are empty or not (e.g. with `cat` or `zcat`). How to
interpret them depends on the program, and therefore on your knowledge of
it.

Next, you can investigate `.command.err` and `.command.out`, or `.command.log`.
These represent the standard out or error (in the case of `.log`, both combined)
of all the commands/programs in the process - i.e. what would be printed to
screen if you were running the command/program yourself. Again, view these with
e.g. `cat` and see if you can identify the error of the program itself.
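
The checks described above can be bundled into one small function. This is only a sketch: the function name `show_task_logs` and the 20-line limit are arbitrary choices for illustration, and it relies only on the hidden files named in this section.

```bash
# Hypothetical helper: show the script and the end of the stderr/stdout logs
# of a single task, given the path to its work directory.
show_task_logs() {
  for f in .command.sh .command.err .command.out; do
    printf '==> %s <==\n' "$f"
    if [ -s "$1/$f" ]; then
      tail -n 20 "$1/$f"        # last 20 lines are usually enough for an error
    else
      echo "(empty or missing)"
    fi
  done
}

# Usage:
# show_task_logs work/7f/52f33fdd50ed2593d3d62e7c74e408
```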

Finally, you can also try running the commands _yourself_. You can firstly try
to do this by loading your given nf-core/eager environment (e.g.
`singularity shell /<path>/<to>/nf-core-eager-X-X-X.img` or
`conda activate nf-core-eager-X.X.X`), then running `bash .command.sh`.

If this doesn't work, this suggests either there is something wrong with the
nf-core/eager environment configuration, _or_ there is still a problem with the
program itself. To confirm the former, try running the command within the
`.command.sh` file (viewable with `cat`) but with locally installed versions of
programs you may already have on your system.

If the command still doesn't work, it is a problem with the program or your
specified configuration. In this case, please ask the developer of the tool
(although we will endeavour to help as much as we can via the [nf-core
slack](https://nf-co.re/join/slack) in the #eager channel).

If it does work locally, the nf-core/eager environment is the likely cause, so
please report it as a [GitHub issue](https://github.com/nf-core/eager/issues).

### Tutorial - What are profiles and how to use them

#### Tutorial Profiles - Background

A useful feature of Nextflow is the ability to use configuration _profiles_ that
can specify many default parameters and other settings on how to run your
pipeline.

For example, you can use it to set your preferred mapping parameters, or specify
where to keep Docker, Singularity or Conda environments, and which cluster
scheduling system (and queues) your pipeline runs should normally use.

These are defined in `.config` files, and these in turn can contain different
profiles that can define parameters for different contexts.

For example, a `.config` file could contain two profiles, one for
shallow-sequenced samples that uses only a small number of CPUs and memory e.g.
`small`, and another for deep sequencing data, `deep`, that allows larger
numbers of CPUs and memory. As another example you could define one profile
called `loose` that contains mapping parameters to allow reads with aDNA damage
to map, and then another called `strict` that reduces the likelihood of damaged
DNA to map and cause false positive SNP calls.

Within nf-core, there are two main levels of configs

* Institutional-level profiles: these normally define things like paths to
  common storage, resource maximums, scheduling system
* Pipeline-level profiles: these normally define parameters specifically for a
  pipeline (such as mapping parameters, turning specific modules on or off)

As well as allowing more efficiency and control at cluster or institutional
levels in terms of memory usage, pipeline-level profiles can also assist in
facilitating reproducible science by giving researchers a way to 'publish'
their exact pipeline parameters in a way that other users can automatically
re-run the pipeline with the parameters used in the original publication, but
on their _own_ cluster.

To illustrate this, let's say we analysed our data on an HPC called 'blue' for
which an institutional profile already exists, and for our analysis we defined a
profile called 'old_dna'. We would have run our pipeline with the following
command:

```bash
nextflow run nf-core/eager -c old_dna_profile.config -profile hpc_blue,old_dna <...>
```

Suppose a colleague then wishes to recreate our results. As long as the
`old_dna_profile.config` was published alongside our results, they can run the
same pipeline settings, but on their own cluster, HPC 'purple'.

```bash
nextflow run nf-core/eager -c old_dna_profile.config -profile hpc_purple,old_dna <...>
```

(where the `old_dna` profile is defined in `old_dna_profile.config`, and
`hpc_purple` is defined on nf-core/configs)

This tutorial will describe how to create and use profiles that can be shared
with, or obtained from, other researchers.

#### Tutorial Profiles - Inheritance Rules

##### Tutorial Profiles - Profiles

An important thing to understand before you start writing your own profile is
the 'inheritance' of profiles when you specify multiple profiles with
`nextflow run`.

When specifying multiple profiles, parameters defined in the profile in the
first position will be overwritten by those in the second, and everything
defined in the first and second will be overwritten by everything in a third.

This can be illustrated as follows.

```bash
              overwrites  overwrites
               ┌──────┐   ┌──────┐
               ▼      │   ▼      │
-profile institution,cluster,my_paper
```

For example, suppose the three profiles defined the following parameters:

| Parameter       | Resolved Parameters    | institution | cluster  | my_paper |
| ----------------|------------------------|-------------|----------|----------|
| --executor      | singularity            | singularity | \<none\> | \<none\> |
| --max_memory    | 256GB                  | 756GB       | 256GB    | \<none\> |
| --bwa_aln       | 0.1                    | \<none\>    | 0.01     | 0.1      |

(where '\<none\>' is a parameter not defined in a given profile.)

You can see that the `0.1` defined in `my_paper` was used in preference to the
`0.01` defined in the `cluster` profile.

> :warning: You must always check if parameters are defined in any 'upstream'
> profiles that have been set by profile administrators that you may be unaware
> of. This is to make sure there are no unwanted or unreported 'defaults' away from
> original nf-core/eager defaults.
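
The override rule in the table above can be mimicked with a small toy script. This is only an emulation of the rule as described here, not of how Nextflow resolves configuration internally; the `settings` list and the `resolve` helper are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Toy emulation of the left-to-right override rule described above.
# It only reproduces the table; it is not how Nextflow is implemented.

# One "profile<TAB>parameter<TAB>value" line per setting in the table.
settings=$(printf '%s\t%s\t%s\n' \
  institution executor   singularity \
  institution max_memory 756GB \
  cluster     max_memory 256GB \
  cluster     bwa_aln    0.01 \
  my_paper    bwa_aln    0.1)

resolve() {
  # $1: comma-separated profile list, in the order given to -profile
  local order=$1
  printf '%s\n' "$settings" | awk -F'\t' -v order="$order" '
    BEGIN { n = split(order, names, ","); for (i = 1; i <= n; i++) rank[names[i]] = i }
    # keep a value only if its profile sits further right than the one seen so far
    ($1 in rank) && rank[$1] >= best[$2] { best[$2] = rank[$1]; value[$2] = $3 }
    END { for (p in value) print p "=" value[p] }
  ' | sort
}

resolve institution,cluster,my_paper
# prints:
# bwa_aln=0.1
# executor=singularity
# max_memory=256GB
```

Calling `resolve my_paper,cluster,institution` instead resolves `bwa_aln` to `0.01`, which shows why the order of the profiles matters under this rule.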

##### Tutorial Profiles - Configuration Files

> :warning: This section is only needed for users that want to set up
> institutional-level profiles. Otherwise please skip to [Writing your own profile](#tutorial-profiles---writing-your-own-profile)

In actuality, an nf-core/eager run already contains many configs and profiles,
and will normally use _multiple_ configs and profiles in a single run. Multiple
configuration files and profiles can be used, and each new one selected will
inherit all the previous one's parameters, and the parameters in the new one
will then overwrite any that have been changed from the original.

This can be visualised here

<p align="center">
  <img src="images/tutorials/profiles/config_profile_inheritence.png" width="75%" height = "75%">
</p>

Using the example given in the [background](#tutorial-profiles---background),
suppose the `hpc_blue` profile has the following pipeline parameters set:

```txt
<...>
mapper = 'bwamem'
dedupper = 'markduplicates'
<...>
```

The profile `old_dna`, however, has only the following parameter:

```txt
<...>
mapper = 'bwaaln'
<...>
```

Suppose the pipeline is then run with the profiles in the order given in the
following command:

```bash
nextflow run nf-core/eager -c old_dna_profile.config -profile hpc_blue,old_dna <...>
```

In the background, any parameters in the pipeline's `nextflow.config`
(containing default parameters) will be overwritten by the
`old_dna_profile.config`. In addition, the `old_dna` _profile_ will overwrite
any parameters set in the config but outside the profile definition of
`old_dna_profile.config`.

Therefore, the final profile used by your given run would look like:

```txt
<...>
mapper = 'bwaaln'
dedupper = 'markduplicates'
<...>
```

You can see here that `markduplicates` has not changed as originally defined in
the `hpc_blue` profile, but the `mapper` parameter has been changed from
`bwamem` to `bwaaln`, as specified in the `old_dna` profile.

The order of loading of different configuration files can be seen here:

| Loading Order | Configuration File                                                                                              |
| -------------:|:----------------------------------------------------------------------------------------------------------------|
| 1             | `nextflow.config` in your current directory                                                                     |
| 2             | (if using a script for `nextflow run`) a `nextflow.config` in the directory the script is located               |
| 3             | `config` stored in your home directory under `~/.nextflow/`                                                     |
| 4             | `<your_file>.config` if you specify in the `nextflow run` command with `-c`                                     |
| 5             | general nf-core institutional configurations stored at [nf-core/configs](https://github.com/nf-core/configs)    |
| 6             | pipeline-specific nf-core institutional configurations at [nf-core/configs](https://github.com/nf-core/configs) |

The loading order of these `.config` files will not normally affect the
settings you use for the pipeline run itself; the profiles you select with
`-profile` are normally more important. However, this is good to keep in mind
when you need to debug profiles
if your run does not use the parameters you expect.
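
When debugging, it can help to print the configuration that Nextflow resolves for a given profile combination before starting a run. The sketch below assumes that the `nextflow config` sub-command of your installed Nextflow version accepts a pipeline name together with `-profile`, and it reuses the hypothetical `hpc_blue` and `old_dna` names from the background section. Treat the exact invocation as an assumption to check against `nextflow config -h`.

```bash
#!/usr/bin/env bash
# Assemble a command that prints the resolved configuration for a profile combination.
# 'hpc_blue', 'old_dna' and the config file name are the hypothetical examples used above.
custom_config='old_dna_profile.config'
profiles='hpc_blue,old_dna'
cmd=(nextflow -c "$custom_config" config nf-core/eager -profile "$profiles")

# Show the command, then run it only if Nextflow and the config file are available.
echo "${cmd[*]}"
if command -v nextflow >/dev/null 2>&1 && [ -f "$custom_config" ]; then
  "${cmd[@]}"
fi
```

Comparing the printed `params` block against the values you expected is usually the quickest way to spot an unwanted upstream default.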

> :warning: It is also possible to ignore every other configuration file when
> specifying a custom `.config` file by using `-C` (capital C) instead of `-c`
> (which inherits previously specified parameters).

Another thing that is important to note is that if a specific _profile_ is
specified in `nextflow run`, this replaces any 'global' parameter that is
specified within the config file (but outside a profile) itself - **regardless**
of profile order (see above).

For example, see the example adapted from the SHH nf-core/eager
pipeline-specific
[configuration](https://github.com/nf-core/configs/blob/master/conf/pipeline/eager/shh.config).

This pipeline-specific profile is automatically loaded if nf-core/eager detects
we are running eager, and that we specified the profile as `shh`.

```txt
// global 'fallback' parameters
params {
  // Specific nf-core/configs params
  config_profile_contact = 'James Fellows Yates (@jfy133)'
  config_profile_description = 'nf-core/eager SHH profile provided by nf-core/configs'
  
  // default BWA
  bwaalnn = 0.04
  bwaalnl = 32
}

// profile specific parameters
profiles {
  pathogen_loose {
    params {
      config_profile_description = 'Pathogen (loose) MPI-SHH profile, provided by nf-core/configs.'
      bwaalnn = 0.01
      bwaalnl = 16
    }
  }
}

```

If you run with `nextflow run -profile shh` to use the institutional-level
nf-core config, the parameters will be read as `--bwaalnn 0.04` and
`--bwaalnl 32`, as these are the default 'fallback' params indicated in the
example above.

If you specify as `nextflow run -profile shh,pathogen_loose`, as expected
Nextflow will resolve the two parameters as `0.01` and `16`.

Importantly however, if you specify `-profile pathogen_loose,shh` the
`pathogen_loose` **profile** will **still** take precedence over just the
'global' params.

This is because selecting a `profile` will always take precedence over the
values specified in a config file, but outside of a profile.

Equally, a **process**-level defined parameter (within the nf-core/eager code
itself) will take precedence over the fallback parameters in the `config` file.
This is also described in the [Nextflow
documentation](https://www.nextflow.io/docs/latest/config.html#config-profiles).

#### Tutorial Profiles - Writing your own profile

We will now provide an example of how to write, use and share a project specific
profile. We will use the example of [Andrades Valtueña et al.
2017](https://doi.org/10.1016/j.cub.2017.10.025).

In it they used the original EAGER (v1) to map ancient DNA reads to the
genome of the bacterium **Yersinia pestis**.

We will now generate a profile that, had they been using nf-core/eager, they
could have shared with other researchers.

In the methods they described the following:

> "... reads mapped to **Y. pestis** CO92 reference with BWA aln (-l 16, -n 0.01,
> hereby referred to as non-UDG parameters). Reads with mapping quality scores
> lower than 37 were filtered out. PCR duplicates were removed with
> MarkDuplicates."

Furthermore, in their 'Table 1' they say they used the NCBI **Y. pestis** genome
'NC_003143.1', which can be found on the NCBI FTP server at:
[https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.fna.gz](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.fna.gz)

To make a profile with these parameters for use with nf-core/eager we first need
to open a text editor, and define a Nextflow 'profile' block.

```txt
profiles {

}

```

Next we need to define the name of the profile. This is what we would write in
`-profile`. Let's call this AndradesValtuena2018.

```txt
profiles {
  AndradesValtuena2018 {

  }
}
```

Now we need to make a `params` 'scope'. This means these are the parameters you
specifically pass to nf-core/eager itself (rather than Nextflow configuration
parameters).

You should generally not add [non-`params`
scopes](https://www.nextflow.io/docs/latest/config.html?highlight=profile#config-scopes)
in profiles for a specific project. This is because these will normally modify
the way the pipeline will run on the computer (rather than just nf-core/eager
itself, e.g. the scheduler/executor or maximum memory available), and thus not
allow other researchers to reproduce your analysis on their own
computer/clusters.

```txt
profiles {
  AndradesValtuena2018 {
    params {

    }
  }
}
```

Now, as a cool little trick, we can use a couple of nf-core specific parameters
that can help you keep track of which profile you are using when running the
pipeline. The `config_profile_description` and `config_profile_contact`
parameters are displayed in the console log when running the pipeline, so you
can use these to check if your profile loaded as expected. These are free-text
fields, so you can put what you like.

```txt
profiles {
  AndradesValtuena2018 {
    params {
        config_profile_description = 'non-UDG parameters used in Andrades Valtuena et al. 2018 Curr. Bio.'
        config_profile_contact = 'Aida Andrades Valtueña (@aidaanva)'
    }
  }
}
```

Now we can add the specific nf-core/eager parameters that will modify the
mapping and deduplication parameters in nf-core/eager.

```txt
profiles {
  AndradesValtuena2018 {
    params {
        config_profile_description = 'non-UDG parameters used in Andrades Valtuena et al. 2018 Curr. Bio.'
        config_profile_contact = 'Aida Andrades Valtueña (@aidaanva)'
        fasta = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.fna.gz'
        bwaalnn = 0.01
        bwaalnl = 16
        run_bam_filtering = true
        bam_mapping_quality_threshold = 37
        dedupper = 'markduplicates'
    }
  }
}
```

Once filled in, we can save the file as `AndradesValtuena2018.config`. This you
can use yourself, or upload alongside your publication for others to use.

To use the profile you just need to specify the file containing the profile you
wish to use, and then the profile itself.

For example, Aida (Andrades Valtueña) at the MPI-SHH (`shh`) in Jena could run
the following:

```bash
nextflow run nf-core/eager -c /<path>/<to>/AndradesValtuena2018.config -profile shh,AndradesValtuena2018 --input '/<path>/<to>/<some_input>/' <...>
```

Then a colleague at a different institution, such as the SciLifeLab, could run
the same profile on the UPPMAX cluster in Uppsala with:

```bash
nextflow run nf-core/eager -c /<path>/<to>/AndradesValtuena2018.config -profile uppmax,AndradesValtuena2018 --input '/<path>/<to>/<some_input>/' <...>
```

And that's all there is to it. Of course you should always check that no other
'default' parameters for your given pipeline are defined in any
pipeline-specific or institutional profiles. This ensures that someone
re-running the pipeline with your settings is as close to the nf-core/eager
defaults as possible, and only settings specific to your given project are used.
If there are 'upstream' defaults, you should explicitly specify these in your
project profile.
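
To make that check easier, the sketch below lists the parameter names a project profile sets, so each one can be looked up in the institutional and pipeline-specific configs you use. The `list_params` helper is a hypothetical name, and the simple `name = value` pattern is an assumption about how the file is laid out.

```bash
#!/usr/bin/env bash
# List the parameter names set in a config file, one per line.
# Assumes each parameter sits on its own line in the form "name = value".
list_params() {
  # $1: path to a .config file
  sed -n 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\)[[:space:]]*=.*/\1/p' "$1" | sort -u
}

# Example call using the file name from this tutorial, if it exists
if [ -f AndradesValtuena2018.config ]; then
  list_params AndradesValtuena2018.config
fi
```

Any name that also appears in an upstream config with a different value is a candidate for being stated explicitly in your project profile.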

### Tutorial - How to set up nf-core/eager for human population genetics

#### Tutorial Human Pop-Gen - Introduction

This tutorial will give a basic example on how to set up nf-core/eager to
perform initial screening of samples in the context of ancient human population
genetics research.

> :warning: This tutorial does not describe how to install and set up
> nf-core/eager. For this please see other documentation on the
> [nf-co.re](https://nf-co.re/usage/installation) website.

We will describe how to set up mapping of ancient sequences against the human
reference genome to allow sequencing and library quality-control, estimation of
nuclear contamination, genetic sex determination, and production of random draw
genotypes in eigenstrat format for a specific set of sites, to be used in
further analysis. For this example, I will be using the 1240k SNP set. This SNP
set was first described in [Mathieson et al.
2015](https://www.nature.com/articles/nature16152) and contains various
positions along the genome that have been extensively genotyped in present-day
and ancient populations, and are therefore useful for ancient population genetic
analysis. Capture techniques are often used to enrich DNA libraries for
fragments that overlap these SNPs, and we assume this has been performed in
this example.

> :warning: Please be aware that the settings used in this tutorial may not use
> settings nor produce files you would actually use in 'real' analysis. The
> settings are only specified for demonstration purposes. Please consult
> your colleagues, communities and the literature for optimal parameters.

#### Tutorial Human Pop-Gen - Preparation

Prior to setting up the nf-core/eager run, we will need:

1. Raw sequencing data in FASTQ format
2. Reference genome in FASTA format, with associated pre-made `bwa`, `samtools`
   and `picard SequenceDictionary` indices (however note these can be made for
   you with nf-core/eager, but this can make a pipeline run take much longer!)
3. A BED file with the positions of the sites of interest.
4. An eigenstrat formatted `.snp` file for the positions of interest.
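
Before launching, a quick pre-flight check can confirm that the files listed above are where you expect them. The following is a minimal sketch: `check_inputs` is a hypothetical helper, and the paths in the example call are the ones used later in this tutorial, which you would replace with your own.

```bash
#!/usr/bin/env bash
# Report which of the expected input files are missing before launching a run.
check_inputs() {
  # "$@": paths that should exist
  local missing=0 path
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path" >&2
      missing=$((missing + 1))
    fi
  done
  # succeed only if nothing was missing
  [ "$missing" -eq 0 ]
}

# Example call with the file names used in this tutorial
check_inputs \
  'preprocessing20200727.tsv' \
  '../Reference/genome/hs37d5.fa' \
  '../Reference/genome/hs37d5.fa.fai' \
  '../Reference/genome/hs37d5.dict' \
  '../Reference/genome/1240k.sites.bed' \
  || echo "Some inputs are missing; see the messages above." >&2
```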

We should also ensure we have the very latest version of the nf-core/eager
pipeline so we have all the latest bugfixes etc. In this case we will be using
nf-core/eager version 2.2.0. You should always check on the
[nf-core](https://nf-co.re/eager) website whether a newer release has been made
(particularly point releases e.g. 2.2.1).

```bash
nextflow pull nf-core/eager -r 2.2.0
```

It is important to note that if you are planning on running multiple runs of
nf-core/eager for a given project, that the version should be **kept the same**
for all runs to ensure consistency in settings for all of your libraries.

#### Tutorial Human Pop-Gen - Inputs and Outputs

To start, let's make a directory where all your nf-core/eager related files for
this run will go, and change into it.

```bash
mkdir projectX_preprocessing20200727
cd projectX_preprocessing20200727
```

The first part of constructing any nf-core/eager run is specifying a few generic
parameters that will often be common across all runs. These are the pipeline,
version and _profile_ we will use. We will also specify a unique name for the
run, to help us keep track of all the nf-core/eager runs we may be running.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
<...>
```

For the `-profile` parameter, I have indicated that I wish to use Singularity as
my software container environment, and I will use the MPI-SHH institutional
config as listed on
[nf-core/configs](https://github.com/nf-core/configs/blob/master/conf/shh.config).
These profiles specify settings
optimised for the specific cluster/institution, such as maximum memory available
or which scheduler queues to submit to. More explanations about configs and
profiles can be seen in the [nf-core
website](https://nf-co.re/usage/configuration) and the [profile
tutorial](#tutorial---what-are-profiles-and-how-to-use-them).

Next we need to specify our input data. nf-core/eager can accept input FASTQ
files in two main ways, either with direct paths to files (with wildcards), or
with a Tab-Separated-Value (TSV) file which contains the paths and extra
metadata. In this example, we will use the TSV method, to simulate a
realistic use-case, such as receiving paired-end data from an Illumina NextSeq
of double-stranded libraries. Illumina NextSeqs sequence a given library across
four different 'lanes', so for each library you will receive four FASTQ files.
The TSV input method is more useful for this context, as it allows 'merging' of
these lanes after preprocessing and prior to mapping (whereas direct paths will
consider each pair of FASTQ files as independent libraries/samples).

Our TSV file will look something like the following:

```bash
Sample_Name     Library_ID      Lane    Colour_Chemistry        SeqType Organism        Strandedness    UDG_Treatment   R1      R2      BAM
EGR001  EGR001.B0101.SG1        1       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L001_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L001_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        2       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L002_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L002_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        3       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L003_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L003_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        4       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L004_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L004_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        5       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L001_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L001_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        6       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L002_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L002_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        7       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L003_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L003_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        8       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L004_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L004_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        1       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L001_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L001_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        2       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L002_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L002_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        3       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L003_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L003_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        4       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L004_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L004_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        5       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L001_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L001_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        6       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L002_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L002_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        7       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L003_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L003_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        8       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L004_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L004_R2_001.fastq.gz NA
```

You can see that we have a single line for each pair of FASTQ files representing
each `Lane`, but the `Sample_Name` and `Library_ID` columns identify and group
them together accordingly. Secondly, as we have NextSeq data, we have specified
we have `2` for `Colour_Chemistry`, which is important for downstream processing
(see below). See the nf-core/eager
parameter documentation above for more specifications on how to set up a
TSV file (e.g. why despite NextSeqs
only having 4 lanes, we go up to 8 in the example above).
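
Mistakes in a hand-edited TSV are easy to make, so a basic sanity check can be worthwhile. The sketch below assumes a tab-separated file with the 11 columns shown above and, as the example suggests, that each `Lane` value should appear only once per `Library_ID`. The `check_tsv` helper is a hypothetical name, and it does not replace the pipeline's own input validation.

```bash
#!/usr/bin/env bash
# Basic sanity checks for an input TSV: 11 tab-separated columns per row,
# and no repeated Lane value within the same Library_ID.
check_tsv() {
  # $1: path to the TSV file
  awk -F'\t' '
    NR == 1 { next }    # skip the header row
    NF != 11 { printf "line %d: expected 11 columns, found %d\n", NR, NF; bad = 1 }
    seen[$2, $3]++ { printf "line %d: lane %s repeated for library %s\n", NR, $3, $2; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Example call using the file name from this tutorial, if it exists
if [ -f preprocessing20200727.tsv ]; then
  check_tsv preprocessing20200727.tsv
fi
```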

Alongside our input TSV file, we will also specify the paths to our reference
FASTA file and the corresponding indices.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
<...>
```

We specify the paths to the reference genome and its corresponding tool-specific
indices. Paths should always be encapsulated in quotes to ensure Nextflow
evaluates them, rather than your shell! Also note that as `bwa` generates
multiple index files, nf-core/eager takes a _directory_ that must contain these
indices instead.

> Note the difference between single and double `-` parameters. The former
> represent Nextflow flags, while the latter are nf-core/eager specific flags.

Finally, we can also specify the output directory and the Nextflow `work/`
directory (which contains 'intermediate' working files and directories).

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
<...>
```
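
Because the command keeps growing over the following steps, one optional way to manage it is to hold the options in a bash array rather than one long backslash-continued command, which avoids stray characters after a trailing `\`. This is only a convenience sketch; the `args` variable name is an arbitrary choice, and the values are those used so far in this tutorial.

```bash
#!/usr/bin/env bash
# Keep the growing list of options in a bash array.
args=(
  -r 2.2.0
  -profile singularity,shh
  -name 'projectX_preprocessing20200727'
  --input 'preprocessing20200727.tsv'
  --fasta '../Reference/genome/hs37d5.fa'
  --bwa_index '../Reference/genome/hs37d5/'
  --fasta_index '../Reference/genome/hs37d5.fa.fai'
  --seq_dict '../Reference/genome/hs37d5.dict'
)

# Options from later steps can be appended without retyping the command
args+=(--outdir './results/' -w './work/')

# Print the command for review; remove the leading 'echo' to actually run it
echo nextflow run nf-core/eager "${args[@]}"
```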

#### Tutorial Human Pop-Gen - Pipeline Configuration

Now that we have specified the input data, we can start moving onto specifying
settings for each different module we will be running. As mentioned above, we
are pretending to run with NextSeq data, which is generated with a two-colour
imaging technique. What this means is when you have shorter molecules than the
number of cycles of the sequencing chemistry, the sequencer will repeatedly see
'G' calls (no colour) at the last few cycles, and you get long poly-G 'tails' on
your reads. We therefore will turn on the poly-G clipping functionality offered
by [`fastp`](https://github.com/OpenGene/fastp), and any pairs of files
indicated in the TSV file as having `2` in the `Colour_Chemistry` column will be
passed to `fastp`. We will not change the default minimum length of a poly-G
string to be clipped.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
<...>
```

Since our input data is paired-end, we will be using `DeDup` for duplicate
removal, which takes into account both the start and end of a merged read before
flagging it as a duplicate. To ensure this works properly, we first need to
disable base quality trimming of collapsed reads within AdapterRemoval. To do
this, we will provide the option `--preserve5p`. Additionally, `DeDup` should
only be provided with merged reads, so we will need to provide the option
`--mergedonly` here as well. We can then specify which dedupper we want to use
with `--dedupper`.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
<...>
```

We then need to specify the mapping parameters for this run. The default mapping
parameters of nf-core/eager are fine for the purposes of our run. However,
personally, I like to set `--bwaalnn` to `0.01` (down from the default `0.04`),
which reduces the stringency in the number of allowed mismatches between the
aligned sequences and the reference.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
<...>
```

We may also want to remove ambiguous sequences from our alignments, and also
remove off-target reads to speed up downstream processing (and reduce your
hard-disk footprint). We can do this with the samtools filter module to set a
mapping-quality filter (e.g. with a value of `25` to retain only slightly
ambiguous alignments that might occur from damage), and to indicate to discard
unmapped reads.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
<...>
```

Next, we will set up trimming of the mapped reads to alleviate the effects of DNA
damage during genotyping. To do this we will activate trimming with
`--run_trim_bam`. The libraries in this example underwent 'half' UDG treatment.
This will generally restrict all remaining DNA damage to the first 2 base pairs
of a fragment. We will therefore use
`--bamutils_clip_double_stranded_half_udg_left` and
`--bamutils_clip_double_stranded_half_udg_right` to trim 2 bp on either side of
each fragment.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
<...>
```

To activate human sex determination (using
[Sex.DetERRmine.py](https://github.com/TCLamnidis/Sex.DetERRmine)) we will
provide the option `--run_sexdeterrmine`. Additionally, we will provide
sexdeterrmine with the BED file of our SNPs of interest using the
`--sexdeterrmine_bedfile` flag. Here I will use the 1240k SNP set as an example.
This will cut down on computational time and also provide us with an
error bar around the relative coverage on the X and Y chromosomes.
If you wish to use the same bedfile to follow along with this tutorial,
you can download the file from [here](https://github.com/nf-core/test-datasets/blob/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz).

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--run_sexdeterrmine \
--sexdeterrmine_bedfile '../Reference/genome/1240k.sites.bed' \
<...>
```

Similarly, we will activate nuclear contamination estimation with
`--run_nuclear_contamination`. This process requires us to also specify the
contig name of the X chromosome in the reference genome we are using with
`--contamination_chrom_name`. Here, we are using hs37d5, where the X chromosome
is simply named 'X'.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--run_sexdeterrmine \
--sexdeterrmine_bedfile '../Reference/genome/1240k.sites.bed' \
--run_nuclear_contamination \
--contamination_chrom_name 'X' \
<...>
```

Because nuclear contamination estimates can only be provided for males, it is
possible that we will need to get mitochondrial DNA contamination estimates for
any females in our dataset. This cannot be done within nf-core/eager (v2.2.0)
and we will need to do this manually at a later time. However, mtDNA
contamination estimates have been shown to only be reliable for nuclear
contamination when the ratio of mitochondrial to nuclear reads is low
([Furtwängler et al. 2018](https://doi.org/10.1038/s41598-018-32083-0)). We can
have nf-core/eager calculate that ratio for us with `--run_mtnucratio`, and
providing the name of the mitochondrial DNA contig in our reference genome with
`--mtnucratio_header`. Within hs37d5, the mitochondrial contig is named 'MT'.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--run_sexdeterrmine \
--sexdeterrmine_bedfile '../Reference/genome/1240k.sites.bed' \
--run_nuclear_contamination \
--contamination_chrom_name 'X' \
--run_mtnucratio \
--mtnucratio_header 'MT' \
<...>
```
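Both `--contamination_chrom_name` and `--mtnucratio_header` need to match the contig names of the reference exactly. As a minimal sketch (the `contig_in_fai` helper and the toy index with made-up lengths and offsets are illustrative assumptions, not part of nf-core/eager), the names can be checked against the first column of the `.fai` index before submitting:

```bash
# Hypothetical helper: succeeds only if the contig name appears in the first
# column of a samtools .fai index (column 1 holds the sequence names).
contig_in_fai() {
  local contig="$1" fai="$2"
  cut -f1 "$fai" | grep -qxF -- "$contig"
}

# Toy index standing in for '../Reference/genome/hs37d5.fa.fai'.
# Lengths and offsets are made up for illustration.
demo_fai="$(mktemp)"
printf '1\t1000\t3\t60\t61\nX\t800\t1023\t60\t61\nMT\t100\t1840\t60\t61\n' > "$demo_fai"

for contig in 'X' 'MT' 'chrX'; do
  if contig_in_fai "$contig" "$demo_fai"; then
    echo "found: $contig"
  else
    echo "missing: $contig"
  fi
done
```

With a real reference you would point the helper at the file given to `--fasta_index`, e.g. `contig_in_fai 'X' '../Reference/genome/hs37d5.fa.fai'`.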

Finally, we need to specify genotyping parameters. First, we need to activate
genotyping with `--run_genotyping`. It is also important to specify we wish to
use the **trimmed** data for genotyping, to avoid the effects of DNA damage. To
do this, we will specify the `--genotyping_source` as `'trimmed'`. Then we can
specify the genotyping tool to use with `--genotyping_tool`. We will be using
`'pileupcaller'` to produce random draw genotypes in eigenstrat format. For this
process we will need to specify a BED file of the sites of interest (the same as
before) with `--pileupcaller_bedfile`, as well as an eigenstrat formatted `.snp`
file of these sites that is specified with `--pileupcaller_snpfile`.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/hs37d5.fa' \
--bwa_index '../Reference/genome/hs37d5/' \
--fasta_index '../Reference/genome/hs37d5.fa.fai' \
--seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--preserve5p \
--mergedonly \
--dedupper 'dedup' \
--bwaalnn 0.01 \
--run_bam_filtering \
--bam_mapping_quality_threshold 25 \
--bam_unmapped_type 'discard' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--run_sexdeterrmine \
--sexdeterrmine_bedfile '../Reference/genome/1240k.sites.bed' \
--run_nuclear_contamination \
--contamination_chrom_name 'X' \
--run_mtnucratio \
--mtnucratio_header 'MT' \
--run_genotyping \
--genotyping_source 'trimmed' \
--genotyping_tool 'pileupcaller' \
--pileupcaller_bedfile '../Reference/genome/1240k.sites.bed' \
--pileupcaller_snpfile '../Datasets/1240k/1240k.snp'
```

With this, we are ready to submit! If running on a remote cluster/server, make
sure to run this in a `screen` session or similar, so that if you get an `ssh`
signal drop or want to log off, Nextflow will not crash.

#### Tutorial Human Pop-Gen - Results

Assuming the run completed without any crashes (if problems do occur, check
against the [parameters](https://nf-co.re/eager/parameters) documentation that all parameters are as expected, or
check the [FAQ](#troubleshooting-and-faqs)), we can now check our results in
`results/`.

##### Tutorial Human Pop-Gen - MultiQC Report

In here there are many different directories containing different output files.
The first directory to check is the `MultiQC/` directory. You should
find a `multiqc_report.html` file. You will need to view this in a web browser,
so I recommend either mounting your server to your file browser, or downloading
it to your own local machine (PC/Laptop etc.).

Once you've opened this you can go through each section and evaluate all the
results. You will likely want to check these for artefacts (e.g. weird damage
patterns on the human DNA, or weirdly skewed coverage distributions).

For example, I normally look for things like:

General Stats Table:

* Do I see the expected number of raw sequencing reads (summed across each set
  of FASTQ files per library) that was requested for sequencing?
* Does the percentage of trimmed reads look normal for aDNA, and do lengths
  after trimming look short as expected of aDNA?
* Does ClusterFactor or 'Dups' look high (e.g. >2 or >10% respectively)
  suggesting over-amplified or badly preserved samples?
* Do the mapped reads show increased frequency of C>Ts on the 5' end of
  molecules?
* Is the number of SNPs used for nuclear contamination really low for any
  individuals (e.g. < 100)? If so, then the estimates might not be very
  accurate.

FastQC (pre-AdapterRemoval):

* Do I see any very early drop off of sequence quality scores suggesting a
  problematic sequencing run?
* Do I see outlier GC content distributions?
* Do I see high sequence duplication levels?

AdapterRemoval:

* Do I see high numbers of singletons or discarded read pairs?

FastQC (post-AdapterRemoval):

* Do I see improved sequence quality scores along the length of reads?
* Do I see reduced adapter content levels?

Samtools Flagstat (pre/post Filter):

* Do I see outliers, e.g. with unusually high levels of human DNA, (indicative
  of contamination) that require downstream closer assessment? Are your samples
  exceptionally preserved? If not, a value higher than e.g. 50% might require
  your attention.

DeDup/Picard MarkDuplicates:

* Do I see large numbers of duplicates being removed, possibly indicating
  over-amplified or badly preserved samples?

DamageProfiler:

* Do I see evidence of damage on human DNA?
  * High numbers of mapped reads but no damage may indicate significant
    modern contamination.
  * Was the read trimming I specified enough to overcome damage effects?

SexDetERRmine:

* Do the relative coverages on the X and Y chromosome fall within the expected
  areas of the plot?
* Do all individuals have enough data for accurate sex determination?
* Do the proportions of autosomal/X/Y reads make sense? If there is an
  overrepresentation of reads within one bin, is the data enriched for that bin?

> Detailed documentation and descriptions for all MultiQC modules can be seen in
> the 'Documentation' folder of the results directory or here in the [output
> documentation](output.md)

If you're happy everything looks good in terms of sequencing, we then look at
specific directories to find any files you might want to use for downstream
processing.

Note that when you get back to writing up your publication, all the versions of
the tools can be found under the 'nf-core/eager Software Versions' section of
the MultiQC report. But be careful! All tools in the container are listed, so
you may have to remove some of them that you didn't actually use in the set up.

For example, in this tutorial, we have used: Nextflow, nf-core/eager, FastQC,
AdapterRemoval, fastP, BWA, Samtools, endorS.py, DeDup, Qualimap, PreSeq,
DamageProfiler, bamUtil, sexdeterrmine, angsd, MTNucRatioCalculator,
sequenceTools, and MultiQC.

Citations to all used tools can be seen
[here](https://nf-co.re/eager#tool-references)

##### Tutorial Human Pop-Gen - Files for Downstream Analysis

You will find the eigenstrat dataset containing the random draw genotypes of
your run in the `genotyping/` directory. Genotypes from double stranded
libraries, like the ones in this example, are found in the dataset
`pileupcaller.double.{geno,snp,ind}.txt`, while genotypes for any single
stranded libraries will instead be in `pileupcaller.single.{geno,snp,ind}.txt`.
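Before moving on to downstream analysis, it can be worth checking that the three eigenstrat files agree with each other. The following is a minimal sketch, assuming the standard eigenstrat layout (one line per individual in `.ind`, one line per SNP in `.snp`, and one line per SNP with one character per individual in `.geno`). The `check_eigenstrat` helper and the toy dataset are hypothetical and only there to make the example self-contained:

```bash
# Hypothetical helper: prints the number of individuals and SNPs, and fails
# if the .geno file does not match the .snp and .ind files.
check_eigenstrat() {
  local prefix="$1" n_ind n_snp n_geno width
  n_ind="$(awk 'END { print NR }' "${prefix}.ind.txt")"
  n_snp="$(awk 'END { print NR }' "${prefix}.snp.txt")"
  n_geno="$(awk 'END { print NR }' "${prefix}.geno.txt")"
  width="$(awk 'NR == 1 { print length($0) }' "${prefix}.geno.txt")"
  echo "individuals=${n_ind} snps=${n_snp}"
  [ "$n_snp" -eq "$n_geno" ] && [ "$width" -eq "$n_ind" ]
}

# Toy dataset (2 individuals, 3 SNPs) standing in for 'results/genotyping/'.
demo_dir="$(mktemp -d)"
printf 'rs1\t1\t0.0\t100\tA\tG\nrs2\t1\t0.0\t200\tC\tT\nrs3\t2\t0.0\t300\tG\tA\n' \
  > "$demo_dir/pileupcaller.double.snp.txt"
printf 'IND001\tM\tPopA\nIND002\tF\tPopA\n' > "$demo_dir/pileupcaller.double.ind.txt"
printf '20\n09\n92\n' > "$demo_dir/pileupcaller.double.geno.txt"

summary="$(check_eigenstrat "$demo_dir/pileupcaller.double")" && echo "OK: $summary"
```

On a real run, the prefix would be something like `results/genotyping/pileupcaller.double`.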

#### Tutorial Human Pop-Gen - Clean up

Finally, I would recommend cleaning up your `work/` directory of any
intermediate files (if your `-profile` does not already do so). You can do this
by going to the directory above your `results/` and `work/` directories, e.g.

```bash
cd /<path>/<to>/projectX_preprocessing20200727
```

and running

```bash
nextflow clean -f -k
```

#### Tutorial Human Pop-Gen - Summary

In this tutorial we have described an example of how to set up an
nf-core/eager run to preprocess human aDNA for population genetic studies,
perform some simple quality control checks, and generate random draw genotypes
for downstream analysis of the data. Additionally, we described what to look for
in the run summary report generated by MultiQC and where to find output files
that can be used for downstream analysis.

### Tutorial - How to set up nf-core/eager for metagenomic screening

#### Tutorial Metagenomics - Introduction

The field of archaeogenetics is now expanding from analysing the genomes of
single organisms to whole communities of taxa. One particular example is of
human associated microbiomes, as preserved in ancient palaeofaeces (gut) or
dental calculus (oral). This tutorial will give a basic example on how to set up
nf-core/eager to perform initial screening of samples in the context of ancient
microbiome research.

> :warning: This tutorial does not describe how to install and set up
> nf-core/eager. For this please see other documentation on the
> [nf-co.re](https://nf-co.re/usage/installation) website.

We will describe how to set up mapping of ancient dental calculus samples
against the human reference genome to allow sequencing and library
quality-control, but additionally perform taxonomic profiling of the off-target
reads from this mapping using MALT, and perform aDNA authentication with HOPS.

> :warning: Please be aware that this tutorial may not use settings, nor
> produce files, that you would actually use in a 'real' analysis. The
> settings are only specified for demonstration purposes. Please consult
> your colleagues, communities and the literature for optimal parameters.

#### Tutorial Metagenomics - Preparation

Prior to setting up an nf-core/eager run for metagenomic screening, we will need:

1. Raw sequencing data in FASTQ format
2. Reference genome in FASTA format, with associated pre-made `bwa`, `samtools`
   and `picard SequenceDictionary` indices
3. A MALT database of your choice (see [MALT
   manual](https://software-ab.informatik.uni-tuebingen.de/download/malt/manual.pdf)
   for set-up)
4. A list of (NCBI) taxa containing well-known taxa of your microbiome (see
   below)
5. HOPS resources `.map` and `.tre` files (available
   [here](https://github.com/rhuebler/HOPS/tree/external/Resources))

We should also ensure we have the very latest version of the nf-core/eager
pipeline so we have all the latest bugfixes etc. In this case we will be using
nf-core/eager version 2.2.0. You should always check on the
[nf-core](https://nf-co.re/eager) website whether a newer release has been made
(particularly point releases e.g. 2.2.1).

```bash
nextflow pull nf-core/eager -r 2.2.0
```

It is important to note that if you are planning on running multiple runs of
nf-core/eager for a given project, that the version should be **kept the same**
for all runs to ensure consistency in settings for all of your libraries.

#### Tutorial Metagenomics - Inputs and Outputs

To start, let's make a directory where all your nf-core/eager related files for
this run will go, and change into it.

```bash
mkdir projectX_screening20200720
cd projectX_screening20200720
```

The first part of constructing any nf-core/eager run is specifying a few generic
parameters that will often be common across all runs. These are which
pipeline, version and _profile_ we will use. We will also specify a unique name
for the run to help us keep track of all the nf-core/eager runs we may be
running.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
<...>
```

For the `-profile` parameter, I have indicated that I wish to use Singularity as
my software container environment, and I will use the MPI-SHH institutional
config as listed on
[nf-core/configs](https://github.com/nf-core/configs/blob/master/conf/shh.config).
These profiles specify settings
optimised for the specific cluster/institution, such as maximum memory available
or which scheduler queues to submit to. More explanations about configs and
profiles can be seen in the [nf-core
website](https://nf-co.re/usage/configuration) and the [profile
tutorial](#tutorial---what-are-profiles-and-how-to-use-them).

Next we need to specify our input data. nf-core/eager can accept input FASTQ
files in two main ways, either with direct paths to files (with wildcards), or
with a Tab-Separated-Value (TSV) file which contains the paths and extra
metadata. In this example, we will use the TSV method to simulate a
realistic use-case, such as receiving paired-end data from an Illumina NextSeq
of double-stranded libraries. Illumina NextSeqs sequence a given library across
four different 'lanes', so for each library you will receive four FASTQ files.
The TSV input method is more useful for this context, as it allows 'merging' of
these lanes after preprocessing and prior to mapping (whereas direct paths will
consider each pair of FASTQ files as independent libraries/samples).

Our TSV file will look something like the following:

```bash
Sample_Name     Library_ID      Lane    Colour_Chemistry        SeqType Organism        Strandedness    UDG_Treatment   R1      R2      BAM
EGR001  EGR001.B0101.SG1        1       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L001_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L001_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        2       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L002_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L002_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        3       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L003_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L003_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        4       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L004_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.1/EGR001.B0101.SG1.1_S0_L004_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        5       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L001_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L001_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        6       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L002_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L002_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        7       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L003_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L003_R2_001.fastq.gz NA
EGR001  EGR001.B0101.SG1        8       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L004_R1_001.fastq.gz ../../02-raw_data/EGR001.B0101.SG1.2/EGR001.B0101.SG1.2_S0_L004_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        1       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L001_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L001_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        2       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L002_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L002_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        3       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L003_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L003_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        4       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L004_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.1/EGR002.B0201.SG1.1_S0_L004_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        5       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L001_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L001_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        6       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L002_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L002_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        7       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L003_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L003_R2_001.fastq.gz NA
EGR002  EGR002.B0201.SG1        8       2       PE      homo_sapiens    double  half    ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L004_R1_001.fastq.gz ../../02-raw_data/EGR002.B0201.SG1.2/EGR002.B0201.SG1.2_S0_L004_R2_001.fastq.gz NA
```

You can see that we have a single line for each pair of FASTQ files representing
each `Lane`, but the `Sample_Name` and `Library_ID` columns identify and group
them together accordingly. Secondly, as we have NextSeq data, we have specified
we have `2` for `Colour_Chemistry`, which is important for downstream processing
(see below). The other columns are less important for this particular context of
metagenomic screening. See the nf-core/eager [parameters](https://nf-co.re/eager/parameters)
documentation for more specifications on how to set up a TSV file (e.g. why
despite NextSeqs only having 4 lanes, we go up to 8 in the example above).
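Since a mistyped row in the TSV file is easy to miss by eye, a few quick checks with `awk` can help before starting a run. This is only a sketch: it assumes the 11-column layout shown above and, as the lane numbering above suggests, that each `Library_ID`/`Lane` combination should appear only once. The shortened toy TSV and all variable names are illustrative:

```bash
# Toy TSV with the same columns as above (file paths shortened).
demo_tsv="$(mktemp)"
{
  printf 'Sample_Name\tLibrary_ID\tLane\tColour_Chemistry\tSeqType\tOrganism\tStrandedness\tUDG_Treatment\tR1\tR2\tBAM\n'
  printf 'EGR001\tEGR001.B0101.SG1\t1\t2\tPE\thomo_sapiens\tdouble\thalf\ta_R1.fastq.gz\ta_R2.fastq.gz\tNA\n'
  printf 'EGR001\tEGR001.B0101.SG1\t2\t2\tPE\thomo_sapiens\tdouble\thalf\tb_R1.fastq.gz\tb_R2.fastq.gz\tNA\n'
  printf 'EGR002\tEGR002.B0201.SG1\t1\t2\tPE\thomo_sapiens\tdouble\thalf\tc_R1.fastq.gz\tc_R2.fastq.gz\tNA\n'
} > "$demo_tsv"

# Number of rows that do not have exactly 11 tab-separated fields.
bad_rows="$(awk -F'\t' 'NF != 11 { n++ } END { print n + 0 }' "$demo_tsv")"

# Number of rows that reuse a Library_ID/Lane combination already seen.
dup_lanes="$(awk -F'\t' 'NR > 1 && seen[$2 FS $3]++ { n++ } END { print n + 0 }' "$demo_tsv")"

# Number of lanes (rows) per library, one library per line.
lanes_per_library="$(awk -F'\t' 'NR > 1 { n[$2]++ } END { for (l in n) print l "\t" n[l] }' "$demo_tsv" | sort)"

echo "rows with wrong column count: $bad_rows"
echo "duplicated library/lane rows: $dup_lanes"
echo "$lanes_per_library"
```

For a real run, replace `"$demo_tsv"` with the file passed to `--input`.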

Alongside our input TSV file, we will also specify the paths to our reference
FASTA file and the corresponding indices.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
<...>
```

We specify the paths to each reference genome and its corresponding tool
specific index. Paths should always be encapsulated in quotes to ensure Nextflow
evaluates them, rather than your shell! Also note that as `bwa` generates
multiple index files, nf-core/eager takes a _directory_ that must contain these
indices instead.
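To confirm that the directory given to `--bwa_index` is complete, one option is to look for the five files that `bwa index` writes (`.amb`, `.ann`, `.bwt`, `.pac` and `.sa`). The sketch below assumes the index files are named after the FASTA file (e.g. `GRCh38.fa.bwt`); the `missing_bwa_index_files` helper and the toy directory are hypothetical:

```bash
# Hypothetical helper: prints the name of every expected index file that is
# missing from the given directory (prints nothing if all are present).
missing_bwa_index_files() {
  local index_dir="$1" fasta_name="$2" ext
  for ext in amb ann bwt pac sa; do
    [ -e "${index_dir}/${fasta_name}.${ext}" ] || echo "${fasta_name}.${ext}"
  done
}

# Toy directory standing in for '../Reference/genome/GRCh38/', with the
# .sa file deliberately left out.
demo_index="$(mktemp -d)"
for ext in amb ann bwt pac; do
  touch "${demo_index}/GRCh38.fa.${ext}"
done

missing="$(missing_bwa_index_files "$demo_index" 'GRCh38.fa')"
echo "missing: ${missing:-none}"
```

An empty result means all five files were found.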

> Note the difference between single and double `-` parameters. Single-dash
> parameters are Nextflow flags, while double-dash parameters are nf-core/eager specific flags.

Finally, we can also specify the output directory and the Nextflow `work/`
directory (which contains 'intermediate' working files and directories).

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
<...>
```

#### Tutorial Metagenomics - Pipeline Configuration

Now that we have specified the input data, we can start moving onto specifying
settings for each different module we will be running. As mentioned above, we
are pretending to run with NextSeq data, which is generated with a two-colour
imaging technique. What this means is when you have shorter molecules than the
number of cycles of the sequencing chemistry, the sequencer will repeatedly see
'G' calls (no colour) at the last few cycles, and you get long poly-G 'tails' on
your reads. We therefore will turn on the poly-G clipping functionality offered
by [`fastp`](https://github.com/OpenGene/fastp), and any pairs of files
indicated in the TSV file as having `2` in the `Colour_Chemistry` column will be
passed to `fastp`. We will not change the default minimum length of a poly-G
string to be clipped.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
<...>
```
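If you are unsure whether poly-G tails are actually present in your data, you can count them in a FASTQ file first. The following sketch counts reads ending in a run of at least 10 `G` bases; the value of 10 is an assumption based on what we understand to be the `fastp` default minimum length, and the toy reads are made up for illustration:

```bash
# Toy FASTQ with three reads: read1 and read3 end in a long poly-G tail.
demo_fastq="$(mktemp)"
{
  printf '@read1\nACGTACGTACGTGGGGGGGGGGGG\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n'
  printf '@read2\nACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n'
  printf '@read3\nTTGACCAGGGGGGGGGGGG\n+\nIIIIIIIIIIIIIIIIIII\n'
} > "$demo_fastq"

# Sequence lines are every 4th line starting at line 2. A read is counted
# when its trailing run of G bases is at least `min` long.
polyg_summary="$(awk -v min=10 '
  NR % 4 == 2 {
    total++
    if (match($0, /G+$/) && RLENGTH >= min) tails++
  }
  END { printf "%d/%d\n", tails, total }
' "$demo_fastq")"

echo "reads with poly-G tail: $polyg_summary"
```

For gzipped input, the same `awk` program can read from `gzip -cd <file>.fastq.gz` through a pipe instead of from a file.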

We will keep the default settings for mapping etc. against the reference genome
as we will only use this for sequencing quality control, however we now need to
specify that we want to run metagenomic screening. To do this we firstly need to
tell nf-core/eager what to do with the off target reads from the mapping.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--run_bam_filtering \
--bam_unmapped_type 'fastq' \
<...>
```

nf-core/eager will now take all unmapped reads after mapping and convert the BAM
file back to FASTQ, which can be accepted by MALT. But of course, we also then
need to tell nf-core/eager we actually want to run MALT. We will also specify
the location of the [pre-built database](#tutorial-metagenomics---preparation) and which 'min support'
method we want to use (this specifies the minimum number of alignments needed
to a particular taxonomic node for it to be 'kept' in the MALT output files). Otherwise
we will keep all other parameters as default: for example, using BlastN mode,
requiring a minimum of 85% identity, and requiring at least 0.01% of alignments for a
taxon to be saved (when `--malt_min_support_mode` is set to `'percent'`). More
documentation describing each parameter can be seen in the usage
[documentation](usage.md).

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--run_bam_filtering \
--bam_unmapped_type 'fastq' \
--run_metagenomic_screening \
--metagenomic_tool 'malt' \
--database '../Reference/database/refseq-bac-arch-homo-2018_11' \
--malt_min_support_mode 'percent' \
<...>
```
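To get a feeling for what a percentage-based minimum support means in absolute numbers, the arithmetic can be sketched as below. This assumes the percentage is applied to the number of reads with alignments in a sample (please check the MALT manual for the exact definition); the `min_support_reads` helper and the read count are hypothetical:

```bash
# Hypothetical helper: converts a percentage cut-off into a read count.
# $1 = number of reads the percentage is applied to, $2 = percentage.
min_support_reads() {
  awk -v n="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", n * pct / 100 }'
}

# With 2,000,000 reads and a 0.01% cut-off, a taxon would need 200 reads.
min_support_reads 2000000 0.01
```

Under this assumption, deeper sequenced libraries require more reads per taxon for it to be retained, which is one reason to compare samples of similar depth.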

Finally, to help quickly assess whether our samples have taxa that are known to
exist in (modern samples of) our expected microbiome, and that these alignments
have indicators of true aDNA, we will run 'maltExtract' of the
[HOPS](https://github.com/rhuebler/HOPS) pipeline.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_screening20200720' \
--input 'screening20200720.tsv' \
--fasta '../Reference/genome/GRCh38.fa' \
--bwa_index '../Reference/genome/GRCh38/' \
--fasta_index '../Reference/genome/GRCh38.fa.fai' \
--seq_dict '../Reference/genome/GRCh38.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--run_bam_filtering \
--bam_unmapped_type 'fastq' \
--run_metagenomic_screening \
--metagenomic_tool 'malt' \
--database '../Reference/database/refseq-bac-arch-homo-2018_11' \
--malt_min_support_mode 'percent' \
--run_maltextract \
--maltextract_taxon_list '../Reference/taxa_list/core_genera-anthropoids_hominids_panhomo-20180131.txt' \
--maltextract_ncbifiles '../Reference/hops' \
--maltextract_destackingoff
```

In the last parameters above we've specified the path to our list of taxa. This
contains something like (for oral microbiomes):

```text
Actinomyces
Streptococcus
Tannerella
Porphyromonas
```

We have also specified the path to the HOPS resources [downloaded
earlier](#tutorial-metagenomics---preparation), and that I want to turn off 'destacking' (removal of any
read that overlaps the positions of another - something only recommended to keep
on when you have high coverage data).
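The taxon list itself is a plain text file with one taxon name per line. As a small sketch, it could be written and sanity-checked as follows; the temporary file stands in for the path given to `--maltextract_taxon_list`, and the checks for blank or duplicated lines are our own suggestion rather than a requirement stated by HOPS:

```bash
# Write the example taxon list, one name per line.
taxon_list="$(mktemp)"
cat > "$taxon_list" <<'EOF'
Actinomyces
Streptococcus
Tannerella
Porphyromonas
EOF

# Count non-blank lines (taxa), blank lines, and names listed more than once.
n_taxa="$(awk 'NF { n++ } END { print n + 0 }' "$taxon_list")"
n_blank="$(awk '!NF { n++ } END { print n + 0 }' "$taxon_list")"
n_dups="$(sort "$taxon_list" | uniq -d | awk 'END { print NR }')"

echo "taxa=${n_taxa} blank_lines=${n_blank} duplicated_names=${n_dups}"
```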

With this, we are ready to submit! If running on a remote cluster/server, make
sure to run this in a `screen` session or similar, so that if you get an `ssh`
signal drop or want to log off, Nextflow will not crash.

#### Tutorial Metagenomics - Results

Assuming the run completed without any crashes (if problems do occur, check
against the [parameters](https://nf-co.re/eager/parameters) documentation that all parameters are as expected, or check
the [FAQ](#troubleshooting-and-faqs)), we can now check our results in
`results/`.

##### Tutorial Metagenomics - MultiQC Report

In here there are many different directories containing different output files.
The first directory to check is the `MultiQC/` directory. You should
find a `multiqc_report.html` file. You will need to view this in a web browser,
so I recommend either mounting your server to your file browser, or downloading
it to your own local machine (PC/Laptop etc.).

Once you've opened this you can go through each section and evaluate all the
results. You will likely not want to concern yourself too much with anything
after MALT - however you should check these for other artefacts (e.g. weird
damage patterns on the human DNA, or weirdly skewed coverage distributions).

For example, I normally look for things like:

General Stats Table:

* Do I see the expected number of raw sequencing reads (summed across each set
  of FASTQ files per library) that was requested for sequencing?
* Does the percentage of trimmed reads look normal for aDNA, and do lengths
  after trimming look short as expected of aDNA?
* Does ClusterFactor or 'Dups' look high, suggesting over-amplified or
  badly preserved samples (e.g. >2 or >10% respectively; however,
  given this is on the human reads this is just a rule of thumb and may not
  reflect the quality of the metagenomic profile)?
* Does the human DNA show increased frequency of C>Ts on the 5' end of
  molecules?

FastQC (pre-AdapterRemoval):

* Do I see any very early drop off of sequence quality scores suggesting a
  problematic sequencing run?
* Do I see outlier GC content distributions?
* Do I see high sequence duplication levels?

AdapterRemoval:

* Do I see high numbers of singletons or discarded read pairs?

FastQC (post-AdapterRemoval):

* Do I see improved sequence quality scores along the length of reads?
* Do I see reduced adapter content levels?

MALT:

* Do I have a reasonable level of mappability?
  * Somewhere between 10-30% can be pretty normal for aDNA, whereas e.g. <1%
    requires careful manual assessment
* Do I have a reasonable taxonomic assignment success?
  * You hope to have a large number of the mapped reads (from the mappability
    plot) that also have taxonomic assignment.

Samtools Flagstat (pre/post Filter):

* Do I see outliers, e.g. with unusually high levels of human DNA, (indicative
  of contamination) that require downstream closer assessment?

DeDup/Picard MarkDuplicates:

* Do I see large numbers of duplicates being removed, possibly indicating
  over-amplified or badly preserved samples?

DamageProfiler:

* Do I see evidence of damage on human DNA? Note this is just a
  rule-of-thumb/corroboration of any signals you might find in the metagenomic
  screening and not essential.
  * High numbers of human DNA reads but no damage may indicate
    significant modern contamination.

> Detailed documentation and descriptions for all MultiQC modules can be seen in
> the 'Documentation' folder of the results directory or here in the [output
> documentation](output.md)

If you're happy everything looks good in terms of sequencing, we then look at
specific directories to find any files you might want to use for downstream
processing.

Note that when you get back to writing up your publication, all the versions of
the tools can be found under the 'nf-core/eager Software Versions' section of
the MultiQC report. Note that all tools in the container are listed, so you may
have to remove some of them that you didn't actually use in the set up.

For example, in the example above, we have used: Nextflow, nf-core/eager,
FastQC, AdapterRemoval, fastP, BWA, Samtools, endorS.py, Picard Markduplicates,
Qualimap, PreSeq, DamageProfiler, MALT, MaltExtract and MultiQC.

Citations to all used tools can be seen
[here](https://nf-co.re/eager#tool-references)

##### Tutorial Metagenomics - Files for Downstream Analysis

If you wanted to look at the output of MALT more closely, such as in the GUI
based tool
[MEGAN6](https://software-ab.informatik.uni-tuebingen.de/download/megan6/welcome.html),
you can find the `.rma6` files that are accepted by MEGAN under
`metagenomic_classification/malt/`. The log file containing the information
printed to screen while MALT is running can also be found in this directory.

As we ran the HOPS pipeline (primarily the MaltExtract tool), we can look in
`MaltExtract/results/` to find all the corresponding output files for the
authentication validation of the metagenomic screening (against the taxa you
specified in your `--maltextract_taxon_list`). First you can check the
`heatmap_overview_Wevid.pdf` summary PDF from HOPS (again you will need to
either mount the server or download), but to get the actual per-sample/taxon
damage patterns etc., you can look in `pdf_candidate_profiles`. In some cases
there may be valid results that the HOPS 'postprocessing' script doesn't pick up.
In these cases you can go into the `default` directory to find all the raw text
files which you can use to visualise and assess the authentication results
yourself.

Finally, if you want to re-run the taxonomic classification with a new database
or tool, you can find the raw FASTQ files containing only the unmapped reads
that went into MALT in `samtools/filter`. In here you will find files
ending in `unmapped.fastq.gz` for each library.
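To see how many off-target reads each library contributed, you can count the records in these files (a FASTQ record is four lines). A minimal sketch, in which the `count_fastq_reads` helper and the toy file are hypothetical and the real files would sit under `results/samtools/filter/`:

```bash
# Hypothetical helper: number of reads in a gzipped FASTQ file.
count_fastq_reads() {
  gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Toy file with two reads standing in for a real unmapped-reads file.
demo_dir="$(mktemp -d)"
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > "$demo_dir/LIB001.unmapped.fastq.gz"

# Print one line per library: file name and read count.
for fq in "$demo_dir"/*unmapped.fastq.gz; do
  printf '%s\t%s\n' "$(basename "$fq")" "$(count_fastq_reads "$fq")"
done
```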

#### Tutorial Metagenomics - Clean up

Finally, I would recommend cleaning up your `work/` directory of any
intermediate files (if your `-profile` does not already do so). You can do this
by going to the directory above your `results/` and `work/` directories, e.g.

```bash
cd /<path>/<to>/projectX_screening20200720
```

and running

```bash
nextflow clean -f -k
```

#### Tutorial Metagenomics - Summary

In this tutorial we have described an example of how to set up a
metagenomic screening run of ancient microbiome samples. We have covered how to
set up nf-core/eager to extract off-target reads in a form that can be used for
MALT, and how to additionally run HOPS to authenticate expected taxa to be found
in the human oral microbiome. Finally we have also described what to look for in
the MultiQC run summary report and where to find output files that can be used
for downstream analysis.

### Tutorial - How to set up nf-core/eager for pathogen genomics

#### Tutorial Pathogen Genomics - Introduction

This tutorial will give a basic example on how to set up nf-core/eager to
perform bacterial genome reconstruction from samples in the context of ancient
pathogenomic research.

> :warning: This tutorial does not describe how to install and set up
> nf-core/eager. For this, please see the other documentation on the
> [nf-co.re](https://nf-co.re/usage/installation) website.

We will describe how to set up mapping of ancient pathogen samples against the
reference genome of a targeted organism, how to check sequencing and library
quality control, and how to calculate depth and breadth of coverage, check for
damage profiles, generate feature-annotation statistics (e.g. for gene presence
and absence), call SNPs, and produce an SNP alignment for use in downstream
phylogenetic analysis.

I will use as an example data from [Andrades Valtueña et al.
2017](https://doi.org/10.1016/j.cub.2017.10.025), who retrieved Late Neolithic/Bronze
Age _Yersinia pestis_ genomes. This data is **very large shotgun data** and is
not ideal for testing: running it will require a lot of computing resources and
time, so running on your own data is recommended. However, note that the same
procedure can equally be applied to shallow-shotgun and whole-genome enrichment
data, so other than the TSV file you can reuse the command explained below.

> :warning: Please be aware that the settings used in this tutorial may not be
> the settings, nor produce the files, you would actually use in 'real' analysis.
> The settings are only specified for demonstration purposes. Please consult
> your colleagues, communities and the literature for optimal parameters.

#### Tutorial Pathogen Genomics - Preparation

Prior to setting up the nf-core/eager run, we will need:

1. Raw sequencing data in FASTQ format
2. Reference genome in FASTA format, with associated pre-made `bwa`, `samtools`
   and `picard SequenceDictionary` indices (however note these can be made for
   you with nf-core/eager, but this can make a pipeline run take much longer!)
3. A GFF file of gene sequence annotations (normally supplied with reference
   genomes downloaded from NCBI Genomes, in this context from
   [here](https://www.ncbi.nlm.nih.gov/genome/?term=Yersinia+pestis))
4. [Optional] Previously made GATK 3.5 VCF files (see below for settings) of
   previously published _Y. pestis_ genomes.

We should also ensure we have the very latest version of the nf-core/eager
pipeline so we have all the latest bugfixes etc. In this case we will be using
nf-core/eager version 2.2.0. You should always check on the
[nf-core](https://nf-co.re/eager) website whether a newer release has been made
(particularly point releases e.g. 2.2.1).

```bash
nextflow pull nf-core/eager -r 2.2.0
```

It is important to note that if you are planning on running multiple runs of
nf-core/eager for a given project, the version should be **kept the same**
for all runs to ensure consistency in settings for all of your libraries.

#### Tutorial Pathogen Genomics - Inputs and Outputs

To start, let's make a directory where all your nf-core/eager related files for
this run will go, and change into it.

```bash
mkdir projectX_preprocessing20200727
cd projectX_preprocessing20200727
```

The first part of constructing any nf-core/eager run is specifying a few generic
parameters that will often be common across all runs. These are the
pipeline, version and _profile_ we will use. We will also specify a unique name
for the run to help keep track of all the nf-core/eager runs you may be
running.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
<...>
```

For the `-profile` parameter, I have indicated that I wish to use Singularity as
my software container environment, and I will use the MPI-SHH institutional
config as listed on
[nf-core/configs](https://github.com/nf-core/configs/blob/master/conf/shh.config).
These profiles specify settings
optimised for the specific cluster/institution, such as maximum memory available
or which scheduler queues to submit to. More explanations about configs and
profiles can be seen on the [nf-core
website](https://nf-co.re/usage/configuration) and the [profile
tutorial](#tutorial---what-are-profiles-and-how-to-use-them).

Next we need to specify our input data. nf-core/eager can accept input FASTQ
files in two main ways, either with direct paths to files (with wildcards), or
with a Tab-Separated Value (TSV) file which contains the paths and extra
metadata. In this example, we will use the TSV method to simulate a
realistic use-case, such as receiving both single-end and paired-end data from
Illumina NextSeq _and_ Illumina HiSeq runs of double-stranded libraries. Illumina
NextSeqs sequence a given library across four different 'lanes', so for each
library you will receive four FASTQ files. Sometimes samples will also be
sequenced across multiple HiSeq lanes to maintain complexity and improve the
imaging of base calls. The TSV input method is more useful in this context, as
it allows 'merging' of these lanes after preprocessing and prior to mapping
(whereas direct paths will consider each pair of FASTQ files as independent
libraries/samples).

```bash
Sample_Name   Library_ID    Lane    Colour_Chemistry    SeqType   Organism    Strandedness    UDG_Treatment   R1    R2    BAM
KunilaII    KunilaII_nonUDG   4   4   PE    Yersinia pestis   double    none    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2112547/ERR2112547_1.fastq.gz    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2112547/ERR2112547_2.fastq.gz    NA
KunilaII    KunilaII_UDG    4   4   PE    Yersinia pestis   double    full    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2112548/ERR2112548_1.fastq.gz    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2112548/ERR2112548_2.fastq.gz    NA
6Post   6Post_PE    1   2   PE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2112549/ERR2112549_1.fastq.gz    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2112549/ERR2112549_2.fastq.gz    NA
6Post   6Post_PE    2   2   PE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2112550/ERR2112550_1.fastq.gz    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2112550/ERR2112550_2.fastq.gz    NA
6Post   6Post_PE    3   2   PE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2112551/ERR2112551_1.fastq.gz    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2112551/ERR2112551_2.fastq.gz    NA
6Post   6Post_PE    4   2   PE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2112552/ERR2112552_1.fastq.gz    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2112552/ERR2112552_2.fastq.gz    NA
6Post   6Post_SE    1   4   SE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2112569/ERR2112569.fastq.gz    NA    NA
6Post   6Post_SE    2   4   SE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2112570/ERR2112570.fastq.gz    NA    NA
6Post   6Post_SE    3   4   SE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2112571/ERR2112571.fastq.gz    NA    NA
6Post   6Post_SE    4   4   SE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2112572/ERR2112572.fastq.gz    NA    NA
6Post   6Post_SE    8   4   SE    Yersinia pestis   double    half    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/003/ERR2112573/ERR2112573.fastq.gz    NA    NA
```

> Note we also have a mixture of non-UDG and half-UDG treated libraries.

You can see that we have a single line for each set of FASTQ files representing
each `Lane`, but the `Sample_Name` and `Library_ID` columns identify and group
them together accordingly. Secondly, as we have NextSeq data, we have specified
`2` for `Colour_Chemistry` (vs. `4` for HiSeq), which is important
for downstream processing (see below). See the nf-core/eager
parameter documentation above for more specifications on how to set up a TSV
file (e.g. why despite NextSeqs only having 4 lanes, we can also go up to 8 or
more when having a sample sequenced on two NextSeq runs).
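
Before launching a run it can be worth sanity-checking the TSV. The following is a minimal sketch, assuming the 11-column layout shown above: it flags rows with the wrong number of tab-separated fields and counts how many lanes each `Library_ID` has. The demo rows and FASTQ file names are hypothetical, and this is not the validation that nf-core/eager itself performs.

```bash
# Small demo TSV (hypothetical file names) with the 11 columns used above
tsv=$(mktemp)
{
  printf 'Sample_Name\tLibrary_ID\tLane\tColour_Chemistry\tSeqType\tOrganism\tStrandedness\tUDG_Treatment\tR1\tR2\tBAM\n'
  printf '6Post\t6Post_PE\t1\t2\tPE\tYersinia pestis\tdouble\thalf\tL1_R1.fastq.gz\tL1_R2.fastq.gz\tNA\n'
  printf '6Post\t6Post_PE\t2\t2\tPE\tYersinia pestis\tdouble\thalf\tL2_R1.fastq.gz\tL2_R2.fastq.gz\tNA\n'
  printf '6Post\t6Post_SE\t1\t4\tSE\tYersinia pestis\tdouble\thalf\tL1.fastq.gz\tNA\tNA\n'
} > "$tsv"

# Count rows (header included) that do not have exactly 11 tab-separated fields
bad_rows=$(awk -F'\t' 'NF != 11 { n++ } END { print n + 0 }' "$tsv")

# Count lanes per library, skipping the header line
lane_counts=$(awk -F'\t' 'NR > 1 { lanes[$2]++ } END { for (lib in lanes) print lib "\t" lanes[lib] }' "$tsv" | sort)

echo "Rows with wrong column count: $bad_rows"
printf '%s\n' "$lane_counts"
```

A non-zero count of bad rows usually points to spaces being used instead of tabs, or to a missing `NA` in one of the last three columns.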

Alongside our input TSV file, we will also specify the paths to our reference
FASTA file and the corresponding indices.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
<...>
```

We specify the paths to the reference genome and its corresponding tool-specific
indices. Paths should always be encapsulated in quotes to ensure Nextflow
evaluates them, rather than your shell! Also note that as `bwa` generates
multiple index files, nf-core/eager takes a _directory_ that must contain these
indices instead.

> Note the difference between single and double `-` parameters. The former
> represent Nextflow flags, while the latter are nf-core/eager specific flags.
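
If you supply pre-made indices, a quick check that they all exist can save a failed submission. The sketch below assumes the usual `bwa index` outputs (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`) plus the `.fai` and `.dict` files, all named by appending the extension to the FASTA file name as in the paths above. The helper name `missing_indices` and the demo files are hypothetical.

```bash
# Print the path of every expected index file that is missing next to a FASTA
missing_indices() {
    fasta=$1
    for ext in amb ann bwt pac sa fai dict; do
        [ -e "$fasta.$ext" ] || echo "$fasta.$ext"
    done
}

# Demo with empty placeholder files; the .dict file is deliberately left out
ref_dir=$(mktemp -d)
touch "$ref_dir/genome.fa"
for ext in amb ann bwt pac sa fai; do
    touch "$ref_dir/genome.fa.$ext"
done

missing=$(missing_indices "$ref_dir/genome.fa")
echo "Missing index files: ${missing:-none}"
```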

Finally, we can also specify the output directory and the Nextflow `work/`
directory (which contains 'intermediate' working files and directories).

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
<...>
```

#### Tutorial Pathogen Genomics - Pipeline Configuration

Now that we have specified the input data, we can start moving onto specifying
settings for each different module we will be running. As mentioned above, some
of our samples were generated as NextSeq data, which is generated with a
two-colour imaging technique. What this means is when you have shorter molecules
than the number of cycles of the sequencing chemistry, the sequencer will
repeatedly see 'G' calls (no colour) at the last few cycles, and you get long
poly-G 'tails' on your reads. We therefore will turn on the poly-G clipping
functionality offered by [`fastp`](https://github.com/OpenGene/fastp), and any
pairs of files indicated in the TSV file as having `2` in the `Colour_Chemistry`
column will be passed to `fastp` (the HiSeq data will not). We will not change
the default minimum length of a poly-G string to be clipped.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
<...>
```
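
To get a feel for how common poly-G tails are in a library, you can count reads that end in a long run of `G`s. This is a rough sketch only: it assumes a plain four-line-per-record FASTQ, the reads are made up, and the minimum tail length of 10 is an illustrative choice rather than a value taken from this tutorial. It is not how `fastp` itself detects tails.

```bash
# Tiny hypothetical FASTQ: read1 has a 12 bp poly-G tail, read3 only a 5 bp one
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1
ACGTACGTACGTGGGGGGGGGGGG
+
IIIIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIIIIIII
@read3
TTGACCAGGGGG
+
IIIIIIIIIIII
EOF

# Illustrative minimum tail length (an assumption for this sketch)
min_len=10

# Sequence lines are every 4th line starting at line 2; count those whose
# trailing run of G is at least min_len long
polyg_reads=$(awk -v n="$min_len" '
    NR % 4 == 2 { if (match($0, /G+$/) && RLENGTH >= n) c++ }
    END { print c + 0 }' "$fastq")

echo "Reads with a poly-G tail of >= $min_len bp: $polyg_reads"
```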

We then need to specify the mapping parameters for this run. Typically, to
account for damage of very old aDNA libraries and also sometimes for
evolutionary divergence of the ancient genome to the modern reference, we should
relax the mapping thresholds that specify how many mismatches a read can have
from the reference to be considered 'mapped'. We will also speed up the seeding
step of the seed-and-extend approach by specifying the length of the seed. We
will do this with `--bwaalnn` and `--bwaalnl` respectively.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
<...>
```

As we are also interested in checking for gene presence/absence (see below), we
will ensure no mapping quality filter is applied (to account for gene
duplication that may cause a read to map equally well to two places) by setting the
threshold to 0. In addition, we will discard unmapped reads to reduce our
hard-drive footprint.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
<...>
```

While some of our input data is paired-end, we will keep the default of
Picard's MarkDuplicates for duplicate removal, as DeDup takes into account
both the start and end of a _merged_ read before flagging it as a duplicate -
something that isn't valid for a single-end read (where the true end of the
molecule might not have been sequenced). We can specify which dedupper we
want to use with `--dedupper`. While we are using the default (which does not
need to be directly specified), we will put it explicitly in our command for
clarity.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
<...>
```

Alongside making a SNP table for downstream phylogenetic analysis (we will get
to this in a bit), you may be interested in generating some summary statistics
of annotated parts of your reference genome, e.g. to see whether certain
virulence factors are present or absent. nf-core/eager offers some basic
statistics (percent and depth coverage) of these via Bedtools. We will
therefore turn on this module and specify the GFF file we downloaded alongside
our reference FASTA. Note that this GFF file has a _lot_ of redundant data, so
often a custom BED file with just genes of interest is recommended.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
--run_bedtools_coverage \
--anno_file '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.gff' \
<...>
```
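
If you would rather follow the recommendation above and use a custom BED file, one possible approach is to keep only the `gene` features of the GFF. The sketch below assumes a standard nine-column GFF3 with a `Name=` attribute. The contig and gene names are hypothetical, and you would still need to filter down to your genes of interest.

```bash
# Tiny hypothetical GFF3 with one region, two genes and one CDS
gff=$(mktemp)
{
  printf '##gff-version 3\n'
  printf 'chr_demo\tdemo\tregion\t1\t5000\t.\t+\t.\tID=chr_demo\n'
  printf 'chr_demo\tdemo\tgene\t21\t344\t.\t+\t.\tID=gene1;Name=geneA\n'
  printf 'chr_demo\tdemo\tCDS\t21\t344\t.\t+\t0\tID=cds1;Parent=gene1\n'
  printf 'chr_demo\tdemo\tgene\t500\t1200\t.\t-\t.\tID=gene2;Name=geneB\n'
} > "$gff"

# Keep only gene features; BED starts are 0-based, so subtract 1 from the GFF start.
# If an entry has no Name= attribute, the text before the first ';' is used instead.
bed=$(mktemp)
awk -F'\t' -v OFS='\t' '
    /^#/ { next }
    $3 == "gene" {
        name = $9
        sub(/^.*Name=/, "", name)
        sub(/;.*$/, "", name)
        print $1, $4 - 1, $5, name, ".", $7
    }' "$gff" > "$bed"

cat "$bed"
```

The resulting file can then be given to `--anno_file` in place of the GFF.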

Next, we will set up trimming of the mapped reads to alleviate the effects of
DNA damage during genotyping. To do this we will activate trimming with
`--run_trim_bam`. The libraries in this example underwent either no or
'half'-UDG treatment. The latter will generally restrict all remaining DNA
damage to the first 2 base pairs of a fragment. We will therefore use
`--bamutils_clip_double_stranded_half_udg_left` and
`--bamutils_clip_double_stranded_half_udg_right` to trim 2 bp on either side of
each fragment. For the non-UDG treated libraries we can trim a little more to
remove most damage with the `--bamutils_clip_double_stranded_none_udg_<*>`
variants of the flag. Note that there is a tendency in ancient pathogenomics to
trim damage _prior_ to mapping, as it allows mapping with stricter parameters to
improve removal of reads deriving from potentially evolutionarily diverged
contaminants (this can be done in nf-core/eager with the Bowtie2 aligner); however,
we do BAM trimming instead here as another demonstration of functionality.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
--run_bedtools_coverage \
--anno_file '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.gff' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--bamutils_clip_double_stranded_none_udg_left 3 \
--bamutils_clip_double_stranded_none_udg_right 3 \
<...>
```
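
As a toy illustration of what these clip values mean for a single read, the sketch below masks the first and last few bases of a sequence with `N`. It is a string-level example only: the helper `mask_ends` and the read sequence are hypothetical, and masking with `N` is an assumption made for illustration, not a description of how the pipeline rewrites BAM records.

```bash
# Replace the first `left` and last `right` bases of a sequence with N
mask_ends() {
    awk -v s="$1" -v l="$2" -v r="$3" 'BEGIN {
        n = length(s)
        out = ""
        for (i = 1; i <= n; i++)
            out = out ((i <= l || i > n - r) ? "N" : substr(s, i, 1))
        print out
    }'
}

# The same 16 bp read trimmed as a half-UDG (2 bp) and a non-UDG (3 bp) library
half_udg=$(mask_ends "TTGCACGTACGTCCAA" 2 2)
non_udg=$(mask_ends "TTGCACGTACGTCCAA" 3 3)

echo "half-UDG: $half_udg"
echo "non-UDG:  $non_udg"
```

The terminal positions, where damage-derived C>T and G>A substitutions concentrate, no longer contribute bases to genotyping.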

Here we will use MultiVCFAnalyzer for the generation of our SNP table. A
MultiVCFAnalyzer SNP table allows downstream assessment of the level of
multi-allelic positions, something that is not expected when dealing with a
haploid organism, and which may thus reflect cross-mapping from multiple strains,
environmental relatives or other contaminants.

For this we need to run genotyping, but specifically with GATK UnifiedGenotyper
3.5 (as MultiVCFAnalyzer requires this particular format of VCF files). We will
therefore turn on genotyping, and
check that ploidy is set to 2 so 'heterozygous' positions can be reported. We will
also need to specify that we want to use the trimmed BAMs from the previous step.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--bamutils_clip_double_stranded_none_udg_left 3 \
--bamutils_clip_double_stranded_none_udg_right 3 \
--run_bedtools_coverage \
--anno_file '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.gff' \
--run_genotyping \
--genotyping_tool 'ug' \
--genotyping_source 'trimmed' \
--gatk_ploidy 2 \
--gatk_ug_mode 'EMIT_ALL_SITES' \
--gatk_ug_genotype_model 'SNP' \
<...>
```

Finally we can set up MultiVCFAnalyzer itself. First we want to make sure we
specify that we want to report the frequency of the given called allele at
each position so we can assess cross-mapping. Then, often with ancient
pathogens, such as _Y. pestis_, we also want to include in the SNP table
comparative data from previously published and ancient genomes. For this we
specify additional VCF files that have been generated in previous runs with the
same settings and reference genome. We can do this as follows.

```bash
nextflow run nf-core/eager \
-r 2.2.0 \
-profile singularity,shh \
-name 'projectX_preprocessing20200727' \
--input 'preprocessing20200727.tsv' \
--fasta '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa' \
--bwa_index '../Reference/genome/' \
--fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
--seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
--outdir './results/' \
-w './work/' \
--complexity_filter_poly_g \
--bwaalnn 0.01 \
--bwaalnl 16 \
--run_bam_filtering \
--bam_mapping_quality_threshold 0 \
--bam_unmapped_type 'discard' \
--dedupper 'markduplicates' \
--run_trim_bam \
--bamutils_clip_double_stranded_half_udg_left 2 \
--bamutils_clip_double_stranded_half_udg_right 2 \
--bamutils_clip_double_stranded_none_udg_left 3 \
--bamutils_clip_double_stranded_none_udg_right 3 \
--run_bedtools_coverage \
--anno_file '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.gff' \
--run_genotyping \
--genotyping_tool 'ug' \
--genotyping_source 'trimmed' \
--gatk_ploidy 2 \
--gatk_ug_mode 'EMIT_ALL_SITES' \
--gatk_ug_genotype_model 'SNP' \
--run_multivcfanalyzer \
--write_allele_frequencies \
--min_base_coverage 5 \
--min_allele_freq_hom 0.9 \
--min_allele_freq_het 0.1 \
--additional_vcf_files '../vcfs/*.vcf.gz'
```

For the two `min_allele_freq` parameters we specify that anything above 90%
frequency is considered 'homozygous', and anything above 10% (but below 90%) is
considered an ambiguous call and the frequency will be reported. Note that you
would not normally use this SNP table with these parameters for downstream
phylogenetic analysis, as the table will include ambiguous IUPAC codes, making
it only useful for fine-tooth-comb checking of multi-allelic positions. Instead, set
both parameters to the same value (e.g. 0.8) and use that table for downstream
phylogenetic analysis.
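
The effect of the two thresholds can be sketched as a simple three-way decision on the allele frequency at a position. This is a simplified illustration, not MultiVCFAnalyzer's actual logic: the helper `classify_call`, the labels, the inclusive boundaries and the single `below_threshold` class are all assumptions.

```bash
# Classify an allele frequency given the 'hom' and 'het' minimum frequencies
classify_call() {
    awk -v f="$1" -v hom="$2" -v het="$3" 'BEGIN {
        f += 0; hom += 0; het += 0
        if (f >= hom)      print "homozygous"
        else if (f >= het) print "ambiguous"
        else               print "below_threshold"
    }'
}

# Thresholds from the command above (0.9 and 0.1)
classify_call 0.95 0.9 0.1
classify_call 0.50 0.9 0.1
classify_call 0.05 0.9 0.1

# With both thresholds equal (e.g. 0.8) the 'ambiguous' class can never occur
classify_call 0.50 0.8 0.8
```

Setting both values to the same number removes the middle band entirely, which is why that configuration avoids ambiguous IUPAC codes in the table.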

With this, we are ready to submit! If running on a remote cluster/server, make
sure to run this in a `screen` session or similar, so that if you get an `ssh`
signal drop or want to log off, Nextflow will not crash.

#### Tutorial Pathogen Genomics - Results

Assuming the run completed without any crashes (if problems do occur, check
against the [parameters](https://nf-co.re/eager/parameters) documentation that all parameters are as expected, or
check the [FAQ](#troubleshooting-and-faqs)), we can now check our results in
`results/`.

##### Tutorial Pathogen Genomics - MultiQC Report

In here there are many different directories containing different output files.
The first directory to check is the `MultiQC/` directory. You should
find a `multiqc_report.html` file. You will need to view this in a web browser,
so I recommend either mounting your server to your file browser, or downloading
it to your own local machine (PC/Laptop etc.).

Once you've opened this you can go through each section and evaluate all the
results. For example, I normally look for things like:

General Stats Table:

* Do I see the expected number of raw sequencing reads (summed across each set
  of FASTQ files per library) that was requested for sequencing?
* Does the percentage of trimmed reads look normal for aDNA, and do lengths
  after trimming look short as expected of aDNA?
* Do the Endogenous DNA (%) columns look reasonable (high enough to indicate
  you have received enough coverage for downstream analysis, and/or do you lose
  an unusually high number of reads after filtering)?
* Does ClusterFactor or '% Dups' look high (e.g. >2 or >10% respectively; high
  values suggest over-amplified or badly preserved samples, i.e. low
  complexity; note that genome-enrichment libraries may by their nature look
  higher)?
* Do you see an increased frequency of C>Ts on the 5' end of molecules in the
  mapped reads?
* Do median read lengths look relatively low (normally <= 100 bp) indicating
  typically fragmented aDNA?
* Does the % coverage decrease relatively gradually at each depth of coverage,
  rather than dropping drastically?
* Do the median coverage and percent >3x (or whatever you set) show sufficient
  coverage for reliable SNP calls, and is a good proportion of the genome
  covered, indicating you have the right reference genome?
* Do you see a high proportion of % Hets, indicating many multi-allelic sites
  (and possibly presence of cross-mapping from other species, that may lead to
  false positive or less confident SNP calls)?
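
For the duplication check in the list above, the two numbers can be related with some simple arithmetic. The sketch below assumes the cluster factor is the number of mapped reads before deduplication divided by the number after, and that '% Dups' is the removed fraction of the pre-deduplication reads. The helper name and the read counts are hypothetical, and MultiQC may compute its columns slightly differently.

```bash
# Print cluster factor and percent duplicates (tab-separated) from read counts
dedup_stats() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (before <= 0 || after <= 0) { print "NA\tNA"; exit }
        printf "%.2f\t%.1f\n", before / after, 100 * (before - after) / before
    }'
}

# Hypothetical library: 1.2 million mapped reads, 1.0 million left after deduplication
stats=$(dedup_stats 1200000 1000000)
printf 'ClusterFactor\tPercentDups\n%s\n' "$stats"
```

Here the cluster factor (1.20) is well under 2 while the duplication rate (16.7%) is above 10%, so the two rules of thumb do not always agree and are best read together.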

FastQC (pre-AdapterRemoval):

* Do I see any very early drop off of sequence quality scores suggesting
  problematic sequencing run?
* Do I see outlier GC content distributions?
* Do I see high sequence duplication levels?

AdapterRemoval:

* Do I see high numbers of singletons or discarded read pairs?

FastQC (post-AdapterRemoval):

* Do I see improved sequence quality scores along the length of reads?
* Do I see reduced adapter content levels?

Samtools Flagstat (pre/post Filter):

* Do I see outliers, e.g. with unusually low levels of mapped reads (indicative
  of badly preserved samples), that require closer downstream assessment?

DeDup/Picard MarkDuplicates:

* Do I see large numbers of duplicates being removed, possibly indicating
  over-amplified or badly preserved samples?

PreSeq:

* Do I see a large drop off of a sample's curve away from the theoretical
  complexity? If so, this may indicate it's not worth performing deeper
  sequencing as you will get few unique reads (vs. duplicates that are not any
  more informative than the reads you've already sequenced)

DamageProfiler:

* Do I see evidence of damage on the microbial DNA (i.e. a % C>T of more than ~5% in
  the first few nucleotide positions)? If not, possibly your mapped
  reads are deriving from modern contamination.

QualiMap:

* Do you see a peak of coverage (X) at a good level, e.g. >= 3x, indicating
  sufficient coverage for reliable SNP calls?

MultiVCFAnalyzer:

* Do I have a good number of called SNPs that suggest the samples have genomes
  with sufficient nucleotide diversity to inform phylogenetic analysis?
* Do you have a large number of discarded SNP calls?
* Are the % Hets very high, indicating possible cross-mapping from off-target
  organisms that may confound variant calling?

> Detailed documentation and descriptions for all MultiQC modules can be seen in
> the 'Documentation' folder of the results directory or here in the [output
> documentation](output.md).

If you're happy everything looks good in terms of sequencing, we then look at
specific directories to find any files you might want to use for downstream
processing.

Note that when you get back to writing up your publication, all the versions of
the tools can be found under the 'nf-core/eager Software Versions' section of
the MultiQC report. Note that all tools in the container are listed, so you may
have to remove some of them that you didn't actually use in the set up.

For example, in the example above, we have used: Nextflow, nf-core/eager,
FastQC, AdapterRemoval, fastp, BWA, Samtools, endorS.py, Picard MarkDuplicates,
Bedtools, Qualimap, PreSeq, DamageProfiler, MultiVCFAnalyzer and MultiQC.

Citations to all used tools can be seen
[here](https://nf-co.re/eager#tool-references).

##### Tutorial Pathogen Genomics - Files for Downstream Analysis

You will find the most relevant output files in your `results/` directory. Each
directory generally corresponds to a specific step or tool of the pipeline. Most
importantly you should look in `deduplication` for your de-duplicated BAM files
(e.g. for viewing in IGV), `bedtools` for depth (X) and breadth (%) coverages of
annotations of your reference (e.g. genes), and `multivcfanalyzer` for final SNP
tables etc. that can be used for downstream phylogenetic applications.

#### Tutorial Pathogen Genomics - Clean up

Finally, I would recommend cleaning up your `work/` directory of any
intermediate files (if your `-profile` does not already do so). You can do this
by going to the directory above your `results/` and `work/` directories, e.g.

```bash
cd /<path>/<to>/projectX_preprocessing20200727
```

and running

```bash
nextflow clean -f -k
```
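
Before cleaning, it can be useful to see how much space `work/` actually takes up. The following is a small sketch using `du`; the helper name `work_size_kb` and the throwaway directory are hypothetical stand-ins for your real run directory.

```bash
# Report the size of a directory in kilobytes
work_size_kb() {
    du -sk "$1" | cut -f1
}

# Throwaway directory standing in for a real run directory
run_dir=$(mktemp -d)
mkdir -p "$run_dir/work/ab/123456"
printf 'intermediate data\n' > "$run_dir/work/ab/123456/tmp.txt"

size_kb=$(work_size_kb "$run_dir/work")
echo "work/ currently uses ${size_kb} KB"

# In a real run directory you could then preview the clean-up first:
# nextflow clean -n -k    # -n is a dry run: lists what would be removed
```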

#### Tutorial Pathogen Genomics - Summary

In this tutorial we have described an example of how to set up an
nf-core/eager run to process microbial aDNA for a relatively standard pathogen
genomics study for phylogenetics and basic functional screening. This includes
performing some simple quality control checks, mapping, genotyping, and SNP table
generation for downstream analysis of the data. Additionally, we described what
to look for in the run summary report generated by MultiQC and where to find
output files that can be used for downstream analysis.


================================================
FILE: environment.yml
================================================
# You can use this file to create a conda environment for this pipeline:
#   conda env create -f environment.yml
name: nf-core-eager-2.5.3
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - conda-forge::python=3.9.4
  - conda-forge::markdown=3.3.4
  - conda-forge::pymdown-extensions=8.2
  - conda-forge::pygments=2.14.0
  - bioconda::rename=1.601
  - conda-forge::openjdk=8.0.144 # Don't upgrade - required for GATK
  - bioconda::fastqc=0.11.9
  - bioconda::adapterremoval=2.3.2
  - bioconda::adapterremovalfixprefix=0.0.5
  - bioconda::bwa=0.7.17
  - bioconda::picard=2.26.0
  - bioconda::samtools=1.12
  - bioconda::dedup=0.12.8
  - bioconda::angsd=0.935
  - bioconda::circularmapper=1.93.5
  - bioconda::gatk4=4.2.0.0
  - bioconda::gatk=3.5 ## Don't upgrade - required for MultiVCFAnalyzer
  - bioconda::qualimap=2.2.2d
  - bioconda::vcf2genome=0.91
  - bioconda::damageprofiler=0.4.9 # Don't upgrade - later versions don't allow java 8
  - bioconda::multiqc=1.16
  - bioconda::pmdtools=0.60
  - bioconda::bedtools=2.30.0
  - conda-forge::libiconv=1.16
  - conda-forge::pigz=2.6
  - bioconda::sequencetools=1.5.2
  - bioconda::preseq=3.1.2
  - bioconda::fastp=0.20.1
  - bioconda::bamutil=1.0.15
  - bioconda::mtnucratio=0.7
  - bioconda::pysam=0.16.0
  - bioconda::kraken2=2.1.2
  - conda-forge::pandas=1.2.4
  - bioconda::freebayes=1.3.5
  - bioconda::sexdeterrmine=1.1.2
  - bioconda::multivcfanalyzer=0.85.2
  - bioconda::hops=0.35
  - bioconda::malt=0.61
  - conda-forge::biopython=1.79
  - conda-forge::xopen=1.1.0
  - bioconda::bowtie2=2.4.4
  - bioconda::eigenstratdatabasetools=1.0.2
  - bioconda::mapdamage2=2.2.1
  - bioconda::bbmap=38.92
  - bioconda::bcftools=1.12

================================================
FILE: lib/Checks.groovy
================================================
import org.yaml.snakeyaml.Yaml

/*
 * This file holds several functions used to perform standard checks for the nf-core pipeline template.
 */

class Checks {

    static void check_conda_channels(log) {
        Yaml parser = new Yaml()
        def channels = []
        try {
            def config = parser.load("conda config --show channels".execute().text)
            channels = config.channels
        } catch(NullPointerException | IOException e) {
            log.warn "Could not verify conda channel configuration."
            return
        }

        // Check that all channels are present
        def required_channels = ['conda-forge', 'bioconda', 'defaults']
        def conda_check_failed = !required_channels.every { ch -> ch in channels }

        // Check that they are in the right order
        conda_check_failed |= !(channels.indexOf('conda-forge') < channels.indexOf('bioconda'))
        conda_check_failed |= !(channels.indexOf('bioconda') < channels.indexOf('defaults'))

        if (conda_check_failed) {
            log.warn "=============================================================================\n" +
                     "  There is a problem with your Conda configuration!\n\n" +
                     "  You will need to set-up the conda-forge and bioconda channels correctly.\n" +
                     "  Please refer to https://bioconda.github.io/user/install.html#set-up-channels\n" +
                     "  NB: The order of the channels matters!\n" +
                     "============================================================================="
        }
    }

    static void aws_batch(workflow, params) {
        if (workflow.profile.contains('awsbatch')) {
            assert (params.awsqueue && params.awsregion) : "Specify correct --awsqueue and --awsregion parameters on AWSBatch!"
            // Check outdir paths to be S3 buckets if running on AWSBatch
            // related: https://github.com/nextflow-io/nextflow/issues/813
            assert params.outdir.startsWith('s3:')       : "Outdir not on S3 - specify S3 Bucket to run on AWSBatch!"
            // Prevent trace files from being stored on S3, since S3 does not support rolling files.
            assert !params.tracedir.startsWith('s3:')    :  "Specify a local tracedir or run without trace! S3 cannot be used for tracefiles."
        }
    }

    static void hostname(workflow, params, log) {
        Map colors = Headers.log_colours(params.monochrome_logs)
        if (params.hostnames) {
            def hostname = "hostname".execute().text.trim()
            params.hostnames.each { prof, hnames ->
                hnames.each { hname ->
                    if (hostname.contains(hname) && !workflow.profile.contains(prof)) {
                        log.info "=${colors.yellow}====================================================${colors.reset}=\n" +
                                  "${colors.yellow}WARN: You are running with `-profile $workflow.profile`\n" +
                                  "      but your machine hostname is ${colors.white}'$hostname'${colors.reset}.\n" +
                                  "      ${colors.yellow_bold}Please use `-profile $prof${colors.reset}`\n" +
                                  "=${colors.yellow}====================================================${colors.reset}="
                    }
                }
            }
        }
    }

    // Citation string
    private static String citation(workflow) {
        return "If you use ${workflow.manifest.name} for your analysis please cite:\n\n" +
               "* The pipeline\n" + 
               "  https://doi.org/10.1101/2020.06.11.145615\n\n" +
               "* The nf-core framework\n" +
               "  https://dx.doi.org/10.1038/s41587-020-0439-x\n" +
               "  https://rdcu.be/b1GjZ\n\n" +
               "* Software dependencies\n" +
               "  https://github.com/${workflow.manifest.name}/blob/master/CITATIONS.md"
    }


}


================================================
FILE: lib/Completion.groovy
================================================
/*
 * Functions to be run on completion of pipeline
 */

class Completion {
    static void email(workflow, params, summary_params, projectDir, log, multiqc_report=[]) {

        // Set up the e-mail variables
        def subject = "[$workflow.manifest.name] Successful: $workflow.runName"

        if (!workflow.success) {
            subject = "[$workflow.manifest.name] FAILED: $workflow.runName"
        }

        def summary = [:]
        for (group in summary_params.keySet()) {
            summary << summary_params[group]
        }
        
        def misc_fields = [:]
        misc_fields['Date Started']              = workflow.start
        misc_fields['Date Completed']            = workflow.complete
        misc_fields['Pipeline script file path'] = workflow.scriptFile
        misc_fields['Pipeline script hash ID']   = workflow.scriptId
        if (workflow.repository) misc_fields['Pipeline repository Git URL']    = workflow.repository
        if (workflow.commitId)   misc_fields['Pipeline repository Git Commit'] = workflow.commitId
        if (workflow.revision)   misc_fields['Pipeline Git branch/tag']        = workflow.revision
        misc_fields['Nextflow Version']           = workflow.nextflow.version
        misc_fields['Nextflow Build']             = workflow.nextflow.build
        misc_fields['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp

        def email_fields = [:]
        email_fields['version']             = workflow.manifest.version
        email_fields['runName']             = workflow.runName
        email_fields['success']             = workflow.success
        email_fields['dateComplete']        = workflow.complete
        email_fields['duration']            = workflow.duration
        email_fields['exitStatus']          = workflow.exitStatus
        email_fields['errorMessage']        = (workflow.errorMessage ?: 'None')
        email_fields['errorReport']         = (workflow.errorReport ?: 'None')
        email_fields['commandLine']         = workflow.commandLine
        email_fields['projectDir']          = workflow.projectDir
        email_fields['summary']             = summary << misc_fields
        
        // On success, try to attach the MultiQC report
        def mqc_report = null
        try {
            if (workflow.success) {
                mqc_report = multiqc_report.getVal()
                if (mqc_report.getClass() == ArrayList && mqc_report.size() >= 1) {
                    if (mqc_report.size() > 1) {
                        log.warn "[$workflow.manifest.name] Found multiple reports from process 'MULTIQC', will use only one"
                    }
                    mqc_report = mqc_report[0]
                }
            }
        } catch (all) {
            log.warn "[$workflow.manifest.name] Could not attach MultiQC report to summary email"
        }

        // Check if we are only sending emails on failure
        def email_address = params.email
        if (!params.email && params.email_on_fail && !workflow.success) {
            email_address = params.email_on_fail
        }

        // Render the TXT template
        def engine       = new groovy.text.GStringTemplateEngine()
        def tf           = new File("$projectDir/assets/email_template.txt")
        def txt_template = engine.createTemplate(tf).make(email_fields)
        def email_txt    = txt_template.toString()

        // Render the HTML template
        def hf            = new File("$projectDir/assets/email_template.html")
        def html_template = engine.createTemplate(hf).make(email_fields)
        def email_html    = html_template.toString()

        // Render the sendmail template
        def max_multiqc_email_size = params.max_multiqc_email_size as nextflow.util.MemoryUnit 
        def smail_fields           = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "$projectDir", mqcFile: mqc_report, mqcMaxSize:  max_multiqc_email_size.toBytes()]
        def sf                     = new File("$projectDir/assets/sendmail_template.txt")
        def sendmail_template      = engine.createTemplate(sf).make(smail_fields)
        def sendmail_html          = sendmail_template.toString()

        // Send the HTML e-mail
        Map colors = Headers.log_colours(params.monochrome_logs)
        if (email_address) {
            try {
                // Deliberately throw so that the catch block below sends the e-mail via `mail` instead
                if (params.plaintext_email) { throw new GroovyException('Send plaintext e-mail, not HTML') }
                // Try to send HTML e-mail using sendmail
                [ 'sendmail', '-t' ].execute() << sendmail_html
                log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Sent summary e-mail to $email_address (sendmail)-"
            } catch (all) {
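                // Catch failures and fall back to the `mail` command
                def mail_cmd = [ 'mail', '-s', subject, '--content-type=text/html', email_address ]
                // Only attach the MultiQC report if one was found and it is small enough
                if ( mqc_report != null && mqc_report.size() <= max_multiqc_email_size.toBytes() ) {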
                    mail_cmd += [ '-A', mqc_report ]
                }
                mail_cmd.execute() << email_html
                log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Sent summary e-mail to $email_address (mail)-"
            }
        }

        // Write summary e-mail HTML to a file
        def output_d = new File("${params.outdir}/pipeline_info/")
        if (!output_d.exists()) {
            output_d.mkdirs()
        }
        def output_hf = new File(output_d, "pipeline_report.html")
        output_hf.withWriter { w -> w << email_html }
        def output_tf = new File(output_d, "pipeline_report.txt")
        output_tf.withWriter { w -> w << email_txt }
    }

    static void summary(workflow, params, log, fail_percent_mapped=[:], pass_percent_mapped=[:]) {
        Map colors = Headers.log_colours(params.monochrome_logs)

        if (workflow.success) {
            if (workflow.stats.ignoredCount == 0) {
                log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Pipeline completed successfully${colors.reset}-"
            } else {
                log.info "-${colors.purple}[$workflow.manifest.name]${colors.red} Pipeline completed successfully, but with errored process(es) ${colors.reset}-"
            }
        } else {
            Checks.hostname(workflow, params, log)
            log.info "-${colors.purple}[$workflow.manifest.name]${colors.red} Pipeline completed with errors${colors.reset}-"
        }
    }
}


================================================
FILE: lib/Headers.groovy
================================================
/*
 * This file holds several functions used to render the nf-core ANSI header.
 */

class Headers {

    private static Map log_colours(Boolean monochrome_logs) {
        Map colorcodes = [:]
        colorcodes['reset']       = monochrome_logs ? '' : "\033[0m"
        colorcodes['dim']         = monochrome_logs ? '' : "\033[2m"
        colorcodes['black']       = monochrome_logs ? '' : "\033[0;30m"
        colorcodes['green']       = monochrome_logs ? '' : "\033[0;32m"
        colorcodes['yellow']      = monochrome_logs ? '' : "\033[0;33m"
        colorcodes['yellow_bold'] = monochrome_logs ? '' : "\033[1;93m"
        colorcodes['blue']        = monochrome_logs ? '' : "\033[0;34m"
        colorcodes['purple']      = monochrome_logs ? '' : "\033[0;35m"
        colorcodes['cyan']        = monochrome_logs ? '' : "\033[0;36m"
        colorcodes['white']       = monochrome_logs ? '' : "\033[0;37m"
        colorcodes['red']         = monochrome_logs ? '' : "\033[1;91m"
        return colorcodes
    }

    static String dashed_line(monochrome_logs) {
        Map colors = log_colours(monochrome_logs)
        return "-${colors.dim}----------------------------------------------------${colors.reset}-"
    }

    static String nf_core(workflow, monochrome_logs) {
        Map colors = log_colours(monochrome_logs)
        String.format(
            """\n
            ${dashed_line(monochrome_logs)}
                                                    ${colors.green},--.${colors.black}/${colors.green},-.${colors.reset}
            ${colors.blue}        ___     __   __   __   ___     ${colors.green}/,-._.--~\'${colors.reset}
            ${colors.blue}  |\\ | |__  __ /  ` /  \\ |__) |__         ${colors.yellow}}  {${colors.reset}
            ${colors.blue}  | \\| |       \\__, \\__/ |  \\ |___     ${colors.green}\\`-._,-`-,${colors.reset}
                                                    ${colors.green}`._,._,\'${colors.reset}
            ${colors.purple}  ${workflow.manifest.name} v${workflow.manifest.version}${colors.reset}
            ${dashed_line(monochrome_logs)}
            """.stripIndent()
        )
    }
}


================================================
FILE: lib/NfcoreSchema.groovy
================================================
/*
 * This file holds several functions used to perform JSON parameter validation, help and summary rendering for the nf-core pipeline template.
 */

import org.everit.json.schema.Schema
import org.everit.json.schema.loader.SchemaLoader
import org.everit.json.schema.ValidationException
import org.json.JSONObject
import org.json.JSONTokener
import org.json.JSONArray
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

class NfcoreSchema {

    /*
    * Function to loop over all parameters defined in schema and check
    * whether the given parameters adhere to the specifications
    */
    /* groovylint-disable-next-line UnusedPrivateMethodParameter */
    private static void validateParameters(params, jsonSchema, log) {
        def has_error = false
        //=====================================================================//
        // Check for nextflow core params and unexpected params
        def json = new File(jsonSchema).text
        def Map schemaParams = (Map) new JsonSlurper().parseText(json).get('definitions')
        def nf_params = [
            // Options for base `nextflow` command
            'bg',
            'c',
            'C',
            'config',
            'd',
            'D',
            'dockerize',
            'h',
            'log',
            'q',
            'quiet',
            'syslog',
            'v',
            'version',

            // Options for `nextflow run` command
            'ansi',
            'ansi-log',
            'bg',
            'bucket-dir',
            'c',
            'cache',
            'config',
            'dsl2',
            'dump-channels',
            'dump-hashes',
            'E',
            'entry',
            'latest',
            'lib',
            'main-script',
            'N',
            'name',
            'offline',
            'params-file',
            'pi',
            'plugins',
            'poll-interval',
            'pool-size',
            'profile',
            'ps',
            'qs',
            'queue-size',
            'r',
            'resume',
            'revision',
            'stdin',
            'stub',
            'stub-run',
            'test',
            'w',
            'with-charliecloud',
            'with-conda',
            'with-dag',
            'with-docker',
            'with-mpi',
            'with-notification',
            'with-podman',
            'with-report',
            'with-singularity',
            'with-timeline',
            'with-tower',
            'with-trace',
            'with-weblog',
            'without-docker',
            'without-podman',
            'work-dir'
        ]
        def unexpectedParams = []

        // Collect expected parameters from the schema
        def expectedParams = []
        for (group in schemaParams) {
            for (p in group.value['properties']) {
                expectedParams.push(p.key)
            }
        }

        for (specifiedParam in params.keySet()) {
            // nextflow params
            if (nf_params.contains(specifiedParam)) {
                log.error "ERROR: You used a core Nextflow option with two hyphens: '--${specifiedParam}'. Please resubmit with '-${specifiedParam}'"
                has_error = true
            }
            // unexpected params
            def params_ignore = params.schema_ignore_params.split(',') + 'schema_ignore_params'
            def expectedParamsLowerCase = expectedParams.collect{ it.replace("-", "").toLowerCase() }
            def specifiedParamLowerCase = specifiedParam.replace("-", "").toLowerCase()
            if (!expectedParams.contains(specifiedParam) && !params_ignore.contains(specifiedParam) && !expectedParamsLowerCase.contains(specifiedParamLowerCase)) {
                // Temporarily remove camelCase/camel-case params #1035
                def unexpectedParamsLowerCase = unexpectedParams.collect{ it.replace("-", "").toLowerCase()}
                if (!unexpectedParamsLowerCase.contains(specifiedParamLowerCase)){
                    unexpectedParams.push(specifiedParam)
                }
            }
        }

        //=====================================================================//
        // Validate parameters against the schema
        InputStream inputStream = new File(jsonSchema).newInputStream()
        JSONObject rawSchema = new JSONObject(new JSONTokener(inputStream))

        // Remove anything that's in params.schema_ignore_params
        rawSchema = removeIgnoredParams(rawSchema, params)

        Schema schema = SchemaLoader.load(rawSchema)

        // Clean the parameters
        def cleanedParams = cleanParameters(params)

        // Convert to JSONObject
        def jsonParams = new JsonBuilder(cleanedParams)
        JSONObject paramsJSON = new JSONObject(jsonParams.toString())

        // Validate
        try {
            schema.validate(paramsJSON)
        } catch (ValidationException e) {
            println ''
            log.error 'ERROR: Validation of pipeline parameters failed!'
            JSONObject exceptionJSON = e.toJSON()
            printExceptions(exceptionJSON, paramsJSON, log)
            println ''
            has_error = true
        }

        // Check for unexpected parameters
        if (unexpectedParams.size() > 0) {
            Map colors = log_colours(params.monochrome_logs)
            println ''
            def warn_msg = 'Found unexpected parameters:'
            for (unexpectedParam in unexpectedParams) {
                warn_msg = warn_msg + "\n* --${unexpectedParam}: ${params[unexpectedParam].toString()}"
            }
            log.warn warn_msg
            log.info "- ${colors.dim}Ignore this warning: params.schema_ignore_params = \"${unexpectedParams.join(',')}\" ${colors.reset}"
            println ''
        }

        if (has_error) {
            System.exit(1)
        }
    }

    // Loop over nested exceptions and print the causingException
    private static void printExceptions(exJSON, paramsJSON, log) {
        def causingExceptions = exJSON['causingExceptions']
        if (causingExceptions.length() == 0) {
            def m = exJSON['message'] =~ /required key \[([^\]]+)\] not found/
            // Missing required param
            if (m.matches()) {
                log.error "* Missing required parameter: --${m[0][1]}"
            }
            // Other base-level error
            else if (exJSON['pointerToViolation'] == '#') {
                log.error "* ${exJSON['message']}"
            }
            // Error with specific param
            else {
                def param = exJSON['pointerToViolation'] - ~/^#\//
                def param_val = paramsJSON[param].toString()
                log.error "* --${param}: ${exJSON['message']} (${param_val})"
            }
        }
        for (ex in causingExceptions) {
            printExceptions(ex, paramsJSON, log)
        }
    }

    // Remove an element from a JSONArray
    private static JSONArray removeElement(jsonArray, element){
        def list = []
        int len = jsonArray.length()
        for (int i=0;i<len;i++){
            list.add(jsonArray.get(i).toString())
        }
        list.remove(element)
        JSONArray jsArray = new JSONArray(list)
        return jsArray
    }

    private static JSONObject removeIgnoredParams(rawSchema, params){
        // Remove anything that's in params.schema_ignore_params
        params.schema_ignore_params.split(',').each{ ignore_param ->
            if(rawSchema.keySet().contains('definitions')){
                rawSchema.definitions.each { definition ->
                    for (key in definition.keySet()){
                        if (definition[key].get("properties").keySet().contains(ignore_param)){
                            // Remove the param to ignore
                            definition[key].get("properties").remove(ignore_param)
                            // If the param was required, change this
                            if (definition[key].has("required")) {
                                def cleaned_required = removeElement(definition[key].required, ignore_param)
                                definition[key].put("required", cleaned_required)
                            }
                        }
                    }
                }
            }
            if(rawSchema.keySet().contains('properties') && rawSchema.get('properties').keySet().contains(ignore_param)) {
                rawSchema.get("properties").remove(ignore_param)
            }
            if(rawSchema.keySet().contains('required') && rawSchema.required.contains(ignore_param)) {
                def cleaned_required = removeElement(rawSchema.required, ignore_param)
                rawSchema.put("required", cleaned_required)
            }
        }
        return rawSchema
    }

    private static Map cleanParameters(params) {
        def new_params = params.getClass().newInstance(params)
        for (p in params) {
            // remove anything evaluating to false
            if (!p['value']) {
                new_params.remove(p.key)
            }
            // Cast MemoryUnit to String
            if (p['value'].getClass() == nextflow.util.MemoryUnit) {
                new_params.replace(p.key, p['value'].toString())
            }
            // Cast Duration to String
            if (p['value'].getClass() == nextflow.util.Duration) {
                new_params.replace(p.key, p['value'].toString().replaceFirst(/d(?!\S)/, "day"))
            }
            // Cast LinkedHashMap to String
            if (p['value'].getClass() == LinkedHashMap) {
                new_params.replace(p.key, p['value'].toString())
            }
        }
        return new_params
    }

    /*
     * This method tries to read a JSON params file
     */
    private static LinkedHashMap params_load(String json_schema) {
        def params_map = new LinkedHashMap()
        try {
            params_map = params_read(json_schema)
        } catch (Exception e) {
            println "Could not read parameters settings from JSON. $e"
            params_map = new LinkedHashMap()
        }
        return params_map
    }

    private static Map log_colours(Boolean monochrome_logs) {
        Map colorcodes = [:]

        // Reset / Meta
        colorcodes['reset']       = monochrome_logs ? '' : "\033[0m"
        colorcodes['bold']        = monochrome_logs ? '' : "\033[1m"
        colorcodes['dim']         = monochrome_logs ? '' : "\033[2m"
        colorcodes['underlined']  = monochrome_logs ? '' : "\033[4m"
        colorcodes['blink']       = monochrome_logs ? '' : "\033[5m"
        colorcodes['reverse']     = monochrome_logs ? '' : "\033[7m"
        colorcodes['hidden']      = monochrome_logs ? '' : "\033[8m"

        // Regular Colors
        colorcodes['black']       = monochrome_logs ? '' : "\033[0;30m"
        colorcodes['red']         = monochrome_logs ? '' : "\033[0;31m"
        colorcodes['green']       = monochrome_logs ? '' : "\033[0;32m"
        colorcodes['yellow']      = monochrome_logs ? '' : "\033[0;33m"
        colorcodes['blue']        = monochrome_logs ? '' : "\033[0;34m"
        colorcodes['purple']      = monochrome_logs ? '' : "\033[0;35m"
        colorcodes['cyan']        = monochrome_logs ? '' : "\033[0;36m"
        colorcodes['white']       = monochrome_logs ? '' : "\033[0;37m"

        // Bold
        colorcodes['bblack']      = monochrome_logs ? '' : "\033[1;30m"
        colorcodes['bred']        = monochrome_logs ? '' : "\033[1;31m"
        colorcodes['bgreen']      = monochrome_logs ? '' : "\033[1;32m"
        colorcodes['byellow']     = monochrome_logs ? '' : "\033[1;33m"
        colorcodes['bblue']       = monochrome_logs ? '' : "\033[1;34m"
        colorcodes['bpurple']     = monochrome_logs ? '' : "\033[1;35m"
        colorcodes['bcyan']       = monochrome_logs ? '' : "\033[1;36m"
        colorcodes['bwhite']      = monochrome_logs ? '' : "\033[1;37m"

        // Underline
        colorcodes['ublack']      = monochrome_logs ? '' : "\033[4;30m"
        colorcodes['ured']        = monochrome_logs ? '' : "\033[4;31m"
        colorcodes['ugreen']      = monochrome_logs ? '' : "\033[4;32m"
        colorcodes['uyellow']     = monochrome_logs ? '' : "\033[4;33m"
        colorcodes['ublue']       = monochrome_logs ? '' : "\033[4;34m"
        colorcodes['upurple']     = monochrome_logs ? '' : "\033[4;35m"
        colorcodes['ucyan']       = monochrome_logs ? '' : "\033[4;36m"
        colorcodes['uwhite']      = monochrome_logs ? '' : "\033[4;37m"

        // High Intensity
        colorcodes['iblack']      = monochrome_logs ? '' : "\033[0;90m"
        colorcodes['ired']        = monochrome_logs ? '' : "\033[0;91m"
        colorcodes['igreen']      = monochrome_logs ? '' : "\033[0;92m"
        colorcodes['iyellow']     = monochrome_logs ? '' : "\033[0;93m"
        colorcodes['iblue']       = monochrome_logs ? '' : "\033[0;94m"
        colorcodes['ipurple']     = monochrome_logs ? '' : "\033[0;95m"
        colorcodes['icyan']       = monochrome_logs ? '' : "\033[0;96m"
        colorcodes['iwhite']      = monochrome_logs ? '' : "\033[0;97m"

        // Bold High Intensity
        colorcodes['biblack']     = monochrome_logs ? '' : "\033[1;90m"
        colorcodes['bired']       = monochrome_logs ? '' : "\033[1;91m"
        colorcodes['bigreen']     = monochrome_logs ? '' : "\033[1;92m"
        colorcodes['biyellow']    = monochrome_logs ? '' : "\033[1;93m"
        colorcodes['biblue']      = monochrome_logs ? '' : "\033[1;94m"
        colorcodes['bipurple']    = monochrome_logs ? '' : "\033[1;95m"
        colorcodes['bicyan']      = monochrome_logs ? '' : "\033[1;96m"
        colorcodes['biwhite']     = monochrome_logs ? '' : "\033[1;97m"

        return colorcodes
    }

    static String dashed_line(monochrome_logs) {
        Map colors = log_colours(monochrome_logs)
        return "-${colors.dim}----------------------------------------------------${colors.reset}-"
    }

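    /*
     * Read the JSON schema file and return a map of parameter groups.
     * Each group title is a key whose value is a map of that group's parameters:
     *   Group title
     *     - parameter 1 name -> schema entry (type, description, default, ...)
     *     - parameter 2 name -> schema entry
     * Ungrouped parameters are collected under "Other parameters".
     */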
    private static LinkedHashMap params_read(String json_schema) throws Exception {
        def json = new File(json_schema).text
        def Map schema_definitions = (Map) new JsonSlurper().parseText(json).get('definitions')
        def Map schema_properties = (Map) new JsonSlurper().parseText(json).get('properties')
        /* Tree looks like this in nf-core schema
         * definitions <- this is what the first get('definitions') gets us
             group 1
               title
               description
                 properties
                   parameter 1
                     type
                     description
                   parameter 2
                     type
                     description
             group 2
               title
               description
                 properties
                   parameter 1
                     type
                     description
         * properties <- parameters can also be ungrouped, outside of definitions
            parameter 1
             type
             description
        */

        // Grouped params
        def params_map = new LinkedHashMap()
        schema_definitions.each { key, val ->
            def Map group = schema_definitions."$key".properties // Gets the property object of the group
            def title = schema_definitions."$key".title
            def sub_params = new LinkedHashMap()
            group.each { innerkey, value ->
                sub_params.put(innerkey, value)
            }
            params_map.put(title, sub_params)
        }

        // Ungrouped params
        def ungrouped_params = new LinkedHashMap()
        schema_properties.each { innerkey, value ->
            ungrouped_params.put(innerkey, value)
        }
        params_map.put("Other parameters", ungrouped_params)

        return params_map
    }

    /*
     * Get maximum number of characters across all parameter names
     */
    private static Integer params_max_chars(params_map) {
        Integer max_chars = 0
        for (group in params_map.keySet()) {
            def group_params = params_map.get(group)  // This gets the parameters of that particular group
            for (param in group_params.keySet()) {
                if (param.size() > max_chars) {
                    max_chars = param.size()
                }
            }
        }
        return max_chars
    }

    /*
     * Beautify parameters for --help
     */
    private static String params_help(workflow, params, json_schema, command) {
        Map colors = log_colours(params.monochrome_logs)
        Integer num_hidden = 0
        String output  = ''
        output        += 'Typical pipeline command:\n\n'
        output        += "  ${colors.cyan}${command}${colors.reset}\n\n"
        Map params_map = params_load(json_schema)
        Integer max_chars  = params_max_chars(params_map) + 1
        Integer desc_indent = max_chars + 14
        Integer dec_linewidth = 160 - desc_indent
        for (group in params_map.keySet()) {
            Integer num_params = 0
            String group_output = colors.underlined + colors.bold + group + colors.reset + '\n'
            def group_params = params_map.get(group)  // This gets the parameters of that particular group
            for (param in group_params.keySet()) {
                if (group_params.get(param).hidden && !params.show_hidden_params) {
                    num_hidden += 1
                    continue
                }
                def type = '[' + group_params.get(param).type + ']'
                def description = group_params.get(param).description
                def defaultValue = group_params.get(param).default ? " [default: " + group_params.get(param).default.toString() + "]" : ''
                def description_default = description + colors.dim + defaultValue + colors.reset
                // Wrap long description texts
                // Loosely based on https://dzone.com/articles/groovy-plain-text-word-wrap
                if (description_default.length() > dec_linewidth){
                    List olines = []
                    String oline = "" // " " * indent
                    description_default.split(" ").each() { wrd ->
                        if ((oline.size() + wrd.size()) <= dec_linewidth) {
                            oline += wrd + " "
                        } else {
                            olines += oline
                            oline = wrd + " "
                        }
                    }
                    olines += oline
                    description_default = olines.join("\n" + " " * desc_indent)
                }
                group_output += "  --" +  param.padRight(max_chars) + colors.dim + type.padRight(10) + colors.reset + description_default + '\n'
                num_params += 1
            }
            group_output += '\n'
            if (num_params > 0){
                output += group_output
            }
        }
        output += dashed_line(params.monochrome_logs)
        if (num_hidden > 0){
            output += colors.dim + "\n Hiding $num_hidden params, use --show_hidden_params to show.\n" + colors.reset
            output += dashed_line(params.monochrome_logs)
        }
        return output
    }

    /*
     * Groovy Map summarising parameters/workflow options used by the pipeline
     */
    private static LinkedHashMap params_summary_map(workflow, params, json_schema) {
        // Get a selection of core Nextflow workflow options
        def Map workflow_summary = [:]
        if (workflow.revision) {
            workflow_summary['revision'] = workflow.revision
        }
        workflow_summary['runName']      = workflow.runName
        if (workflow.containerEngine) {
            workflow_summary['containerEngine'] = workflow.containerEngine
        }
        if (workflow.container) {
            workflow_summary['container'] = workflow.container
        }
        workflow_summary['launchDir']    = workflow.launchDir
        workflow_summary['workDir']      = workflow.workDir
        workflow_summary['projectDir']   = workflow.projectDir
        workflow_summary['userName']     = workflow.userName
        workflow_summary['profile']      = workflow.profile
        workflow_summary['configFiles']  = workflow.configFiles.join(', ')

        // Get pipeline parameters defined in JSON Schema
        def Map params_summary = [:]
        def blacklist  = ['hostnames']
        def params_map = params_load(json_schema)
        for (group in params_map.keySet()) {
            def sub_params = new LinkedHashMap()
            def group_params = params_map.get(group)  // This gets the parameters of that particular group
            for (param in group_params.keySet()) {
                if (params.containsKey(param) && !blacklist.contains(param)) {
                    def params_value = params.get(param)
                    def schema_value = group_params.get(param).default
                    def param_type   = group_params.get(param).type
                    if (schema_value != null) {
                        if (param_type == 'string') {
                            if (schema_value.contains('$projectDir') || schema_value.contains('${projectDir}')) {
                                def sub_string = schema_value.replace('\$projectDir', '')
                                sub_string     = sub_string.replace('\${projectDir}', '')
                                if (params_value.contains(sub_string)) {
                                    schema_value = params_value
                                }
                            }
                            if (schema_value.contains('$params.outdir') || schema_value.contains('${params.outdir}')) {
                                def sub_string = schema_value.replace('\$params.outdir', '')
                                sub_string     = sub_string.replace('\${params.outdir}', '')
                                if ("${params.outdir}${sub_string}" == params_value) {
                                    schema_value = params_value
                                }
                            }
                        }
                    }

                    // We have a default in the schema, and this isn't it
                    if (schema_value != null && params_value != schema_value) {
                        sub_params.put(param, params_value)
                    }
                    // No default in the schema, and this isn't empty
                    else if (schema_value == null && params_value != "" && params_value != null && params_value != false) {
                        sub_params.put(param, params_value)
                    }
                }
            }
            params_summary.put(group, sub_params)
        }
        return [ 'Core Nextflow options' : workflow_summary ] << params_summary
    }

    /*
     * Beautify parameters for summary and return as string
     */
    private static String params_summary_log(workflow, params, json_schema) {
        Map colors = log_colours(params.monochrome_logs)
        String output  = ''
        def params_map = params_summary_map(workflow, params, json_schema)
        def max_chars  = params_max_chars(params_map)
        for (group in params_map.keySet()) {
            def group_params = params_map.get(group)  // This gets the parameters of that particular group
            if (group_params) {
                output += colors.bold + group + colors.reset + '\n'
                for (param in group_params.keySet()) {
                    output += "  " + colors.blue + param.padRight(max_chars) + ": " + colors.green +  group_params.get(param) + colors.reset + '\n'
                }
                output += '\n'
            }
        }
        output += dashed_line(params.monochrome_logs)
        output += colors.dim + "\n Only displaying parameters that differ from defaults.\n" + colors.reset
        output += dashed_line(params.monochrome_logs)
        return output
    }

}


================================================
FILE: main.nf
================================================
#!/usr/bin/env nextflow
/*
------------------------------------------------------------------------------------------------------------
                         nf-core/eager
------------------------------------------------------------------------------------------------------------
 EAGER Analysis Pipeline. Started 2018-06-05
 #### Homepage / Documentation
 https://github.com/nf-core/eager
 #### Authors
 For a list of authors and contributors, see: https://github.com/nf-core/eager/tree/dev#authors-alphabetical
------------------------------------------------------------------------------------------------------------
*/
nextflow.enable.dsl=1

log.info Headers.nf_core(workflow, params.monochrome_logs)

////////////////////////////////////////////////////
/* --               PRINT HELP                 -- */
////////////////////////////////////////////////////
def json_schema = "$projectDir/nextflow_schema.json"
if (params.help) {
    def command = "nextflow run nf-core/eager --input '*_R{1,2}.fastq.gz' -profile docker"
    log.info NfcoreSchema.params_help(workflow, params, json_schema, command)
    exit 0
}

////////////////////////////////////////////////////
/* --         VALIDATE PARAMETERS              -- */
////////////////////////////////////////////////////
if (params.validate_params) {
    NfcoreSchema.validateParameters(params, json_schema, log)
}

// Validate BAM input isn't set to paired_end
if ( params.bam && !params.single_end ) {
  exit 1, "[nf-core/eager] error: bams can only be specified with --single_end. Please check input command."
}

// Do not allow input bams to be suffixed with '.unmapped.bam'
if (params.bam && params.input.endsWith('.unmapped.bam')) {
  exit 1, "[nf-core/eager] error: Input BAM file names ending in '.unmapped.bam' are not allowed. Please rename your input BAM(s)."
}

// Validate that skip_collapse is only set to True for paired_end reads!
if (!has_extension(params.input, "tsv") && params.skip_collapse  && params.single_end){
    exit 1, "[nf-core/eager] error: --skip_collapse can only be set for paired_end samples."
}

// Validate not trying to both skip collapse and skip trim
if ( params.skip_collapse && params.skip_trim ) {
  exit 1, "[nf-core/eager] error: you have specified to skip both merging and trimming of paired end samples. Use --skip_adapterremoval instead."
}

// Bedtools validation
if( params.run_bedtools_coverage && !params.anno_file ){
  exit 1, "[nf-core/eager] error: you have turned on bedtools coverage, but not specified a BED or GFF file with --anno_file. Please validate your parameters."
}

// Preseq validation
if( !params.skip_preseq && !( params.preseq_mode == 'c_curve' || params.preseq_mode == 'lc_extrap' ) ) {
  exit 1, "[nf-core/eager] error: you are running preseq with an unsupported mode. See documentation for more information. You gave: ${params.preseq_mode}."
}

// BAM filtering validation
if (!params.run_bam_filtering && params.bam_mapping_quality_threshold != 0) {
  exit 1, "[nf-core/eager] error: please turn on BAM filtering if you want to perform mapping quality filtering! Provide: --run_bam_filtering."
}

if (params.dedupper == 'dedup' && !params.mergedonly) {
    log.warn "[nf-core/eager] Warning: you are using DeDup but without specifying --mergedonly for AdapterRemoval, dedup will likely fail! See documentation for more information."
}

// Genotyping validation
if (params.run_genotyping){

  if (params.genotyping_tool == 'pileupcaller' && ( !params.pileupcaller_bedfile || !params.pileupcaller_snpfile ) ) {
    exit 1, "[nf-core/eager] error: please check your pileupCaller bed file and snp file parameters. You must supply a bed file and a snp file."
  }

  if (params.genotyping_tool == 'angsd' && ! ( params.angsd_glformat == 'text' || params.angsd_glformat == 'binary' || params.angsd_glformat == 'binary_three' || params.angsd_glformat == 'beagle' ) ) {
    exit 1, "[nf-core/eager] error: please check your ANGSD output format! Options: 'text', 'binary', 'binary_three', 'beagle'. Found parameter: --angsd_glformat '${params.angsd_glformat}'."
  }
}

// Consensus sequence generation validation
if (params.run_vcf2genome) {
    if (!params.run_genotyping) {
      exit 1, "[nf-core/eager] error: consensus sequence generation requires genotyping via UnifiedGenotyper to be turned on with the parameters --run_genotyping and --genotyping_tool 'ug'. Please check your genotyping parameters."
    }

    if (params.genotyping_tool != 'ug') {
      exit 1, "[nf-core/eager] error: consensus sequence generation requires genotyping via UnifiedGenotyper to be turned on with the parameters --run_genotyping and --genotyping_tool 'ug'. Found parameter: --genotyping_tool '${params.genotyping_tool}'."
    }
}

// MultiVCFAnalyzer validation
if (params.run_multivcfanalyzer) {
  if (!params.run_genotyping) {
    exit 1, "[nf-core/eager] error: MultiVCFAnalyzer requires genotyping to be turned on with the parameter --run_genotyping. Please check your genotyping parameters."
  }

  if (params.genotyping_tool != "ug") {
    exit 1, "[nf-core/eager] error: MultiVCFAnalyzer only accepts VCF files from GATK UnifiedGenotyper. Found parameter: --genotyping_tool '${params.genotyping_tool}'."
  }

  if (params.gatk_ploidy != 2) {
    exit 1, "[nf-core/eager] error: MultiVCFAnalyzer only accepts VCF files generated with a GATK ploidy set to 2. Found parameter: --gatk_ploidy ${params.gatk_ploidy}."
  }

  if (params.additional_vcf_files) {
      ch_extravcfs_for_multivcfanalyzer = Channel.fromPath(params.additional_vcf_files, checkIfExists: true)
  }
}

if (params.run_metagenomic_screening) {

  if ( !params.run_bam_filtering ) {
    exit 1, "[nf-core/eager] error: metagenomic classification can only run on unmapped reads. Please supply --run_bam_filtering --bam_unmapped_type 'fastq'."
  }

  if ( params.bam_unmapped_type != "fastq" ) {
    exit 1, "[nf-core/eager] error: metagenomic classification can only run on unmapped reads. Please supply --bam_unmapped_type 'fastq'. Supplied: --bam_unmapped_type '${params.bam_unmapped_type}'."
  }

  if (!params.database) {
    exit 1, "[nf-core/eager] error: metagenomic classification requires a path to a database directory. Please specify one with --database '/path/to/database/'."
  }

  if (params.metagenomic_tool == 'malt' && params.malt_min_support_mode == 'percent' && params.metagenomic_min_support_reads != 1) {
    exit 1, "[nf-core/eager] error: incompatible MALT min support configuration. Percent can only be used with --malt_min_support_percent. You modified: --metagenomic_min_support_reads."
  }

  if (params.metagenomic_tool == 'malt' && params.malt_min_support_mode == 'reads' && params.malt_min_support_percent != 0.01) {
    exit 1, "[nf-core/eager] error: incompatible MALT min support configuration. Reads can only be used with --metagenomic_min_support_reads. You modified: --malt_min_support_percent."
  }

  if (!params.metagenomic_min_support_reads.toString().isInteger()){
    exit 1, "[nf-core/eager] error: incompatible min_support_reads configuration. min_support_reads can only be used with integers. Found parameter: --metagenomic_min_support_reads ${params.metagenomic_min_support_reads}."
  }
}

// MaltExtract validation
if (params.run_maltextract) {

  if (params.run_metagenomic_screening && params.metagenomic_tool != 'malt') {
    exit 1, "[nf-core/eager] error: MaltExtract can only accept MALT output. Please supply --metagenomic_tool 'malt'. Found parameter: --metagenomic_tool '${params.metagenomic_tool}'"
  }

  if (!params.maltextract_taxon_list) {
    exit 1, "[nf-core/eager] error: MaltExtract requires a taxon list specifying the target taxa of interest. Specify the file with --maltextract_taxon_list."
  }
}

/////////////////////////////////////////////////////////
/* --          VALIDATE INPUT FILES                 -- */
/////////////////////////////////////////////////////////

// Set up channels for annotation file
if (!params.run_bedtools_coverage){
  ch_anno_for_bedtools = Channel.empty()
} else {
  ch_anno_for_bedtools = Channel.fromPath(params.anno_file, checkIfExists: true)
    .ifEmpty { exit 1, "[nf-core/eager] error: bedtools annotation file not found. Supplied parameter: --anno_file ${params.anno_file}."}
}

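if (params.fasta) {
    file(params.fasta, checkIfExists: true)
    // Derive the reference file name and the index base name from the path.
    // e.g. '/data/ref/genome.fa'    -> fasta_base 'genome.fa', index_base 'genome'
    lastPath = params.fasta.lastIndexOf(File.separator)
    lastExt = params.fasta.lastIndexOf(".")
    fasta_base = params.fasta.substring(lastPath+1)
    index_base = params.fasta.substring(lastPath+1,lastExt)
    // For gzipped references, strip '.gz' first and then the FASTA extension.
    // e.g. '/data/ref/genome.fa.gz' -> fasta_base 'genome.fa', index_base 'genome'
    if (params.fasta.endsWith('.gz')) {
        fasta_base = params.fasta.substring(lastPath+1,lastExt)
        index_base = fasta_base.substring(0,fasta_base.lastIndexOf("."))
    }
} else {
    exit 1, "[nf-core/eager] error: please specify --fasta with the path to your reference"
}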

// Validate reference inputs
if("${params.fasta}".endsWith(".gz")){
    process unzip_reference{
        tag "${zipped_fasta}"

        input:
        path zipped_fasta from file(params.fasta) // path doesn't like it if a string of an object is not prefaced with a root dir (/), so use file() to resolve string before parsing to `path` 

        output:
        path "$unzip" into ch_fasta_for_bwaindex,ch_fasta_for_bt2index,ch_fasta_for_faidx,ch_fasta_for_seqdict,ch_fasta_for_circulargenerator,ch_fasta_for_circularmapper,ch_fasta_for_damageprofiler,ch_fasta_for_mapdamage,ch_fasta_for_qualimap,ch_unmasked_fasta_for_masking,ch_unmasked_fasta_for_pmdtools,ch_fasta_for_genotyping_ug,ch_fasta_for_genotyping_hc,ch_fasta_for_genotyping_freebayes,ch_fasta_for_genotyping_pileupcaller,ch_fasta_for_vcf2genome,ch_fasta_for_multivcfanalyzer,ch_fasta_for_genotyping_angsd,ch_fasta_for_damagerescaling,ch_fasta_for_bcftools_stats

        script:
        unzip = zipped_fasta.toString() - '.gz'
        """
        pigz -f -d -p ${task.cpus} $zipped_fasta
        """
        }
    } else {
    fasta_for_indexing = Channel
    .fromPath("${params.fasta}", checkIfExists: true)
    .into{ ch_fasta_for_bwaindex; ch_fasta_for_bt2index; ch_fasta_for_faidx; ch_fasta_for_seqdict; ch_fasta_for_circulargenerator; ch_fasta_for_circularmapper; ch_fasta_for_damageprofiler; ch_fasta_for_mapdamage; ch_fasta_for_qualimap; ch_unmasked_fasta_for_masking; ch_unmasked_fasta_for_pmdtools; ch_fasta_for_genotyping_ug; ch_fasta_for_genotyping_hc; ch_fasta_for_genotyping_freebayes; ch_fasta_for_genotyping_pileupcaller; ch_fasta_for_vcf2genome; ch_fasta_for_multivcfanalyzer; ch_fasta_for_genotyping_angsd; ch_fasta_for_damagerescaling; ch_fasta_for_bcftools_stats }
}

// Check that fasta index file path ends in '.fai'
if (params.fasta_index && !params.fasta_index.endsWith(".fai")) {
    exit 1, "The specified fasta index file (${params.fasta_index}) is not valid. Fasta index files should end in '.fai'."
}

// Check if genome exists in the config file. params.genomes comes from conf/igenomes.config, params.genome is specified by the user
if (params.genomes && params.genome && !params.genomes.containsKey(params.genome)) {
    exit 1, "The provided genome '${params.genome}' is not available in the iGenomes file. Currently the available genomes are ${params.genomes.keySet().join(', ')}"
}

// Index files provided? Then check whether they are correct and complete
if( params.bwa_index && (params.mapper == 'bwaaln' || params.mapper == 'bwamem' || params.mapper == 'circularmapper')){
    Channel
        .fromPath(params.bwa_index, checkIfExists: true)
        .ifEmpty { exit 1, "[nf-core/eager] error: bwa indices not found in: ${params.bwa_index}." }
        .into {bwa_index; bwa_index_bwamem}

    bt2_index = Channel.empty()
}

if( params.bt2_index && params.mapper == 'bowtie2' ){
    lastPath = params.bt2_index.lastIndexOf(File.separator)
    bt2_dir =  params.bt2_index.substring(0,lastPath+1)
    bt2_base = params.bt2_index.substring(lastPath+1)

    Channel
        .fromPath(params.bt2_index, checkIfExists: true)
        .ifEmpty { exit 1, "[nf-core/eager] error: bowtie2 indices not found in: ${bt2_dir}." }
        .into {bt2_index; bt2_index_bwamem}

    bwa_index = Channel.empty()
    bwa_index_bwamem = Channel.empty()
}

// Adapter removal adapter-list setup
if ( !params.clip_adapters_list ) {
    Channel
      .fromPath("$projectDir/assets/nf-core_eager_dummy2.txt", checkIfExists: true)
      .ifEmpty { exit 1, "[nf-core/eager] error: adapters list file not found. Please check input. Supplied: --clip_adapters_list '${params.clip_adapters_list}'." }
      .collect()
      .set {ch_adapterlist}
} else {
    Channel
      .fromPath("${params.clip_adapters_list}", checkIfExists: true)
      .ifEmpty { exit 1, "[nf-core/eager] error: adapters list file not found. Please check input. Supplied: --clip_adapters_list '${params.clip_adapters_list}'." }
      .collect()
      .set {ch_adapterlist}
}

if ( params.snpcapture_bed ) {
    ch_snpcapture_bed = Channel.fromPath(params.snpcapture_bed, checkIfExists: true).collect()
} else {
    ch_snpcapture_bed = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt").collect()
}

// Set up channel with pmdtools reference mask bedfile
if (!params.pmdtools_reference_mask) {
  ch_bedfile_for_reference_masking = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt").collect()
} else {
  ch_bedfile_for_reference_masking = Channel.fromPath(params.pmdtools_reference_mask, checkIfExists: true).collect()
}

// SexDetermination channel set up and bedfile validation
if (!params.sexdeterrmine_bedfile) {
  ch_bed_for_sexdeterrmine = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt").collect()
} else {
  ch_bed_for_sexdeterrmine = Channel.fromPath(params.sexdeterrmine_bedfile, checkIfExists: true).collect()
}

// pileupCaller channel generation and input checks for 'random sampling' genotyping
if (!params.pileupcaller_bedfile) {
  ch_bed_for_pileupcaller = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt").collect()
} else {
  ch_bed_for_pileupcaller = Channel.fromPath(params.pileupcaller_bedfile, checkIfExists: true).collect()
}

if (!params.pileupcaller_snpfile) {
  ch_snp_for_pileupcaller = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy2.txt").collect()
} else {
  ch_snp_for_pileupcaller = Channel.fromPath(params.pileupcaller_snpfile, checkIfExists: true).collect()
}

// Create input channel for MALT database directory, checking directory exists
if ( !params.database ) {
    ch_db_for_malt = Channel.empty()
} else {
    ch_db_for_malt = Channel.fromPath(params.database, checkIfExists: true)
}

// Create input channel for MaltExtract taxon list, to allow downloading of taxon list, checking file exists.
if ( !params.maltextract_taxon_list ) {
    ch_taxonlist_for_maltextract = Channel.empty()
} else {
    ch_taxonlist_for_maltextract = Channel.fromPath(params.maltextract_taxon_list, checkIfExists: true)
}

// Create input channel for MaltExtract NCBI files, checking files exists.
if ( !params.maltextract_ncbifiles ) {
    ch_ncbifiles_for_maltextract = Channel.empty()
} else {
    ch_ncbifiles_for_maltextract = Channel.fromPath(params.maltextract_ncbifiles, checkIfExists: true)
}

////////////////////////////////////////////////////
/* --     Collect configuration parameters     -- */
////////////////////////////////////////////////////

// Check AWS batch settings
if (workflow.profile.contains('awsbatch')) {
    // AWSBatch sanity checking
    if (!params.awsqueue || !params.awsregion) exit 1, 'Specify correct --awsqueue and --awsregion parameters on AWSBatch!'
    // Check outdir paths to be S3 buckets if running on AWSBatch
    // related: https://github.com/nextflow-io/nextflow/issues/813
    if (!params.outdir.startsWith('s3:')) exit 1, 'Outdir not on S3 - specify S3 Bucket to run on AWSBatch!'
    // Prevent trace files from being stored on S3, since S3 does not support rolling files.
    if (params.tracedir.startsWith('s3:')) exit 1, 'Specify a local tracedir or run without trace! S3 cannot be used for tracefiles.'
}

ch_multiqc_config = file("$projectDir/assets/multiqc_config.yaml", checkIfExists: true)
ch_multiqc_custom_config = params.multiqc_config ? Channel.fromPath(params.multiqc_config, checkIfExists: true) : Channel.empty()
ch_eager_logo = file("$projectDir/docs/images/nf-core_eager_logo_outline_drop.png")
ch_output_docs = file("$projectDir/docs/output.md", checkIfExists: true)
ch_output_docs_images = file("$projectDir/docs/images/", checkIfExists: true)
where_are_my_files = file("$projectDir/assets/where_are_my_files.txt")

///////////////////////////////////////////////////
/* --    INPUT FILE LOADING AND VALIDATING    -- */
///////////////////////////////////////////////////

// Check that --input was supplied
if (!params.input) {
  exit 1, "[nf-core/eager] error: --input was not supplied! Please check '--help' or documentation under 'running the pipeline' for details"
}

// Read in files properly from TSV file
tsv_path = null
if (params.input && (has_extension(params.input, "tsv"))) tsv_path = params.input

ch_input_sample = Channel.empty()

if (tsv_path) {

    tsv_file = file(tsv_path)
    
    if (tsv_file instanceof List) exit 1, "[nf-core/eager] error: can only accept one TSV file per run."
    if (!tsv_file.exists()) exit 1, "[nf-core/eager] error: input TSV file could not be found. Does the file exist and is it in the right place? You gave the path: ${params.input}"

    ch_input_sample = extract_data(tsv_path)

} else if (params.input && !has_extension(params.input, "tsv")) {

    log.info ""
    log.info "No TSV file provided - creating TSV from supplied directory."
    log.info "Reading path(s): ${params.input}\n"
    inputSample = retrieve_input_paths(params.input, params.colour_chemistry, params.single_end, params.single_stranded, params.udg_type, params.bam)
    ch_input_sample = inputSample

} else exit 1, "[nf-core/eager] error: --input file(s) not supplied or improperly defined, see '--help' flag and documentation under 'running the pipeline' for details."

ch_input_sample
  .into { ch_input_sample_downstream; ch_input_sample_check }

///////////////////////////////////////////////////
/* --         INPUT CHANNEL CREATION          -- */
///////////////////////////////////////////////////

// Check we don't have any duplicate file names
ch_input_sample_check
    .map {
      it ->
        def r1 = file(it[8]).getName()
        def r2 = file(it[9]).getName()
        def bam = file(it[10]).getName()

        // Throw error and exit if the input bam has a name ending in '.unmapped.bam'
        if (bam.endsWith('.unmapped.bam')) { exit 1, "[nf-core/eager] error: Input BAM file names ending in '.unmapped.bam' are not allowed. Please rename your input BAM(s)." }

      [r1, r2, bam]

    }
    .collect()
    .map{
      file -> 
      filenames = file
      filenames -= 'NA'
      
      if( filenames.size() != filenames.unique().size() )
          exit 1, "[nf-core/eager] error: You have duplicate input FASTQ and/or BAM file names! All files must have unique names, different directories are not sufficient. Please check your input."
    }

// Drop samples with R1/R2 to fastQ channel, BAM samples to other channel
ch_branched_input = ch_input_sample_downstream.branch{
    fastq: it[8] != 'NA' //These are all fastqs
    bam: it[10] != 'NA' //These are all BAMs
}

//Removing BAM/BAI in case of a FASTQ input
ch_fastq_channel = ch_branched_input.fastq.map {
  samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2, bam ->
    [samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2]
}

//Removing R1/R2 in case of BAM input
ch_bam_channel = ch_branched_input.bam.map {
  samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2, bam ->
    [samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, bam]
}

// Prepare starting channels, here we go
ch_input_for_convertbam = Channel.empty()

ch_bam_channel
  .into { ch_input_for_convertbam; ch_input_for_indexbam; }

// Also need to send raw files for lane merging, in case host-removed FASTQs are requested
ch_fastq_channel
  .into { ch_input_for_skipconvertbam; ch_input_for_lanemerge_hostremovalfastq }
  
////////////////////////////////////////////////////
/* --         PRINT PARAMETER SUMMARY          -- */
////////////////////////////////////////////////////

log.info NfcoreSchema.params_summary_log(workflow, params, json_schema)

// Header log info
def summary = [:]
if (workflow.revision) summary['Pipeline Release'] = workflow.revision
summary['Run Name']         = workflow.runName
summary['Input']            = params.input
summary['Fasta Ref']        = params.fasta
summary['Max Resources']    = "$params.max_memory memory, $params.max_cpus cpus, $params.max_time time per job"
if (workflow.containerEngine) summary['Container'] = "$workflow.containerEngine - $workflow.container"
summary['Output dir']       = params.outdir
summary['Launch dir']       = workflow.launchDir
summary['Working dir']      = workflow.workDir
summary['Script dir']       = workflow.projectDir
summary['User']             = workflow.userName
if (workflow.profile.contains('awsbatch')) {
    summary['AWS Region']   = params.awsregion
    summary['AWS Queue']    = params.awsqueue
    summary['AWS CLI']      = params.awscli
}
summary['Config Profile'] = workflow.profile
if (params.config_profile_description) summary['Config Profile Description'] = params.config_profile_description
if (params.config_profile_contact)     summary['Config Profile Contact']     = params.config_profile_contact
if (params.config_profile_url)         summary['Config Profile URL']         = params.config_profile_url
summary['Config Files'] = workflow.configFiles.join(', ')
if (params.email || params.email_on_fail) {
    summary['E-mail Address']    = params.email
    summary['E-mail on failure'] = params.email_on_fail
    summary['MultiQC maxsize']   = params.max_multiqc_email_size
}

Channel.from(summary.collect{ [it.key, it.value] })
    .map { k,v -> "<dt>$k</dt><dd><samp>${v ?: '<span style=\"color:#999999;\">N/A</span>'}</samp></dd>" }
    .reduce { a, b -> return [a, b].join("\n            ") }
    .map { x -> """
    id: 'nf-core-eager-summary'
    description: " - this information is collected when the pipeline is started."
    section_name: 'nf-core/eager Workflow Summary'
    section_href: 'https://github.com/nf-core/eager'
    plot_type: 'html'
    data: |
        <dl class=\"dl-horizontal\">
            $x
        </dl>
    """.stripIndent() }
    .set { ch_workflow_summary }


// Check the hostnames against configured profiles
checkHostname()

log.info "Schaffa, Schaffa, Genome Baua!"

///////////////////////////////////////////////////
/* --          REFERENCE FASTA INDEXING       -- */
///////////////////////////////////////////////////

// BWA Index
if( !params.bwa_index && params.fasta && (params.mapper == 'bwaaln' || params.mapper == 'bwamem' || params.mapper == 'circularmapper')){
  process makeBWAIndex {
    label 'sc_medium'
    tag "${fasta}"
    publishDir path: "${params.outdir}/reference_genome/bwa_index", mode: params.publish_dir_mode, saveAs: { filename -> 
            if (params.save_reference) filename 
            else if(!params.save_reference && filename == "where_are_my_files.txt") filename
            else null
    }

    input:
    path fasta from ch_fasta_for_bwaindex
    path where_are_my_files

    output:
    path "BWAIndex" into (bwa_index, bwa_index_bwamem)
    path "where_are_my_files.txt"

    script:
    """
    bwa index $fasta
    mkdir BWAIndex && mv ${fasta}* BWAIndex
    """
    }
    
    bt2_index = Channel.empty()
}

// bowtie2 Index
if( !params.bt2_index && params.fasta && params.mapper == "bowtie2"){
  process makeBT2Index {
    label 'mc_medium'
    tag "${fasta}"
    publishDir path: "${params.outdir}/reference_genome/bt2_index", mode: params.publish_dir_mode, saveAs: { filename -> 
            if (params.save_reference) filename 
            else if(!params.save_reference && filename == "where_are_my_files.txt") filename
            else null
    }

    input:
    path fasta from ch_fasta_for_bt2index
    path where_are_my_files

    output:
    path "BT2Index" into (bt2_index)
    path "where_are_my_files.txt"

    script:
    """
    bowtie2-build --threads ${task.cpus} $fasta $fasta
    mkdir BT2Index && mv ${fasta}* BT2Index
    """
    }

  bwa_index = Channel.empty()
  bwa_index_bwamem = Channel.empty()

}

// FASTA Index (FAI)
if (params.fasta_index) {
  Channel
    .fromPath( params.fasta_index )
    .set { ch_fai_for_skipfastaindexing }
} else {
  Channel
    .empty()
    .set { ch_fai_for_skipfastaindexing }
}

process makeFastaIndex {
    label 'sc_small'
    tag "${fasta}"
    publishDir path: "${params.outdir}/reference_genome/fasta_index", mode: params.publish_dir_mode, saveAs: { filename -> 
            if (params.save_reference) filename 
            else if(!params.save_reference && filename == "where_are_my_files.txt") filename
            else null
    }
    
    when: !params.fasta_index && params.fasta

    input:
    path fasta from ch_fasta_for_faidx
    path where_are_my_files

    output:
    path "*.fai" into ch_fasta_faidx_index
    path "where_are_my_files.txt"

    script:
    """
    samtools faidx $fasta
    """
}

ch_fai_for_skipfastaindexing.mix(ch_fasta_faidx_index) 
  .into { ch_fai_for_damageprofiler; ch_fai_for_ug; ch_fai_for_hc; ch_fai_for_freebayes; ch_fai_for_pileupcaller; ch_fai_for_angsd }

// Stage sequence dictionary file if supplied, otherwise it is generated by makeSeqDict below

if (params.seq_dict) {
  Channel
    .fromPath( params.seq_dict )
    .set { ch_dict_for_skipdict }
} else {
  Channel
    .empty()
    .set { ch_dict_for_skipdict }
}

process makeSeqDict {
    label 'sc_medium'
    tag "${fasta}"
    publishDir path: "${params.outdir}/reference_genome/seq_dict", mode: params.publish_dir_mode, saveAs: { filename -> 
            if (params.save_reference) filename 
            else if(!params.save_reference && filename == "where_are_my_files.txt") filename
            else null
    }
    
    when: !params.seq_dict && params.fasta

    input:
    path fasta from ch_fasta_for_seqdict
    path where_are_my_files

    output:
    path "*.dict" into ch_seq_dict
    path "where_are_my_files.txt"

    script:
    """
    picard -Xmx${task.memory.toMega()}M CreateSequenceDictionary R=$fasta O="${fasta.baseName}.dict"
    """
}

ch_dict_for_skipdict.mix(ch_seq_dict)
  .into { ch_dict_for_ug; ch_dict_for_hc; ch_dict_for_freebayes; ch_dict_for_pileupcaller; ch_dict_for_angsd }

//////////////////////////////////////////////////
/* --         BAM INPUT PREPROCESSING        -- */
//////////////////////////////////////////////////

// Convert to FASTQ if re-mapping is requested
process convertBam {
    label 'mc_small'
    tag "$libraryid"
    
    when: 
    params.run_convertinputbam

    input: 
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, path(bam) from ch_input_for_convertbam 

    output:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, path("*fastq.gz"), val('NA') into ch_output_from_convertbam

    script:
    base = "${bam.baseName}"
    """
    samtools fastq -t ${bam} | pigz -p ${task.cpus} > ${base}.converted.fastq.gz
    """ 
}

// If not converted to FASTQ generate pipeline compatible BAM index file (i.e. with correct samtools version) 
process indexinputbam {
  label 'sc_small'
  tag "$libraryid"

  when: 
  bam != 'NA' && !params.run_convertinputbam

  input:
  tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, path(bam) from ch_input_for_indexbam 

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), file("*.{bai,csi}")  into ch_indexbam_for_filtering

  script:
  def size = params.large_ref ? '-c' : ''
  """
  samtools index ${bam} ${size}
  """
}

// convertbam bypass
    ch_input_for_skipconvertbam.mix(ch_output_from_convertbam)
        .into { ch_convertbam_for_fastp; ch_convertbam_for_fastqc } 

//////////////////////////////////////////////////
/* -- SEQUENCING QC AND FASTQ PREPROCESSING  -- */
//////////////////////////////////////////////////

// Raw sequencing QC - lets the user evaluate whether the sequencing run is any good

process fastqc {
    label 'mc_small'
    tag "${libraryid}_L${lane}"
    publishDir "${params.outdir}/fastqc/input_fastq", mode: params.publish_dir_mode,
        saveAs: { filename ->
                      filename.indexOf(".zip") > 0 ? "zips/$filename" : "$filename"
                }


    input:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_convertbam_for_fastqc

    output:
    path "*_fastqc.{zip,html}" into ch_prefastqc_for_multiqc

    when: 
    !params.skip_fastqc

    script:
    if ( seqtype == 'PE' ) {
    """
    fastqc -t ${task.cpus} -q $r1 $r2
    rename 's/_fastqc\\.zip\$/_raw_fastqc.zip/' *_fastqc.zip
    rename 's/_fastqc\\.html\$/_raw_fastqc.html/' *_fastqc.html
    """
    } else {
    """
    fastqc -t ${task.cpus} -q $r1
    rename 's/_fastqc\\.zip\$/_raw_fastqc.zip/' *_fastqc.zip
    rename 's/_fastqc\\.html\$/_raw_fastqc.html/' *_fastqc.html
    """
    }
}

// Poly-G clipping for 2-colour chemistry sequencers, to reduce erroneous mapping of sequencing artefacts

if (params.complexity_filter_poly_g) {
  ch_input_for_fastp = ch_convertbam_for_fastp.branch{
    twocol: it[3] == '2' // Nextseq/Novaseq data with possible sequencing artefact
    fourcol: it[3] == '4'  // HiSeq/MiSeq data where polyGs would be true
  }

} else {
  ch_input_for_fastp = ch_convertbam_for_fastp.branch{
    twocol: it[3] == "dummy" // Never matches: poly-G clipping is off, so nothing is sent to fastp
    fourcol: it[3] == '4' || it[3] == '2'  // All data (2- and 4-colour chemistry) bypass fastp
  }

}

process fastp {
    label 'mc_small'
    tag "${libraryid}_L${lane}"
    publishDir "${params.outdir}/FastP", mode: params.publish_dir_mode

    when: 
    params.complexity_filter_poly_g

    input:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_input_for_fastp.twocol

    output:
    tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, path("*.pG.fq.gz") into ch_output_from_fastp
    path("*.json") into ch_fastp_for_multiqc

    script:
    if( seqtype == 'SE' ){
    """
    fastp --in1 ${r1} --out1 "${r1.baseName}.pG.fq.gz" -A -g --poly_g_min_len "${params.complexity_filter_poly_g_min}" -Q -L -w ${task.cpus} --json "${r1.baseName}"_L${lane}_fastp.json 
    """
    } else {
    """
    fastp --in1 ${r1} --in2 ${r2} --out1 "${r1.baseName}.pG.fq.gz" --out2 "${r2.baseName}.pG.fq.gz" -A -g --poly_g_min_len "${params.complexity_filter_poly_g_min}" -Q -L -w ${task.cpus} --json "${libraryid}"_L${lane}_polyg_fastp.json 
    """
    }
}

// Colour column only useful for fastp, so dropping now to reduce complexity downstream
ch_input_for_fastp.fourcol
  .map {
      def samplename = it[0]
      def libraryid  = it[1]
      def lane = it[2]
      def seqtype = it[4]
      def organism = it[5]
      def strandedness = it[6]
      def udg = it[7]
      def r1 = it[8]
      def r2 = seqtype == "PE" ? it[9] : file("$projectDir/assets/nf-core_eager_dummy.txt")
      
      [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]

    }
  .set { ch_skipfastp_for_merge }

ch_output_from_fastp
  .map{
    def samplename = it[0]
    def libraryid  = it[1]
    def lane = it[2]
    def seqtype = it[4]
    def organism = it[5]
    def strandedness = it[6]
    def udg = it[7]
    def r1 = it[8] instanceof ArrayList ? it[8].sort()[0] : it[8]
    def r2 = seqtype == "PE" ? it[8].sort()[1] : file("$projectDir/assets/nf-core_eager_dummy.txt")

    [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]

  }
  .set{ ch_fastp_for_merge }

ch_skipfastp_for_merge.mix(ch_fastp_for_merge)
  .into { ch_fastp_for_adapterremoval; ch_fastp_for_skipadapterremoval } 

// Sequencing adapter clipping and optional paired-end merging in preparation for mapping

process adapter_removal {
    label 'mc_small'
    tag "${libraryid}_L${lane}"
    publishDir "${params.outdir}/adapterremoval", mode: params.publish_dir_mode

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_fastp_for_adapterremoval
    path adapterlist from ch_adapterlist.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("output/*{combined.fq,.se.truncated,pair1.truncated}.gz") into ch_output_from_adapterremoval_r1
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("output/*pair2.truncated.gz") optional true into ch_output_from_adapterremoval_r2
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("output/*.settings") into ch_adapterremoval_logs
    
    when: 
    !params.skip_adapterremoval

    script:
    def base = "${r1.baseName}_L${lane}"
    def adapters_to_remove = !params.clip_adapters_list ? "--adapter1 ${params.clip_forward_adaptor} --adapter2 ${params.clip_reverse_adaptor}" : "--adapter-list ${adapterlist}"
    // Build the optional --preserve5p flag; it is passed to every AdapterRemoval command below
    def preserve5p = params.preserve5p ? '--preserve5p' : ''
    
    // PE mode, collapse and trim, outputting all reads
    if ( seqtype == 'PE'  && !params.skip_collapse && !params.skip_trim  && !params.mergedonly && !params.preserve5p ) {
    """
    mkdir -p output

    AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap}

    cat *.collapsed.gz *.collapsed.truncated.gz *.singleton.truncated.gz *.pair1.truncated.gz *.pair2.truncated.gz > output/${base}.pe.combined.tmp.fq.gz
    
    mv *.settings output/

    ## Add R_ and L_ for unmerged reads for DeDup compatibility
    AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz
    
    """
    //PE mode, collapse and trim, outputting all reads, preserving 5p
    } else if (seqtype == 'PE'  && !params.skip_collapse && !params.skip_trim  && !params.mergedonly && params.preserve5p) {
    """
    mkdir -p output

    AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap}

    cat *.collapsed.gz *.singleton.truncated.gz *.pair1.truncated.gz *.pair2.truncated.gz > output/${base}.pe.combined.tmp.fq.gz

    mv *.settings output/

    ## Add R_ and L_ for unmerged reads for DeDup compatibility
    AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz

    """
    // PE mode, collapse and trim but only output collapsed reads
    } else if ( seqtype == 'PE'  && !params.skip_collapse && !params.skip_trim && params.mergedonly && !params.preserve5p ) {
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe  --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap}
    
    cat *.collapsed.gz *.collapsed.truncated.gz > output/${base}.pe.combined.tmp.fq.gz
        
    ## Add R_ and L_ for unmerged reads for DeDup compatibility
    AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz

    mv *.settings output/
    """
    // PE mode, collapse and trim but only output collapsed reads, preserving 5p
    } else if ( seqtype == 'PE'  && !params.skip_collapse && !params.skip_trim && params.mergedonly && params.preserve5p ) {
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe  --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap}
    
    cat *.collapsed.gz > output/${base}.pe.combined.tmp.fq.gz
    
    ## Add R_ and L_ for unmerged reads for DeDup compatibility
    AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz

    mv *.settings output/
    """
    // PE mode, collapsing but skip trim (output all reads). Note: AdapterRemoval still appears to generate `truncated` files, so these are merged for safety.
    // AdapterRemoval's default length filtering presumably still applies
    } else if ( seqtype == 'PE'  && !params.skip_collapse && params.skip_trim && !params.mergedonly ) {
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p} --adapter1 "" --adapter2 ""
    
    cat *.collapsed.gz *.pair1.truncated.gz *.pair2.truncated.gz > output/${base}.pe.combined.tmp.fq.gz
        
    ## Add R_ and L_ for unmerged reads for DeDup compatibility
    AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz

    mv *.settings output/
    """
    // PE mode, collapsing but skip trim, and only output collapsed reads. Note: AdapterRemoval still appears to generate `truncated` files, but only collapsed reads are kept here.
    // AdapterRemoval's default length filtering presumably still applies
    } else if ( seqtype == 'PE'  && !params.skip_collapse && params.skip_trim && params.mergedonly ) {
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} --collapse ${preserve5p}  --adapter1 "" --adapter2 ""
    
    cat *.collapsed.gz > output/${base}.pe.combined.tmp.fq.gz
    
    ## Add R_ and L_ for unmerged reads for DeDup compatibility
    AdapterRemovalFixPrefix -Xmx${task.memory.toGiga()}g output/${base}.pe.combined.tmp.fq.gz | pigz -p ${task.cpus - 1} > output/${base}.pe.combined.fq.gz

    mv *.settings output/
    """
    // PE mode, skip collapsing but trim (output all reads, as merging not possible) - activates paired-end mapping!
    } else if ( seqtype == 'PE'  && params.skip_collapse && !params.skip_trim ) {
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --file2 ${r2} --basename ${base}.pe --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap}
    
    mv ${base}.pe.pair*.truncated.gz *.settings output/
    """
    } else if ( seqtype != 'PE' && !params.skip_trim ) {
    //SE, collapse not possible, trim reads only
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --basename ${base}.se --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} ${preserve5p} --trimns --trimqualities ${adapters_to_remove} --minlength ${params.clip_readlength} --minquality ${params.clip_min_read_quality} --minadapteroverlap ${params.min_adap_overlap}
    mv *.settings *.se.truncated.gz output/
    """
    } else if ( seqtype != 'PE' && params.skip_trim ) {
    //SE, collapse not possible, skip trimming (empty adapter sequences supplied)
    """
    mkdir -p output
    AdapterRemoval --file1 ${r1} --basename ${base}.se --gzip --threads ${task.cpus} --qualitymax ${params.qualitymax} ${preserve5p} --adapter1 "" --adapter2 ""
    mv *.settings *.se.truncated.gz output/
    """
    }
}

// When not collapsing paired-end data, re-merge the R1 and R2 files into single map. Otherwise if SE or collapsed PE, R2 now becomes NA
// Sort to make sure we get consistent R1 and R2 ordered when using `-resume`, even if not needed for FastQC
if ( params.skip_collapse ){
  ch_output_from_adapterremoval_r1
    .mix(ch_output_from_adapterremoval_r2)
    .groupTuple(by: [0,1,2,3,4,5,6])
    .map{
      it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = file(it[7].sort()[0])
        def r2 = seqtype == "PE" ? file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt")

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]

    }
    .into { ch_output_from_adapterremoval; ch_adapterremoval_for_postfastqc }
} else {
  ch_output_from_adapterremoval_r1
    .map{
      it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = file(it[7])
        def r2 = file("$projectDir/assets/nf-core_eager_dummy.txt")

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]
    }
    .into { ch_output_from_adapterremoval; ch_adapterremoval_for_postfastqc }
}

// AdapterRemoval bypass when not running it
if (!params.skip_adapterremoval) {
    ch_output_from_adapterremoval.mix(ch_fastp_for_skipadapterremoval)
        .filter { it =~/.*combined.fq.gz|.*truncated.gz/ }
        .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; } 
} else {
    ch_fastp_for_skipadapterremoval
        .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; } 
}

// Post AR fastq trimming

process post_ar_fastq_trimming {
  label 'mc_small'
  tag "${libraryid}"
  publishDir "${params.outdir}/post_ar_fastq_trimmed", mode: params.publish_dir_mode

  when: params.run_post_ar_trimming

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_adapterremoval_for_post_ar_trimming

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R1_postartrimmed.fq.gz") into ch_post_ar_trimming_for_lanemerge_r1
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R2_postartrimmed.fq.gz") optional true into ch_post_ar_trimming_for_lanemerge_r2

  script:
  if ( seqtype == 'SE' || (seqtype == 'PE' && !params.skip_collapse) ) {
  """
  fastp --in1 ${r1} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} -A -G -Q -L -w ${task.cpus} --out1 "${libraryid}"_L"${lane}"_R1_postartrimmed.fq.gz
  """
  } else if ( seqtype == 'PE' && params.skip_collapse ) {
  """
  fastp --in1 ${r1} --in2 ${r2}  --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} --trim_front2 ${params.post_ar_trim_front2} --trim_tail2 ${params.post_ar_trim_tail2} -A -G -Q -L -w ${task.cpus} --out1 "${libraryid}"_L"${lane}"_R1_postartrimmed.fq.gz --out2 "${libraryid}"_L"${lane}"_R2_postartrimmed.fq.gz
  """
  }

}

// When not collapsing paired-end data, re-merge the R1 and R2 files into single map. Otherwise if SE or collapsed PE, R2 now becomes NA
// Sort to make sure we get consistent R1 and R2 ordered when using `-resume`, even if not needed for FastQC
if ( params.skip_collapse ){
  ch_post_ar_trimming_for_lanemerge_r1
    .mix(ch_post_ar_trimming_for_lanemerge_r2)
    .groupTuple(by: [0,1,2,3,4,5,6])
    .map{
      it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = file(it[7].sort()[0])
        def r2 = seqtype == "PE" ? file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt")

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]

    }
    .set { ch_post_ar_trimming_for_lanemerge; }
} else {
  ch_post_ar_trimming_for_lanemerge_r1
    .map{
      it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = file(it[7])
        def r2 = file("$projectDir/assets/nf-core_eager_dummy.txt")

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]
    }
    .set { ch_post_ar_trimming_for_lanemerge; }
}


// Post-AR trimming bypass when not running it (channel names keep the older 'inlinebarcoderemoval' label)
if (params.run_post_ar_trimming) {
    ch_post_ar_trimming_for_lanemerge
        .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } 
} else {
    ch_adapterremoval_for_skip_post_ar_trimming
        .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } 
}

// Lane merging for libraries sequenced over multiple lanes (e.g. NextSeq)
ch_branched_for_lanemerge = ch_inlinebarcoderemoval_for_lanemerge
  .groupTuple(by: [0,1,3,4,5,6])
  .map {
    it ->
      def samplename = it[0]
      def libraryid  = it[1]
      def lane = it[2]
      def seqtype = it[3]
      def organism = it[4]
      def strandedness = it[5]
      def udg = it[6]
      def r1 = it[7]
      def r2 = it[8]

      [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]

  }
  .branch {
    skip_merge: it[7].size() == 1 // Can skip merging if only single lanes
    merge_me: it[7].size() > 1
  }

ch_branched_for_lanemerge_skipme = ch_branched_for_lanemerge.skip_merge
  .map{
    it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = it[7][0]
        def r2 = it[8][0]

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]
  }


ch_branched_for_lanemerge_ready = ch_branched_for_lanemerge.merge_me
  .map{
      it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = it[7]

        // find and remove duplicate dummies to prevent file collision error
        def r2 = it[8]*.toString()
        r2.removeAll{ it == "$projectDir/assets/nf-core_eager_dummy.txt" }

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]
  }

process lanemerge {
  label 'sc_tiny'
  tag "${libraryid}"
  publishDir "${params.outdir}/lanemerging", mode: params.publish_dir_mode

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_branched_for_lanemerge_ready

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R1_lanemerged.fq.gz") into ch_lanemerge_for_mapping_r1
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R2_lanemerged.fq.gz") optional true into ch_lanemerge_for_mapping_r2

  script:
  if ( seqtype == 'PE' && ( params.skip_collapse || params.skip_adapterremoval ) ){
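  // Lanes are merged here, so reset the lane value emitted in the output tuple
  // (no `def`, so the task-context variable is updated, as in lanemerge_hostremoval_fastq)
  lane = 0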
  """
  cat ${r1} > "${libraryid}"_R1_lanemerged.fq.gz
  cat ${r2} > "${libraryid}"_R2_lanemerged.fq.gz
  """
  } else {
  """
  cat ${r1} > "${libraryid}"_R1_lanemerged.fq.gz
  """
  }

}

// Ensuring always valid R2 file even if doesn't exist for AWS
if ( ( params.skip_collapse || params.skip_adapterremoval ) ) {
  ch_lanemerge_for_mapping_r1
    .mix(ch_lanemerge_for_mapping_r2)
    .groupTuple(by: [0,1,2,3,4,5,6])
    .map{
      it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = file(it[7].sort()[0])
        def r2 = seqtype == "PE" ? file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt")

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]

    }
    .mix(ch_branched_for_lanemerge_skipme)
    .into { ch_lanemerge_for_skipmap; ch_lanemerge_for_bwa; ch_lanemerge_for_cm; ch_lanemerge_for_bwamem; ch_lanemerge_for_bt2 }
} else {
  ch_lanemerge_for_mapping_r1
    .map{
      it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = file(it[7])
        def r2 = file("$projectDir/assets/nf-core_eager_dummy.txt")

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]
    }
    .mix(ch_branched_for_lanemerge_skipme)
    .into { ch_lanemerge_for_skipmap; ch_lanemerge_for_bwa; ch_lanemerge_for_cm; ch_lanemerge_for_bwamem; ch_lanemerge_for_bt2 }
}

// ENA upload doesn't do separate lanes, so merge raw FASTQs for mapped-reads removal 

// Per-library lane grouping done within process
process lanemerge_hostremoval_fastq {
  label 'sc_tiny'
  tag "${libraryid}"

  when: 
  params.hostremoval_input_fastq

  input:
  tuple samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_input_for_lanemerge_hostremovalfastq.groupTuple(by: [0,1,3,4,5,6,7])

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.fq.gz") into ch_fastqlanemerge_for_hostremovalfastq

  script:
  if ( seqtype == 'PE' ){
  lane = 0
  """
  cat ${r1} > "${libraryid}"_R1_lanemerged.fq.gz
  cat ${r2} > "${libraryid}"_R2_lanemerged.fq.gz
  """
  } else {
  """
  cat ${r1} > "${libraryid}"_R1_lanemerged.fq.gz
  """
  }

}

// Post-preprocessing QC to help the user check that pre-processing removed all sequencing artefacts. If post-AR trimming is run, its effect is included in this output.

process fastqc_after_clipping {
    label 'mc_small'
    tag "${libraryid}_L${lane}"
    publishDir "${params.outdir}/fastqc/after_clipping", mode: params.publish_dir_mode,
        saveAs: { filename ->
                      filename.indexOf(".zip") > 0 ? "zips/$filename" : "$filename"
                }


    when: !params.skip_adapterremoval && !params.skip_fastqc

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_inlinebarcoderemoval_for_fastqc_after_clipping

    output:
    path("*_fastqc.{zip,html}") into ch_fastqc_after_clipping

    script:
    if ( params.skip_collapse && seqtype == 'PE' ) {
    """
    fastqc -t ${task.cpus} -q ${r1} ${r2}
    """
    } else {
    """
    fastqc -t ${task.cpus} -q ${r1}
    """
    }

}

//////////////////////////////////////////////////
/* --    READ MAPPING AND POSTPROCESSING     -- */
//////////////////////////////////////////////////

// bwa aln as standard aDNA mapper

process bwa {
    label 'mc_medium'
    tag "${libraryid}"
    publishDir "${params.outdir}/mapping/bwa", mode: params.publish_dir_mode

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_lanemerge_for_bwa
    path index from bwa_index.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.mapped.bam"), path("*.{bai,csi}") into ch_output_from_bwa   

    when: 
    params.mapper == 'bwaaln'

    script:
    def size = params.large_ref ? '-c' : ''
    def fasta = "${index}/${fasta_base}"

    //PE data without merging, PE data without any AR applied
    if ( seqtype == 'PE' && ( params.skip_collapse || params.skip_adapterremoval ) ){
    """
    bwa aln -t ${task.cpus} $fasta ${r1} -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -o ${params.bwaalno} -f ${libraryid}.r1.sai
    bwa aln -t ${task.cpus} $fasta ${r2} -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -o ${params.bwaalno} -f ${libraryid}.r2.sai
    bwa sampe -r "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $fasta ${libraryid}.r1.sai ${libraryid}.r2.sai ${r1} ${r2} | samtools sort -@ ${task.cpus - 1} -O bam - > ${libraryid}_"${seqtype}".mapped.bam
    samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size}
    """
    } else {
    //PE collapsed, or SE data
    """
    bwa aln -t ${task.cpus} ${fasta} ${r1} -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -o ${params.bwaalno} -f ${libraryid}.sai
    bwa samse -r "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $fasta ${libraryid}.sai $r1 | samtools sort -@ ${task.cpus - 1} -O bam - > "${libraryid}"_"${seqtype}".mapped.bam
    samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size}
    """
    }
    
}

// bwa mem for mapping more complex or modern data

process bwamem {
    label 'mc_medium'
    tag "$libraryid"
    publishDir "${params.outdir}/mapping/bwamem", mode: params.publish_dir_mode

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_lanemerge_for_bwamem
    path index from bwa_index_bwamem.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.mapped.bam"), path("*.{bai,csi}") into ch_output_from_bwamem

    when: 
    params.mapper == 'bwamem'

    script:
    def split_cpus = Math.floor(task.cpus/2)
    def fasta = "${index}/${fasta_base}"
    def size = params.large_ref ? '-c' : ''

    if (!params.single_end && params.skip_collapse){
    """
    bwa mem -t ${split_cpus} $fasta $r1 $r2 -R "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" | samtools sort -@ ${split_cpus} -O bam - > "${libraryid}"_"${seqtype}".mapped.bam
    samtools index ${size} -@ ${task.cpus} "${libraryid}"_"${seqtype}".mapped.bam
    """
    } else {
    """
    bwa mem -t ${split_cpus} $fasta $r1 -R "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" | samtools sort -@ ${split_cpus} -O bam - > "${libraryid}"_"${seqtype}".mapped.bam
    samtools index -@ ${task.cpus} "${libraryid}"_"${seqtype}".mapped.bam ${size} 
    """
    }
    
}

// CircularMapper reference preparation and mapping for circular genomes e.g. mtDNA

process circulargenerator{
    label 'sc_medium'
    tag "$prefix"
    publishDir "${params.outdir}/reference_genome/circularmapper_index", mode: params.publish_dir_mode, saveAs: { filename -> 
            if (params.save_reference) filename 
            else if(!params.save_reference && filename == "where_are_my_files.txt") filename
            else null
    }

    input:
    file fasta from ch_fasta_for_circulargenerator

    output:
    file "${prefix}.{amb,ann,bwt,sa,pac}" into ch_circularmapper_indices
    file "*_elongated" into ch_circularmapper_elongatedfasta

    when: 
    params.mapper == 'circularmapper'

    script:
    prefix = "${fasta.baseName}_${params.circularextension}.${fasta.extension}"
    """
    circulargenerator -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i $fasta -s ${params.circulartarget}
    bwa index $prefix
    """

}

process circularmapper{
    label 'mc_medium'
    tag "$libraryid"
    publishDir "${params.outdir}/mapping/circularmapper", mode: params.publish_dir_mode

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_lanemerge_for_cm
    file index from ch_circularmapper_indices.collect()
    file fasta from ch_fasta_for_circularmapper.collect()
    file elongated from ch_circularmapper_elongatedfasta.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.mapped.bam"), file("*.{bai,csi}") into ch_output_from_cm

    when: 
    params.mapper == 'circularmapper'

    script:
    def filter = params.circularfilter ? '-f true -x true' : ''
    def elongated_root = "${fasta.baseName}_${params.circularextension}.${fasta.extension}"
    def size = params.large_ref ? '-c' : ''

    if (!params.single_end && params.skip_collapse ){
    """
    bwa aln -t ${task.cpus} $elongated_root $r1 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.r1.sai
    bwa aln -t ${task.cpus} $elongated_root $r2 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.r2.sai
    bwa sampe -r "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $elongated_root ${libraryid}.r1.sai ${libraryid}.r2.sai $r1 $r2 > tmp.out
    realignsamfile -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i tmp.out -r $fasta $filter 
    samtools sort -@ ${task.cpus} -O bam tmp_realigned.bam > ${libraryid}_"${seqtype}".mapped.bam
    samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} 
    """
    } else {
    """ 
    bwa aln -t ${task.cpus} $elongated_root $r1 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.sai
    bwa samse -r "@RG\\tID:ILLUMINA-${samplename}_${libraryid}\\tSM:${samplename}\\tLB:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $elongated_root ${libraryid}.sai $r1 > tmp.out
    realignsamfile -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i tmp.out -r $fasta $filter 
    samtools sort -@ ${task.cpus} -O bam tmp_realigned.bam > "${libraryid}"_"${seqtype}".mapped.bam
    samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size}
    """
    }
    
}

process bowtie2 {
    label 'mc_medium'
    tag "${libraryid}"
    publishDir "${params.outdir}/mapping/bt2", mode: params.publish_dir_mode

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_lanemerge_for_bt2
    path index from bt2_index.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.mapped.bam"), path("*.{bai,csi}") into ch_output_from_bt2
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_bt2.log") into ch_bt2_for_multiqc

    when: 
    params.mapper == 'bowtie2'

    script:
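    // Split CPUs between bowtie2 and samtools sort; use an integer value and keep at least one thread for each
    def split_cpus = Math.max(1, (task.cpus / 2) as int)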
    def size = params.large_ref ? '-c' : ''
    def fasta = "${index}/${fasta_base}"
    def trim5 = params.bt2_trim5 != 0 ? "--trim5 ${params.bt2_trim5}" : ""
    def trim3 = params.bt2_trim3 != 0 ? "--trim3 ${params.bt2_trim3}" : ""
    def bt2n = params.bt2n != 0 ? "-N ${params.bt2n}" : ""
    def bt2l = params.bt2l != 0 ? "-L ${params.bt2l}" : ""

    if ( "${params.bt2_alignmode}" == "end-to-end"  ) {
      switch ( "${params.bt2_sensitivity}" ) {
        case "no-preset":
        sensitivity = ""; break
        case "very-fast":
        sensitivity = "--very-fast"; break
        case "fast":
        sensitivity = "--fast"; break
        case "sensitive":
        sensitivity = "--sensitive"; break
        case "very-sensitive":
        sensitivity = "--very-sensitive"; break
        default:
        sensitivity = ""; break
        }
      } else if ("${params.bt2_alignmode}" == "local") {
      switch ( "${params.bt2_sensitivity}" ) {
        case "no-preset":
        sensitivity = ""; break
        case "very-fast":
        sensitivity = "--very-fast-local"; break
        case "fast":
        sensitivity = "--fast-local"; break
        case "sensitive":
        sensitivity = "--sensitive-local"; break
        case "very-sensitive":
        sensitivity = "--very-sensitive-local"; break
        default:
        sensitivity = ""; break

        }
      }

    // Paired-end data that was not collapsed, or for which AdapterRemoval was skipped entirely
    if ( seqtype == 'PE' && ( params.skip_collapse || params.skip_adapterremoval ) ){
    """
    bowtie2 -x ${fasta} -1 ${r1} -2 ${r2} -p ${split_cpus} ${sensitivity} ${bt2n} ${bt2l} ${trim5} ${trim3} --maxins ${params.bt2_maxins} --rg-id ILLUMINA-${samplename}_${libraryid} --rg SM:${samplename} --rg LB:${libraryid} --rg PL:illumina --rg PU:ILLUMINA-${libraryid}-${seqtype} 2> "${libraryid}"_bt2.log | samtools sort -@ ${split_cpus} -O bam > "${libraryid}"_"${seqtype}".mapped.bam
    samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size}
    """
    } else {
    //PE collapsed, or SE data 
    """
    bowtie2 -x ${fasta} -U ${r1} -p ${split_cpus} ${sensitivity} ${bt2n} ${bt2l} ${trim5} ${trim3} --rg-id ILLUMINA-${samplename}_${libraryid} --rg SM:${samplename} --rg LB:${libraryid} --rg PL:illumina --rg PU:ILLUMINA-${libraryid}-${seqtype} 2> "${libraryid}"_bt2.log | samtools sort -@ ${split_cpus} -O bam > "${libraryid}"_"${seqtype}".mapped.bam
    samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size}
    """
    }
    
}

// Gather all mapped BAMs from all possible mappers into common channels to send downstream
ch_output_from_bwa.mix(ch_output_from_bwamem, ch_output_from_cm, ch_indexbam_for_filtering, ch_output_from_bt2)
  .into { ch_mapping_for_hostremovalfastq; ch_mapping_for_seqtype_merging }

// Synchronise the mapped input FASTQ and input non-remapped BAM channels
ch_fastqlanemerge_for_hostremovalfastq
    .map {
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = seqtype == "PE" ? file(it[7].sort()[0]) : file(it[7])
        def r2 = seqtype == "PE" ? file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt")

        [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ]

    }
    .mix(ch_mapping_for_hostremovalfastq)
    .groupTuple(by: [0,1,3,4,5,6])
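    .map {
        // After groupTuple, elements 7 and 8 are lists. The code below takes
        // position 0 as the FASTQ entry (r1/r2) and position 1 as the mapped
        // entry (bam/bai), i.e. it relies on the FASTQ tuple arriving first.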
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def r1 = it[7][0]
        def r2 = it[8][0]
        def bam = it[7][1]
        def bai = it[8][1]

      [ samplename, libraryid, seqtype, organism, strandedness, udg, r1, r2, bam, bai ]

    }
    .filter{ it[8] != null }
    .set { ch_synced_for_hostremovalfastq }

// Remove mapped reads from original (lane merged) input FASTQ e.g. for sensitive host data when running metagenomic data

process hostremoval_input_fastq {
    label 'mc_medium'
    tag "${libraryid}"
    publishDir "${params.outdir}/hostremoved_fastq", mode: params.publish_dir_mode

    when: 
    params.hostremoval_input_fastq

    input: 
    tuple samplename, libraryid, seqtype, organism, strandedness, udg, file(r1), file(r2), file(bam), file(bai) from ch_synced_for_hostremovalfastq

    output:
    tuple samplename, libraryid, seqtype, organism, strandedness, udg, file("*.fq.gz") into ch_output_from_hostremovalfastq

    script:
    def merged = params.skip_collapse ? "": "-merged"
    if ( seqtype == 'SE' ) {
        out_fwd = bam.baseName+'.hostremoved.fq.gz'
        """
        samtools index $bam
        extract_map_reads.py $bam ${r1} -m ${params.hostremoval_mode} $merged -of $out_fwd -t ${task.cpus} 
        """
    } else {
        out_fwd = bam.baseName+'.hostremoved.fwd.fq.gz'
        out_rev = bam.baseName+'.hostremoved.rev.fq.gz'
        """
        samtools index $bam
        extract_map_reads.py $bam ${r1} -rev ${r2} -m ${params.hostremoval_mode} $merged -of $out_fwd -or $out_rev -t ${task.cpus}
        """ 
    }
    
}

// Seqtype merging to combine paired-end with single-end sequencing data of the same library
// goes here; the result goes into flagstat, filter etc. Important: this type of merge isn't technically valid for DeDup
// and should only be used with markduplicates!
ch_branched_for_seqtypemerge = ch_mapping_for_seqtype_merging
  .groupTuple(by: [0,1,4,5,6])
  .map {
    it ->
      def samplename = it[0]
      def libraryid  = it[1]
      def lane = 0
      def seqtype = it[3].unique() // May contain both PE and SE; collapsed to a single value below
      def organism = it[4]
      def strandedness = it[5]
      def udg = it[6]
      def r1 = it[7]
      def r2 = it[8]

      // 1. We will assume if mixing it is better to set as PE as this is informative
      // for DeDup (and markduplicates doesn't care), but will throw a warning!
      // 2. We will also flatten to a single value to address problems with 'unstable' 
      // Nextflow ArrayBag object types not allowing the .join to work between resumes
      // See: https://github.com/nf-core/eager/issues/880

      def seqtype_new = seqtype.flatten().size() > 1 ? 'PE' : seqtype.flatten()[0] 
                      
      if ( seqtype.flatten().size() > 1 &&  params.dedupper == 'dedup' ) {
        log.warn "[nf-core/eager] Warning: you are running DeDup on BAMs with a mixture of PE/SE data for library: ${libraryid}. DeDup is designed for PE data only, deduplication may be suboptimal!"
      }
      
      [ samplename, libraryid, lane, seqtype_new, organism, strandedness, udg, r1, r2 ]

  }
  .branch {
    skip_merge: it[7].size() == 1 // Can skip merging if only single lanes
    merge_me: it[7].size() > 1
  }

  process seqtype_merge {

    label 'sc_tiny'
    tag "$libraryid"

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_branched_for_seqtypemerge.merge_me

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*_seqtypemerged.bam"), file("*_seqtypemerged*.{bai,csi}")  into ch_seqtypemerge_for_filtering

    script:
    def size = params.large_ref ? '-c' : ''
    """
    samtools merge ${libraryid}_seqtypemerged.bam ${bam}
    samtools index ${libraryid}_seqtypemerged.bam ${size}
    """
    
  }

ch_seqtypemerge_for_filtering
  .mix(ch_branched_for_seqtypemerge.skip_merge)
  .into { ch_seqtypemerged_for_skipfiltering; ch_seqtypemerged_for_samtools_filter; ch_seqtypemerged_for_samtools_flagstat } 

// Post-mapping QC

process samtools_flagstat {
    label 'sc_tiny'
    tag "$libraryid"
    publishDir "${params.outdir}/samtools/stats", mode: params.publish_dir_mode

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_seqtypemerged_for_samtools_flagstat


    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*stats") into ch_flagstat_for_multiqc,ch_flagstat_for_endorspy

    script:
    """
    samtools flagstat $bam > ${libraryid}_flagstat.stats
    """
}


// BAM filtering e.g. to extract unmapped reads for downstream or stricter mapping quality

process samtools_filter {
    label 'mc_medium'
    tag "$libraryid"
    publishDir "${params.outdir}/samtools/filter", mode: params.publish_dir_mode,
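    saveAs: {filename ->
            // NOTE: the last check lacks a `> 0` comparison, so a non-match (indexOf() == -1) is still
            // truthy in Groovy and any remaining output (e.g. index files, *.unmapped.fastq.gz) is published as well.
            if (filename.indexOf(".fq.gz") > 0) "$filename"
            else if (filename.indexOf(".unmapped.bam") > 0) "$filename"
            else if (filename.indexOf(".filtered.bam")) "$filename"
            else null
    }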

    when: 
    params.run_bam_filtering

    input: 
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_seqtypemerged_for_samtools_filter

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*filtered.bam"), file("*.{bai,csi}") into ch_output_from_filtering
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.unmapped.fastq.gz") optional true into ch_bam_filtering_for_metagenomic,ch_metagenomic_for_skipentropyfilter
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.unmapped.bam") optional true

    script:
    
    def size = params.large_ref ? '-c' : ''
    
    // Unmapped/MAPQ Filtering WITHOUT min-length filtering
    if ( "${params.bam_unmapped_type}" == "keep"  && params.bam_filter_minreadlength == 0 ) {
        """
        samtools view -h ${bam} -@ ${task.cpus} -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam
        samtools index ${libraryid}.filtered.bam ${size}
        """
    } else if ( "${params.bam_unmapped_type}" == "discard" && params.bam_filter_minreadlength == 0 ){
        """
        samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam
        samtools index ${libraryid}.filtered.bam ${size}
        """
    } else if ( "${params.bam_unmapped_type}" == "bam" && params.bam_filter_minreadlength == 0 ){
        """
        samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam
        samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam
        samtools index ${libraryid}.filtered.bam ${size}
        """
    } else if ( "${params.bam_unmapped_type}" == "fastq" && params.bam_filter_minreadlength == 0 ){
        """
        samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam
        samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam
        samtools index ${libraryid}.filtered.bam ${size}

        ## FASTQ
        samtools fastq -tN ${libraryid}.unmapped.bam | pigz -p ${task.cpus - 1} > ${libraryid}.unmapped.fastq.gz
        rm ${libraryid}.unmapped.bam
        """
    } else if ( "${params.bam_unmapped_type}" == "both" && params.bam_filter_minreadlength == 0 ){
        """
        samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam
        samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > ${libraryid}.filtered.bam
        samtools index ${libraryid}.filtered.bam ${size}
        
        ## FASTQ
        samtools fastq -tN ${libraryid}.unmapped.bam | pigz -p ${task.cpus - 1} > ${libraryid}.unmapped.fastq.gz
        """
    // Unmapped/MAPQ Filtering WITH min-length filtering
    } else if ( "${params.bam_unmapped_type}" == "keep" && params.bam_filter_minreadlength != 0 ) {
        """
        samtools view -h ${bam} -@ ${task.cpus} -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam
        filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam
        samtools index ${libraryid}.filtered.bam ${size}
        """
    } else if ( "${params.bam_unmapped_type}" == "discard" && params.bam_filter_minreadlength != 0 ){
        """
        samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam
        filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam
        samtools index ${libraryid}.filtered.bam ${size}
        """
    } else if ( "${params.bam_unmapped_type}" == "bam" && params.bam_filter_minreadlength != 0 ){
        """
        samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam
        samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam
        filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam
        samtools index ${libraryid}.filtered.bam ${size}
        """
    } else if ( "${params.bam_unmapped_type}" == "fastq" && params.bam_filter_minreadlength != 0 ){
        """
        samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam
        samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam
        filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam
        samtools index ${libraryid}.filtered.bam ${size}

        ## FASTQ
        samtools fastq -tN ${libraryid}.unmapped.bam | pigz -p ${task.cpus - 1} > ${libraryid}.unmapped.fastq.gz
        rm ${libraryid}.unmapped.bam
        """
    } else if ( "${params.bam_unmapped_type}" == "both" && params.bam_filter_minreadlength != 0 ){
        """
        samtools view -h ${bam} -@ ${task.cpus} -f4 -b > ${libraryid}.unmapped.bam
        samtools view -h ${bam} -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -b > tmp_mapped.bam
        filter_bam_fragment_length.py -a -l ${params.bam_filter_minreadlength} -o ${libraryid} tmp_mapped.bam
        samtools index ${libraryid}.filtered.bam ${size}
        
        ## FASTQ
        samtools fastq -tN ${libraryid}.unmapped.bam | pigz -p ${task.cpus} > ${libraryid}.unmapped.fastq.gz
        """
    }
}

// samtools_filter bypass in case not run
if (params.run_bam_filtering) {
    ch_seqtypemerged_for_skipfiltering.mix(ch_output_from_filtering)
        .filter { it =~/.*filtered.bam/ }
        .into { ch_filtering_for_skiprmdup; ch_filtering_for_dedup; ch_filtering_for_markdup; ch_filtering_for_flagstat; ch_skiprmdup_for_libeval; ch_mapped_for_preseq } 

} else {
    ch_seqtypemerged_for_skipfiltering
        .into { ch_filtering_for_skiprmdup; ch_filtering_for_dedup; ch_filtering_for_markdup; ch_filtering_for_flagstat; ch_skiprmdup_for_libeval; ch_mapped_for_preseq } 

}

// Post filtering mapping QC - particularly to help see how much was removed from mapping quality filtering

process samtools_flagstat_after_filter {
    label 'sc_tiny'
    tag "$libraryid"
    publishDir "${params.outdir}/samtools/filtered_stats", mode: params.publish_dir_mode

    when:
    params.run_bam_filtering

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_filtering_for_flagstat

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.stats") into ch_bam_filtered_flagstat_for_multiqc, ch_bam_filtered_flagstat_for_endorspy

    script:
    """
    samtools flagstat $bam > ${libraryid}_postfilterflagstat.stats
    """
}

if (params.run_bam_filtering) {
  ch_flagstat_for_endorspy
    .join(ch_bam_filtered_flagstat_for_endorspy, by: [0,1,2,3,4,5,6])
    .set{ ch_allflagstats_for_endorspy }

} else {
  // Add a file entry to match expected no. tuple elements for endorS.py even if not giving second file
  ch_flagstat_for_endorspy
    .map { it -> 
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def stats = file(it[7])
        def poststats = file("$projectDir/assets/nf-core_eager_dummy.txt")

      [samplename, libraryid, lane, seqtype, organism, strandedness, udg, stats, poststats ] 
    }
    .set{ ch_allflagstats_for_endorspy }
}

// Endogenous DNA calculator to say how much of a library contained 'on-target' DNA

process endorSpy {
    label 'sc_tiny'
    tag "$libraryid"
    publishDir "${params.outdir}/endorspy", mode: params.publish_dir_mode

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(stats), path(poststats) from ch_allflagstats_for_endorspy

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_endorspy_for_multiqc

    script:
    if (params.run_bam_filtering) {
      """
      endorS.py -o json -n ${libraryid} ${stats} ${poststats}
      """
    } else {
      """
      endorS.py -o json -n ${libraryid} ${stats}
      """
    }
}

// Post-mapping PCR duplicate removal because these lab artefacts inflate coverage statistics

process dedup{
    label 'mc_small'
    tag "${libraryid}"
    publishDir "${params.outdir}/deduplication/", mode: params.publish_dir_mode,
        saveAs: {filename -> "${libraryid}/$filename"}

    when:
    !params.skip_deduplication && params.dedupper == 'dedup'

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_filtering_for_dedup

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.hist") into ch_hist_for_preseq
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_dedup_results_for_multiqc
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${libraryid}_rmdup.bam"), path("*.{bai,csi}") into ch_output_from_dedup, ch_dedup_for_libeval

    script:
    def treat_merged = params.dedup_all_merged ? '-m' : ''
    def size = params.large_ref ? '-c' : ''
    
    if ( bam.baseName != libraryid ) {
    // To make sure direct BAMs have a clean name
    """
    mv ${bam} ${libraryid}.bam
    dedup -Xmx${task.memory.toGiga()}g -i ${libraryid}.bam $treat_merged -o . -u 
    mv *.log dedup.log
    samtools sort -@ ${task.cpus} "${libraryid}"_rmdup.bam -o "${libraryid}"_rmdup.bam
    samtools index "${libraryid}"_rmdup.bam ${size}
    """
    } else {
    """
    dedup -Xmx${task.memory.toGiga()}g -i ${libraryid}.bam $treat_merged -o . -u 
    mv *.log dedup.log
    samtools sort -@ ${task.cpus} "${libraryid}"_rmdup.bam -o "${libraryid}"_rmdup.bam
    samtools index "${libraryid}"_rmdup.bam ${size}
    """
    }
}

process markduplicates{
    label 'mc_small'
    tag "${libraryid}"
    publishDir "${params.outdir}/deduplication/", mode: params.publish_dir_mode,
        saveAs: {filename -> "${libraryid}/$filename"}

    when:
    !params.skip_deduplication && params.dedupper == 'markduplicates'

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_filtering_for_markdup

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.metrics") into ch_markdup_results_for_multiqc
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${libraryid}_rmdup.bam"), path("*.{bai,csi}") into ch_output_from_markdup, ch_markdup_for_libeval

    script:
    def size = params.large_ref ? '-c' : ''

    if ( bam.baseName != libraryid ) {
    // To make sure direct BAMs have a clean name
    """
    mv ${bam} ${libraryid}.bam
    picard -Xmx${task.memory.toMega()}M MarkDuplicates INPUT=${libraryid}.bam OUTPUT=${libraryid}_rmdup.bam REMOVE_DUPLICATES=TRUE AS=TRUE METRICS_FILE="${libraryid}_rmdup.metrics" VALIDATION_STRINGENCY=SILENT
    samtools index ${libraryid}_rmdup.bam ${size}
    """
    } else {
    """
    picard -Xmx${task.memory.toMega()}M MarkDuplicates INPUT=${libraryid}.bam OUTPUT=${libraryid}_rmdup.bam REMOVE_DUPLICATES=TRUE AS=TRUE METRICS_FILE="${libraryid}_rmdup.metrics" VALIDATION_STRINGENCY=SILENT
    samtools index ${libraryid}_rmdup.bam ${size}
    """
    }

}

// This is for post-deduplication per-library evaluation steps _without_ any
// form of library merging.
if ( params.skip_deduplication ) {
  ch_skiprmdup_for_libeval.mix(ch_dedup_for_libeval, ch_markdup_for_libeval)
    .into{ ch_rmdup_for_preseq; ch_rmdup_for_damageprofiler; ch_rmdup_for_mapdamage; ch_for_nuclear_contamination; ch_rmdup_formtnucratio }
} else {
  ch_dedup_for_libeval.mix(ch_markdup_for_libeval)
    .into{ ch_rmdup_for_preseq; ch_rmdup_for_damageprofiler; ch_rmdup_for_mapdamage; ch_for_nuclear_contamination; ch_rmdup_formtnucratio }
}

// Merge independently sequenced libraries that share the same treatment (often
// done to improve complexity) and the same _sample_ name. Libraries with different
// strandedness/UDG treatment are not merged because bamtrim/pmdtools/genotyping need that info.

// Step one: work out which are single libraries (from skipping rmdup and both dedups) that do not need merging, and pass them to a bypass channel
if ( params.skip_deduplication ) {
  ch_input_for_librarymerging = ch_filtering_for_skiprmdup
    .groupTuple(by:[0,4,5,6])
    .branch{
      clean_libraryid: it[7].size() == 1
      merge_me: it[7].size() > 1
    }
} else {
    ch_input_for_librarymerging = ch_output_from_dedup.mix(ch_output_from_markdup)
    .groupTuple(by:[0,4,5,6])
    .branch{
      clean_libraryid: it[7].size() == 1
      merge_me: it[7].size() > 1
    }
}

// For libraries that do not need merging, collapse the grouped libraryIDs into a single value.
// This is a bit hacky as the grouped IDs could theoretically differ, but this should
// rarely be the case.
ch_input_for_librarymerging.clean_libraryid
  .map{
    it ->
      def libraryid = it[1][0]
      def bam = it[7].flatten()
      def bai = it[8].flatten()

      [it[0], libraryid, it[2], it[3], it[4], it[5], it[6], bam, bai ]
    }
  .set { ch_input_for_skiplibrarymerging }

ch_input_for_librarymerging.merge_me
  .map{
    it ->
      def libraryid = it[1][0]
      def seqtype = "merged"
      def bam = it[7].flatten()
      def bai = it[8].flatten()

      [it[0], libraryid, it[2], seqtype, it[4], it[5], it[6], bam, bai ]
    }
  .set { ch_fixedinput_for_librarymerging }

process library_merge {
  label 'sc_tiny'
  tag "${samplename}"
  publishDir "${params.outdir}/merged_bams/initial", mode: params.publish_dir_mode

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_fixedinput_for_librarymerging

  output:
  tuple samplename, val("${samplename}_libmerged"), lane, seqtype, organism, strandedness, udg, path("*_libmerged_rmdup.bam"), path("*_libmerged_rmdup.bam.{bai,csi}") into ch_output_from_librarymerging

  script:
  def size = params.large_ref ? '-c' : ''
  """
  samtools merge ${samplename}_udg${udg}_libmerged_rmdup.bam ${bam}
  samtools index ${samplename}_udg${udg}_libmerged_rmdup.bam ${size}
  """
}

// Mix back in libraries from skipping dedup, skipping library merging
if (!params.skip_deduplication) {
    ch_input_for_skiplibrarymerging.mix(ch_output_from_librarymerging)
        .filter { it =~/.*_rmdup.bam/ }
        .into { ch_rmdup_for_skipdamagemanipulation;  ch_rmdup_for_pmdtools; ch_rmdup_for_bamutils; ch_rmdup_for_bedtools; ch_rmdup_for_damagerescaling } 

} else {
    ch_input_for_skiplibrarymerging.mix(ch_output_from_librarymerging)
        .into { ch_rmdup_for_skipdamagemanipulation; ch_rmdup_for_pmdtools; ch_rmdup_for_bamutils; ch_rmdup_for_bedtools; ch_rmdup_for_damagerescaling } 
}

//////////////////////////////////////////////////
/* --     POST DEDUPLICATION EVALUATION      -- */
//////////////////////////////////////////////////

// Library complexity calculation from mapped reads - could a user cost-effectively sequence deeper for more unique information?
if ( params.skip_deduplication ) {
  ch_input_for_preseq = ch_rmdup_for_preseq.map{ it[0,1,2,3,4,5,6,7] }

} else if ( !params.skip_deduplication && params.dedupper == "markduplicates" ) {
  ch_input_for_preseq = ch_mapped_for_preseq.map{ it[0,1,2,3,4,5,6,7] }

} else if ( !params.skip_deduplication && params.dedupper == "dedup" ) {
  ch_input_for_preseq = ch_hist_for_preseq

}

process preseq {
    label 'sc_tiny'
    tag "${libraryid}"
    publishDir "${params.outdir}/preseq", mode: params.publish_dir_mode

    when:
    !params.skip_preseq

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(input) from ch_input_for_preseq

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${input.baseName}.preseq") into ch_preseq_for_multiqc

    script:
    pe_mode = params.skip_collapse && seqtype == "PE" ? '-P' : ''
    if(!params.skip_deduplication && params.preseq_mode == 'c_curve' && params.dedupper == "dedup"){
    """
    preseq c_curve -s ${params.preseq_step_size} -o ${input.baseName}.preseq -H ${input}
    """
    } else if( !params.skip_deduplication && params.preseq_mode == 'c_curve' && params.dedupper == "markduplicates"){
    """
    preseq c_curve -s ${params.preseq_step_size} -o ${input.baseName}.preseq -B ${input} ${pe_mode}
    """
    } else if ( params.skip_deduplication && params.preseq_mode == 'c_curve' ) {
    """
    preseq c_curve -s ${params.preseq_step_size} -o ${input.baseName}.preseq -B ${input} ${pe_mode}
    """
    } else if(!params.skip_deduplication && params.preseq_mode == 'lc_extrap' && params.dedupper == "dedup"){
    """
    preseq lc_extrap -s ${params.preseq_step_size} -o ${input.baseName}.preseq -H ${input} -n ${params.preseq_bootstrap} -e ${params.preseq_maxextrap} -cval ${params.preseq_cval} -x ${params.preseq_terms}
    """
    } else if( !params.skip_deduplication && params.preseq_mode == 'lc_extrap' && params.dedupper == "markduplicates"){
    """
    preseq lc_extrap -s ${params.preseq_step_size} -o ${input.baseName}.preseq -B ${input} ${pe_mode} -n ${params.preseq_bootstrap} -e ${params.preseq_maxextrap} -cval ${params.preseq_cval} -x ${params.preseq_terms}
    """
    } else if ( params.skip_deduplication && params.preseq_mode == 'lc_extrap' ) {
    """
    preseq lc_extrap -s ${params.preseq_step_size} -o ${input.baseName}.preseq -B ${input} ${pe_mode} -n ${params.preseq_bootstrap} -e ${params.preseq_maxextrap} -cval ${params.preseq_cval} -x ${params.preseq_terms}
    """
    }
}

// Optional mapping statistics for specific annotations - e.g. genes in bacterial genome

process bedtools {
  label 'mc_small'
  tag "${libraryid}"
  publishDir "${params.outdir}/bedtools", mode: params.publish_dir_mode

  when:
  params.run_bedtools_coverage

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_bedtools
  file anno_file from ch_anno_for_bedtools.collect()

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*")

  script:
  sorting_of_anno = params.anno_file_is_unsorted ? "" : "-sorted"
  """
  ## Create genome file from bam header
  samtools view -H ${bam} | grep '@SQ' | sed 's#@SQ\tSN:\\|LN:##g' > genome.txt
  
  ##  Run bedtools
  bedtools coverage -nonamecheck -g genome.txt ${sorting_of_anno} -a ${anno_file} -b ${bam} | pigz -p ${task.cpus - 1} > "${bam.baseName}".breadth.gz
  bedtools coverage -nonamecheck -g genome.txt ${sorting_of_anno} -a ${anno_file} -b ${bam} -mean | pigz -p ${task.cpus - 1} > "${bam.baseName}".depth.gz
  """
}

//////////////////////////////////////////////////////////////
/* --    ANCIENT DNA EVALUATION AND BAM MODIFICATION     -- */
//////////////////////////////////////////////////////////////

// Calculate typical aDNA damage frequency distribution with DamageProfiler

process damageprofiler {
    label 'sc_small'
    tag "${libraryid}"

    publishDir "${params.outdir}/damageprofiler", mode: params.publish_dir_mode

    when:
    !params.skip_damage_calculation && params.damage_calculation_tool == 'damageprofiler'

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_damageprofiler
    file fasta from ch_fasta_for_damageprofiler.collect()
    file fai from ch_fai_for_damageprofiler.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${base}/*.txt") optional true
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${base}/*.log")
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${base}/*.pdf") optional true
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("${base}/*.json") optional true into ch_damageprofiler_results

    script:
    base = "${bam.baseName}"
    """
    damageprofiler -Xmx${task.memory.toGiga()}g -i $bam -r $fasta -l ${params.damageprofiler_length} -t ${params.damageprofiler_threshold} -o . -yaxis_damageplot ${params.damageprofiler_yaxis}
    """
}

// Calculate typical aDNA damage frequency distribution with mapDamage

process mapdamage_calculation {
    label 'sc_small'
    tag "${libraryid}"

    publishDir "${params.outdir}/mapdamage", mode: params.publish_dir_mode

    when:
    !params.skip_damage_calculation && params.damage_calculation_tool == 'mapdamage'

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_mapdamage
    file fasta from ch_fasta_for_mapdamage.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("results_${base}") into ch_output_from_mapdamage
    path ("results_${base}") into ch_mapdamage_for_multiqc

    script:
    base = "${bam.baseName}"
    def singlestranded = strandedness == "single" ? '--single-stranded' : ''
    def downsample = params.mapdamage_downsample != 0 ? "-n ${params.mapdamage_downsample} --downsample-seed=1" : '' // Include seed to make results consistent between runs
    """
    mapDamage -i ${bam} -r ${fasta} ${singlestranded} ${downsample} --ymax=${params.mapdamage_yaxis} --no-stats
    """
}

// Damage rescaling with mapDamage

process mapdamage_rescaling {

    label 'sc_small'
    tag "${libraryid}"

    publishDir "${params.outdir}/damage_rescaling", mode: params.publish_dir_mode

    when:
    params.run_mapdamage_rescaling

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_damagerescaling
    file fasta from ch_fasta_for_damagerescaling.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_rescaled.bam"), path("*rescaled.bam.{bai,csi}") into ch_output_from_damagerescaling

    script:
    def base = "${bam.baseName}"
    def singlestranded = strandedness == "single" ? '--single-stranded' : ''
    def size = params.large_ref ? '-c' : ''
    def rescale_length_3p = params.rescale_length_3p != 0 ? "--rescale-length-3p=${params.rescale_length_3p}" : ""
    def rescale_length_5p = params.rescale_length_5p != 0 ? "--rescale-length-5p=${params.rescale_length_5p}" : ""
    """
    mapDamage -i ${bam} -r ${fasta} --rescale --rescale-out="${base}_rescaled.bam" --seq-length=${params.rescale_seqlength} ${rescale_length_5p} ${rescale_length_3p} ${singlestranded}
    samtools index ${base}_rescaled.bam ${size}
    """

}

// Optionally perform further aDNA evaluation or filtering for just reads with damage etc.

process mask_reference_for_pmdtools {
    label 'sc_tiny'
    tag "${fasta}"
    publishDir "${params.outdir}/reference_genome/masked_reference", mode: params.publish_dir_mode

    when: (params.pmdtools_reference_mask && params.run_pmdtools)

    input:
    file fasta from ch_unmasked_fasta_for_masking
    file bedfile from ch_bedfile_for_reference_masking

    output:
    file "${fasta.baseName}_masked.fa" into ch_masked_fasta_for_pmdtools

    script:
    log.info "[nf-core/eager]: Masking reference \'${fasta}\' at positions found in \'${bedfile}\'. Masked reference will be used for pmdtools."
    """
    bedtools maskfasta -fi ${fasta} -bed ${bedfile} -fo ${fasta.baseName}_masked.fa
    """
}

// If masking was requested, use masked reference for pmdtools, else use original reference
if (params.pmdtools_reference_mask) {
  ch_masked_fasta_for_pmdtools.set{ch_fasta_for_pmdtools}
} else {
  ch_unmasked_fasta_for_pmdtools.set{ch_fasta_for_pmdtools}
}

process pmdtools {
    label 'mc_medium'
    tag "${libraryid}"
    publishDir "${params.outdir}/pmdtools", mode: params.publish_dir_mode

    when: params.run_pmdtools

    input: 
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_for_pmdtools
    file fasta from ch_fasta_for_pmdtools.collect()

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.pmd.bam"), path("*.pmd.bam.{bai,csi}") into ch_output_from_pmdtools
    file "*.cpg.range*.txt"

    script:
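    // Select the pmdtools treatment flag from the UDG field ('none', 'half' or 'full').
    // A bare truthiness test on udg is true for any non-empty string, including 'none',
    // so the 'none' case is checked explicitly.
    def treatment = (!udg || udg == 'none') ? '--UDGminus' : (udg == 'half' ? '--UDGhalf' : '--CpG')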
    def size = params.large_ref ? '-c' : ''
    def platypus = params.pmdtools_platypus ? '--platypus' : ''
    """
    #Run Filtering step 
    samtools calmd ${bam} ${fasta} | pmdtools --threshold ${params.pmdtools_threshold} ${treatment} --header | samtools view -Sb - > "${libraryid}".pmd.bam
    
    #Run Calc Range step
    ## To allow early shut off of pipe: https://github.com/nextflow-io/nextflow/issues/1564
    trap 'if [[ \$? == 141 ]]; then echo "Shutting samtools early due to -n parameter" && samtools index ${libraryid}.pmd.bam ${size}; exit 0; fi' EXIT
    samtools calmd ${bam} ${fasta} | pmdtools --deamination ${platypus} --range ${params.pmdtools_range} ${treatment} -n ${params.pmdtools_max_reads} > "${libraryid}".cpg.range."${params.pmdtools_range}".txt
    
    samtools index ${libraryid}.pmd.bam ${size}
    """
}

// BAM trimming for non-UDG or half-UDG libraries only, to remove damage prior to genotyping

if ( params.run_trim_bam ) {

    // Full-UDG reads have already had damage removed in the lab, so skip trimming them to avoid making reads even shorter.
    // Both non-UDG and half-UDG libraries are sent for trimming; clip lengths are chosen per treatment and strandedness in bam_trim.
    // Note: adapters and similar sequencing artefacts should be removed before mapping, so they are not accounted for here.
    ch_bamutils_decision = ch_rmdup_for_bamutils.branch{
        totrim: it[6] == 'none' || it[6] == 'half' 
        notrim: it[6] == 'full'
    }

} else {

    ch_bamutils_decision = ch_rmdup_for_bamutils.branch{
        totrim: it[6] == "dummy"
        notrim: it[6] == 'full' || it[6] == 'none' || it[6] == 'half'
    }

}
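
// Illustrative sketch only (commented out, not part of the pipeline): how the branch
// above routes tuples when --run_trim_bam is set. The sample values, the channel name
// ch_example and the use of Channel.of (assumes a Nextflow version that provides it)
// are assumptions for demonstration. Index 6 of each tuple is the UDG treatment field.
//
//   Channel.of(
//       ['S1', 'L1', 1, 'SE', 'human', 'double', 'half', 'L1.bam', 'L1.bam.bai'],
//       ['S1', 'L2', 1, 'SE', 'human', 'double', 'full', 'L2.bam', 'L2.bam.bai'] )
//     .branch{
//         totrim: it[6] == 'none' || it[6] == 'half'
//         notrim: it[6] == 'full'
//     }
//     .set{ ch_example }
//
//   ch_example.totrim.view()   // emits the 'half' library (L1)
//   ch_example.notrim.view()   // emits the 'full' library (L2)
//
// Without --run_trim_bam, no tuple has a UDG value of "dummy", so .totrim stays empty
// and every library passes through .notrim untouched.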

process bam_trim {
    label 'mc_small'
    tag "${libraryid}" 
    publishDir "${params.outdir}/trimmed_bam", mode: params.publish_dir_mode

    when: params.run_trim_bam

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_bamutils_decision.totrim

    output: 
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.trimmed.bam"), path("*.trimmed.bam.{bai,csi}") into ch_trimmed_from_bamutils

    script:
    def softclip = params.bamutils_softclip ? '-c' : '' 
    def size = params.large_ref ? '-c' : ''
    def left_clipping = strandedness == "double" ? (udg == "half" ? "${params.bamutils_clip_double_stranded_half_udg_left}" : "${params.bamutils_clip_double_stranded_none_udg_left}") : (udg == "half" ? "${params.bamutils_clip_single_stranded_half_udg_left}" : "${params.bamutils_clip_single_stranded_none_udg_left}")
    def right_clipping = strandedness == "double" ? (udg == "half" ? "${params.bamutils_clip_double_stranded_half_udg_right}" : "${params.bamutils_clip_double_stranded_none_udg_right}") : (udg == "half" ? "${params.bamutils_clip_single_stranded_half_udg_right}" : "${params.bamutils_clip_single_stranded_none_udg_right}")

    // def left_clipping = udg == "half" ? "${params.bamutils_clip_half_udg_left}" : "${params.bamutils_clip_none_udg_left}"
    // def right_clipping = udg == "half" ? "${params.bamutils_clip_half_udg_right}" : "${params.bamutils_clip_none_udg_right}"
    """
    bam trimBam $bam tmp.bam -L ${left_clipping} -R ${right_clipping} ${softclip}
    samtools sort -@ ${task.cpus} tmp.bam -o ${libraryid}_udg${udg}.trimmed.bam 
    samtools index ${libraryid}_udg${udg}.trimmed.bam ${size}
    """
}

// Post-trimming merging of libraries into single samples, except across SS/DS
// libraries, which should be genotyped separately. We assume that if trimming
// is turned on, 'lab-removed' (full-UDG) libraries can be merged with
// 'in-silico damage removed' (trimmed) libraries to improve genotyping.

ch_trimmed_formerge = ch_bamutils_decision.notrim
  .mix(ch_trimmed_from_bamutils)
  .groupTuple(by:[0,4,5])
  .map{
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def bam = it[7].flatten()
        def bai = it[8].flatten()

      [samplename, libraryid, lane, seqtype, organism, strandedness, udg, bam, bai ]
  }
  .branch{
    skip_merging: it[7].size() == 1
    merge_me: it[7].size() > 1
  }
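
// Illustrative sketch only (commented out): the shape of a tuple after
// groupTuple(by:[0,4,5]). The values are hypothetical. Two libraries of one sample,
//
//   ['S1', 'L1', 1, 'SE', 'human', 'double', 'half', L1_udghalf.trimmed.bam, L1_udghalf.trimmed.bam.bai]
//   ['S1', 'L2', 1, 'SE', 'human', 'double', 'full', L2.bam,                 L2.bam.bai]
//
// are grouped into a single tuple in which the key fields (samplename, organism,
// strandedness) stay scalar and every other field becomes a list:
//
//   ['S1', ['L1','L2'], [1,1], ['SE','SE'], 'human', 'double', ['half','full'], [bam, bam], [bai, bai]]
//
// Here it[7].size() is 2, so the tuple goes to .merge_me. A sample with a single
// library yields one-element lists and goes to .skip_merging. The order of items
// within the grouped lists should not be relied upon.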

//////////////////////////////////////////////////////////////////////////
/* --    POST aDNA BAM MODIFICATION AND FINAL MAPPING STATISTICS     -- */
//////////////////////////////////////////////////////////////////////////

process additional_library_merge {
  label 'sc_tiny'
  tag "${samplename}"
  publishDir "${params.outdir}/merged_bams/additional", mode: params.publish_dir_mode

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_trimmed_formerge.merge_me

  output:
  tuple samplename, val("${samplename}_libmerged"), lane, seqtype, organism, strandedness, udg, path("*_libmerged_add.bam"), path("*_libmerged_add.bam.{bai,csi}") into ch_output_from_trimmerge

  script:
  def size = params.large_ref ? '-c' : ''
  """
  samtools merge ${samplename}_libmerged_add.bam ${bam}
  samtools index ${samplename}_libmerged_add.bam ${size}
  """
}

ch_trimmed_formerge.skip_merging
  .mix(ch_output_from_trimmerge)
  .into{ ch_output_from_bamutils; ch_addlibmerge_for_qualimap; ch_for_sexdeterrmine_prep }

  // General mapping quality statistics for whole reference sequence - e.g. X and % coverage

process qualimap {
    label 'mc_small'
    tag "${samplename}"
    publishDir "${params.outdir}/qualimap", mode: params.publish_dir_mode

    when:
    !params.skip_qualimap

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_addlibmerge_for_qualimap
    file fasta from ch_fasta_for_qualimap.collect()
    path snpcapture_bed from ch_snpcapture_bed 

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*") into ch_qualimap_results

    script:
    def snpcap = snpcapture_bed.getName() != 'nf-core_eager_dummy.txt' ? "-gff ${snpcapture_bed}" : ''
    """
    qualimap bamqc -bam $bam -nt ${task.cpus} -outdir . -outformat "HTML" ${snpcap} --java-mem-size=${task.memory.toGiga()}G
    """
}

/////////////////////////////
/* --    GENOTYPING     -- */
/////////////////////////////

// Reroute files for genotyping; we have to ensure to select lib-merged BAMs, as input channel will also contain the un-merged ones resulting in unwanted multi-sample VCFs
if ( params.run_genotyping && params.genotyping_source == 'raw' ) {
    ch_output_from_bamutils
      .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd }

} else if ( params.run_genotyping && params.genotyping_source == "trimmed" && !params.run_trim_bam )  {
    exit 1, "[nf-core/eager] error: Cannot run genotyping with 'trimmed' source without running BAM trimming (--run_trim_bam)! Please check input parameters."

} else if ( params.run_genotyping && params.genotyping_source == "trimmed" && params.run_trim_bam )  {
    ch_output_from_bamutils
        .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd }

} else if ( params.run_genotyping && params.genotyping_source == "pmd" && !params.run_pmdtools )  {
    exit 1, "[nf-core/eager] error: Cannot run genotyping with 'pmd' source without running pmdtools (--run_pmdtools)! Please check input parameters."

} else if ( params.run_genotyping && params.genotyping_source == "pmd" && params.run_pmdtools )  {
  ch_output_from_pmdtools
    .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd }

} else if ( params.run_genotyping && params.genotyping_source == "rescaled" && params.run_mapdamage_rescaling) {
  ch_output_from_damagerescaling
    .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd }

} else if ( params.run_genotyping && params.genotyping_source == "rescaled" && !params.run_mapdamage_rescaling) {
    exit 1, "[nf-core/eager] error: Cannot run genotyping with 'rescaled' source without running damage rescaling (--run_mapdamage_rescaling)! Please check input parameters."

} else if ( !params.run_genotyping )  {
    // Genotyping is off: pass the deduplicated BAMs through regardless of which damage manipulation steps were run.
    // This also covers --run_trim_bam combined with --run_pmdtools, which previously left these channels undefined.
    ch_rmdup_for_skipdamagemanipulation
    .into { ch_damagemanipulation_for_skipgenotyping; ch_damagemanipulation_for_readgroupreplacement; ch_damagemanipulation_for_genotyping_ug; ch_damagemanipulation_for_genotyping_hc; ch_damagemanipulation_for_genotyping_freebayes; ch_damagemanipulation_for_genotyping_pileupcaller; ch_damagemanipulation_for_genotyping_angsd }

}

// replace readgroups to ensure single 'sample' per VCF for MultiVCFAnalyzer only

process picard_addorreplacereadgroups {
  label 'sc_tiny'
  tag "${samplename}"

  when:
  params.run_genotyping && params.genotyping_tool == 'ug' && params.run_multivcfanalyzer

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_damagemanipulation_for_readgroupreplacement

  output:
  tuple samplename, val("${samplename}"), lane, seqtype, organism, strandedness, udg, path("*rg.bam"), path("*rg.bam.{bai,csi}") into ch_readgroup_replacement_for_ug

  script:
  def size = params.large_ref ? '-c' : ''
  """
  picard -Xmx${task.memory.toGiga()}g AddOrReplaceReadGroups I=${bam} O=${samplename}_rg.bam RGID=1 RGLB="${samplename}_rg" RGPL=illumina RGPU=4410 RGSM="${samplename}_rg" VALIDATION_STRINGENCY=LENIENT
  samtools index ${samplename}_rg.bam ${size}
  """

}

if ( params.run_genotyping && params.genotyping_tool == 'ug' && params.run_multivcfanalyzer ) {
  ch_input_for_ug = ch_readgroup_replacement_for_ug
} else {
  ch_input_for_ug = ch_damagemanipulation_for_genotyping_ug
}

// UnifiedGenotyper - although no longer supported, it suits aDNA better (HC performs de novo assembly, which requires higher coverage) and is needed for MultiVCFAnalyzer

process genotyping_ug {
  label 'mc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode, pattern: '*{.vcf.gz,.realign.bam,realign.bai}'

  when:
  params.run_genotyping && params.genotyping_tool == 'ug'

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_input_for_ug
  file fasta from ch_fasta_for_genotyping_ug.collect()
  file fai from ch_fai_for_ug.collect()
  file dict from ch_dict_for_ug.collect()

  output: 
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*vcf.gz") into ch_ug_for_multivcfanalyzer,ch_ug_for_vcf2genome,ch_ug_for_bcftools_stats
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file("*.realign.{bam,bai}") optional true

  script:
  def defaultbasequalities = !params.gatk_ug_defaultbasequalities ? '' : " --defaultBaseQualities ${params.gatk_ug_defaultbasequalities}" 
  def keep_realign = params.gatk_ug_keep_realign_bam ? "samtools index ${samplename}.realign.bam" : "rm ${samplename}.realign.{bam,bai}"
  if (!params.gatk_dbsnp)
    """
    samtools index -b ${bam}
    gatk3 -Xmx${task.memory.toGiga()}g -T RealignerTargetCreator -R ${fasta} -I ${bam} -nt ${task.cpus} -o ${samplename}.intervals ${defaultbasequalities}
    gatk3 -Xmx${task.memory.toGiga()}g -T IndelRealigner -R ${fasta} -I ${bam} -targetIntervals ${samplename}.intervals -o ${samplename}.realign.bam ${defaultbasequalities}
    gatk3 -Xmx${task.memory.toGiga()}g -T UnifiedGenotyper -R ${fasta} -I ${samplename}.realign.bam -o ${samplename}.unifiedgenotyper.vcf -nt ${task.cpus} --genotype_likelihoods_model ${params.gatk_ug_genotype_model} -stand_call_conf ${params.gatk_call_conf} --sample_ploidy ${params.gatk_ploidy} -dcov ${params.gatk_downsample} --output_mode ${params.gatk_ug_out_mode} ${defaultbasequalities}
    
    $keep_realign
    
    bgzip -@ ${task.cpus} ${samplename}.unifiedgenotyper.vcf
    """
  else if (params.gatk_dbsnp)
    """
    samtools index ${bam}
    gatk3 -Xmx${task.memory.toGiga()}g -T RealignerTargetCreator -R ${fasta} -I ${bam} -nt ${task.cpus} -o ${samplename}.intervals ${defaultbasequalities}
    gatk3 -Xmx${task.memory.toGiga()}g -T IndelRealigner -R ${fasta} -I ${bam} -targetIntervals ${samplename}.intervals -o ${samplename}.realign.bam ${defaultbasequalities}
    gatk3 -Xmx${task.memory.toGiga()}g -T UnifiedGenotyper -R ${fasta} -I ${samplename}.realign.bam -o ${samplename}.unifiedgenotyper.vcf -nt ${task.cpus} --dbsnp ${params.gatk_dbsnp} --genotype_likelihoods_model ${params.gatk_ug_genotype_model} -stand_call_conf ${params.gatk_call_conf} --sample_ploidy ${params.gatk_ploidy} -dcov ${params.gatk_downsample} --output_mode ${params.gatk_ug_out_mode} ${defaultbasequalities}
    
    $keep_realign
    
    bgzip -@  ${task.cpus} ${samplename}.unifiedgenotyper.vcf
    """
}

 // HaplotypeCaller as 'best practise' tool for human DNA in particular 

process genotyping_hc {
  label 'mc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode

  when:
  params.run_genotyping && params.genotyping_tool == 'hc'

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_damagemanipulation_for_genotyping_hc
  file fasta from ch_fasta_for_genotyping_hc.collect()
  file fai from ch_fai_for_hc.collect()
  file dict from ch_dict_for_hc.collect()

  output: 
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*vcf.gz") into ch_hc_for_bcftools_stats

  script:
  if (!params.gatk_dbsnp)
    """
    gatk HaplotypeCaller --java-options "-Xmx${task.memory.toGiga()}G" -R ${fasta} -I ${bam} -O ${samplename}.haplotypecaller.vcf -stand-call-conf ${params.gatk_call_conf} --sample-ploidy ${params.gatk_ploidy} --output-mode ${params.gatk_hc_out_mode} --emit-ref-confidence ${params.gatk_hc_emitrefconf}
    bgzip -@ ${task.cpus} ${samplename}.haplotypecaller.vcf
    """

  else if (params.gatk_dbsnp)
    """
    gatk HaplotypeCaller --java-options "-Xmx${task.memory.toGiga()}G" -R ${fasta} -I ${bam} -O ${samplename}.haplotypecaller.vcf --dbsnp ${params.gatk_dbsnp} -stand-call-conf ${params.gatk_call_conf} --sample-ploidy ${params.gatk_ploidy} --output-mode ${params.gatk_hc_out_mode} --emit-ref-confidence ${params.gatk_hc_emitrefconf}
    bgzip -@  ${task.cpus} ${samplename}.haplotypecaller.vcf
    """
}

 // Freebayes for 'more efficient/simple' and more generic genotyping (vs HC) 

process genotyping_freebayes {
  label 'mc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode

  when:
  params.run_genotyping && params.genotyping_tool == 'freebayes'

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_damagemanipulation_for_genotyping_freebayes
  file fasta from ch_fasta_for_genotyping_freebayes.collect()
  file fai from ch_fai_for_freebayes.collect()
  file dict from ch_dict_for_freebayes.collect()

  output: 
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*vcf.gz") into ch_fb_for_bcftools_stats
  
  script:
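  // Compare as strings: a GString such as "${params.freebayes_g}" is never equal to the integer 0
  def skip_coverage = params.freebayes_g.toString() == '0' ? "" : "-g ${params.freebayes_g}"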
  """
  freebayes -f ${fasta} -p ${params.freebayes_p} -C ${params.freebayes_C} ${skip_coverage} ${bam} > ${samplename}.freebayes.vcf
  bgzip -@  ${task.cpus} ${samplename}.freebayes.vcf
  """
}


 // Branch channel by strandedness
ch_damagemanipulation_for_genotyping_pileupcaller
  .branch{
      singleStranded: it[5] == "single"
      doubleStranded: it[5] == "double"
  }
  .set{ch_input_for_genotyping_pileupcaller}

 // Create pileupcaller input tuples
ch_input_for_genotyping_pileupcaller.singleStranded
  .groupTuple(by:[5])
  .map{
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def bam = it[7].flatten()
        def bai = it[8].flatten()

      [samplename, libraryid, lane, seqtype, organism, strandedness, udg, bam, bai ]
  }
  .set {ch_prepped_for_pileupcaller_single}

ch_input_for_genotyping_pileupcaller.doubleStranded
  .groupTuple(by:[5])
  .map{
        def samplename = it[0]
        def libraryid  = it[1]
        def lane = it[2]
        def seqtype = it[3]
        def organism = it[4]
        def strandedness = it[5]
        def udg = it[6]
        def bam = it[7].flatten()
        def bai = it[8].flatten()

      [samplename, libraryid, lane, seqtype, organism, strandedness, udg, bam, bai ]
  }
  .set {ch_prepped_for_pileupcaller_double}
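
// Illustrative sketch only (commented out): after groupTuple(by:[5]), all libraries
// of one strandedness arrive in genotyping_pileupcaller as a single tuple, so the
// sample names and BAMs are parallel lists. With hypothetical values
//
//   samplename = ['S1', 'S2']
//   bam        = [S1.bam, S2.bam]
//
// the process below builds
//
//   sample_names = "S1,S2"          // samplename.flatten().join(",")
//   bam_list     = "S1.bam S2.bam"  // bam.flatten().join(" ")
//
// Both lists come from the same grouped tuple, so the n-th sample name corresponds
// to the n-th BAM passed to samtools mpileup.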

process genotyping_pileupcaller {
  label 'mc_small'
  tag "${strandedness}"
  publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode

  when:
  params.run_genotyping && params.genotyping_tool == 'pileupcaller'

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_prepped_for_pileupcaller_double.mix(ch_prepped_for_pileupcaller_single)
  file fasta from ch_fasta_for_genotyping_pileupcaller.collect()
  file fai from ch_fai_for_pileupcaller.collect()
  file dict from ch_dict_for_pileupcaller.collect()
  path(bed) from ch_bed_for_pileupcaller.collect()
  path(snp) from ch_snp_for_pileupcaller.collect()

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("pileupcaller.${strandedness}.*") into ch_for_eigenstrat_snp_coverage

  script:
  def use_bed = bed.getName() != 'nf-core_eager_dummy.txt' ? "-l ${bed}" : ''
  def use_snp = snp.getName() != 'nf-core_eager_dummy2.txt' ? "-f ${snp}" : ''

  def transitions_mode = strandedness == "single" ? "" : "${params.pileupcaller_transitions_mode}" == 'SkipTransitions' ? "--skipTransitions" : "${params.pileupcaller_transitions_mode}" == 'TransitionsMissing' ? "--transitionsMissing" : ""
  def caller = "--${params.pileupcaller_method}"
  def ssmode = strandedness == "single" ? "--singleStrandMode" : ""
  def bam_list = bam.flatten().join(" ")
  def sample_names = samplename.flatten().join(",")
  def map_q = params.pileupcaller_min_map_quality
  def base_q = params.pileupcaller_min_base_quality

  """
  samtools mpileup -B --ignore-RG -q ${map_q} -Q ${base_q} ${use_bed} -f ${fasta} ${bam_list} | pileupCaller ${caller} ${ssmode} ${transitions_mode} --sampleNames ${sample_names} ${use_snp} -e pileupcaller.${strandedness}
  """
}

process eigenstrat_snp_coverage {
  label 'mc_tiny'
  tag "${strandedness}"
  publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode
  
  when:
  params.run_genotyping && params.genotyping_tool == 'pileupcaller'
  
  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*") from ch_for_eigenstrat_snp_coverage
  
  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_eigenstrat_snp_cov_for_multiqc
  path("*_eigenstrat_coverage.txt")
  
  script:
  /* 
  The following code block can be swapped in once the eigenstratdatabasetools MultiQC module becomes available.
  """
  eigenstrat_snp_coverage -i pileupcaller.${strandedness} >${strandedness}_eigenstrat_coverage.txt -j ${strandedness}_eigenstrat_coverage_mqc.json
  """
  */
  """
  eigenstrat_snp_coverage -i pileupcaller.${strandedness} >${strandedness}_eigenstrat_coverage.txt
  parse_snp_cov.py ${strandedness}_eigenstrat_coverage.txt
  """
}

process genotyping_angsd {
  label 'mc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode

  when:
  params.run_genotyping && params.genotyping_tool == 'angsd'

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_damagemanipulation_for_genotyping_angsd
  file fasta from ch_fasta_for_genotyping_angsd.collect()
  file fai from ch_fai_for_angsd.collect()
  file dict from ch_dict_for_angsd.collect()

  output: 
  path("${samplename}*")
  
  script:
  switch ( "${params.angsd_glmodel}" ) {
    case "samtools":
    angsd_glmodel = "1"; break
    case "gatk":
    angsd_glmodel = "2"; break
    case "soapsnp":
    angsd_glmodel = "3"; break
    case "syk":
    angsd_glmodel = "4"; break
  }

  switch ( "${params.angsd_glformat}" ) {
    case "text":
    angsd_glformat = "4"; break
    case "binary":
    angsd_glformat = "1"; break
    case "beagle":
    angsd_glformat = "2"; break
    case "binary_three":
    angsd_glformat = "3"; break
  }
  
  def angsd_fasta = !params.angsd_createfasta ? '' : params.angsd_fastamethod == 'random' ? '-doFasta 1 -doCounts 1' : '-doFasta 2 -doCounts 1' 
  def angsd_majorminor = params.angsd_glformat != "beagle" ? '' : '-doMajorMinor 1'
  """
  echo ${bam} > bam.filelist
  mkdir angsd
  angsd -bam bam.filelist -nThreads ${task.cpus} -GL ${angsd_glmodel} -doGlF ${angsd_glformat} ${angsd_majorminor} ${angsd_fasta} -out ${samplename}.angsd
  """
}

////////////////////////////////////
/* --    GENOTYPING STATS     -- */
////////////////////////////////////

process bcftools_stats {
  label  'mc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/bcftools/stats", mode: params.publish_dir_mode

  when: 
  params.run_bcftools_stats

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(vcf) from ch_ug_for_bcftools_stats.mix(ch_hc_for_bcftools_stats,ch_fb_for_bcftools_stats)
  file fasta from ch_fasta_for_bcftools_stats.collect()

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.vcf.stats") into ch_bcftools_stats_for_multiqc

  script:
  """
  bcftools stats *.vcf.gz -F ${fasta} > ${samplename}.vcf.stats
  """
}

////////////////////////////////////
/* --    CONSENSUS CALLING     -- */
////////////////////////////////////

// Generate a simple consensus-called FASTA file based on genotype VCF

process vcf2genome {
  label  'mc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/consensus_sequence", mode: params.publish_dir_mode

  when: 
  params.run_vcf2genome

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(vcf) from ch_ug_for_vcf2genome
  file fasta from ch_fasta_for_vcf2genome.collect()

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.fasta.gz")

  script:
  def out = !params.vcf2genome_outfile ? "${samplename}.fasta" : "${samplename}_${params.vcf2genome_outfile}.fasta"
  def fasta_head = !params.vcf2genome_header ? "${samplename}" : "${params.vcf2genome_header}"
  """
  pigz -d -f -p ${task.cpus} ${vcf}
  vcf2genome -Xmx${task.memory.toGiga()}g -draft ${out} -draftname "${fasta_head}" -in ${vcf.baseName} -minc ${params.vcf2genome_minc} -minfreq ${params.vcf2genome_minfreq} -minq ${params.vcf2genome_minq} -ref ${fasta} -refMod ${out}_refmod.fasta -uncertain ${out}_uncertainty.fasta
  pigz -f -p ${task.cpus} ${out}*
  bgzip -@ ${task.cpus} *.vcf
  """
}

// More complex consensus caller with additional filtering functionality (e.g. for heterozygous calls) to generate SNP tables and other things sometimes used in aDNA bacteria studies

// Create input channel for MultiVCFAnalyzer, possibly mixing with pre-made VCFs.
if (!params.additional_vcf_files) {
    ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[-1] }.collect()
} else {
    ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[-1] }.mix(ch_extravcfs_for_multivcfanalyzer).collect()
}

process multivcfanalyzer {
  label  'mc_small'
  publishDir "${params.outdir}/multivcfanalyzer", mode: params.publish_dir_mode

  when:
  params.genotyping_tool == 'ug' && params.run_multivcfanalyzer && params.gatk_ploidy.toString() == '2'

  input:
  file vcf from ch_vcfs_for_multivcfanalyzer
  file fasta from ch_fasta_for_multivcfanalyzer

  output:
  file('fullAlignment.fasta.gz')
  file('info.txt.gz')
  file('snpAlignment.fasta.gz')
  file('snpAlignmentIncludingRefGenome.fasta.gz')
  file('snpStatistics.tsv.gz')
  file('snpTable.tsv.gz')
  file('snpTableForSnpEff.tsv.gz')
  file('snpTableWithUncertaintyCalls.tsv.gz')
  file('structureGenotypes.tsv.gz')
  file('structureGenotypes_noMissingData-Columns.tsv.gz')
  file('MultiVCFAnalyzer.json') optional true into ch_multivcfanalyzer_for_multiqc

  script:
  def write_freqs = params.write_allele_frequencies ? "T" : "F"
  """
  pigz -d -f -p ${task.cpus} ${vcf}
  multivcfanalyzer -Xmx${task.memory.toGiga()}g ${params.snp_eff_results} ${fasta} ${params.reference_gff_annotations} . ${write_freqs} ${params.min_genotype_quality} ${params.min_base_coverage} ${params.min_allele_freq_hom} ${params.min_allele_freq_het} ${params.reference_gff_exclude} *.vcf
  pigz -p ${task.cpus} *.tsv *.txt snpAlignment.fasta snpAlignmentIncludingRefGenome.fasta fullAlignment.fasta
  bgzip -@ ${task.cpus} *.vcf
  """
 }

////////////////////////////////////////////////////////////
/* --    HUMAN DNA SPECIFIC ADDITIONAL INFORMATION     -- */
////////////////////////////////////////////////////////////

// Mitochondrial to nuclear ratio helps to evaluate quality of tissue sampled

 process mtnucratio {
  label 'sc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/mtnucratio", mode: params.publish_dir_mode

  when: 
  params.run_mtnucratio

  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_rmdup_formtnucratio

  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.mtnucratio")
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_mtnucratio_for_multiqc

  script:
  """
  mtnucratio -Xmx${task.memory.toGiga()}g ${bam} "${params.mtnucratio_header}"
  """
 }

// Human biological sex estimation

// rename to prevent single/double stranded library sample name-based file conflict
process sexdeterrmine_prep {
  label 'sc_small'
  
  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(bam), path(bai) from ch_for_sexdeterrmine_prep
  
  output:
  file "*_{single,double}strand.bam" into ch_prepped_for_sexdeterrmine

  when:
  params.run_sexdeterrmine

  script:
  """
  mv ${bam} ${bam.baseName}_${strandedness}strand.bam
  """

}

// As we collect all files for a single sex_deterrmine run, we DO NOT use the normal input/output tuple
process sexdeterrmine {
    label 'mc_small'
    publishDir "${params.outdir}/sex_determination", mode: params.publish_dir_mode

    input:
    path bam from ch_prepped_for_sexdeterrmine.collect()
    path(bed) from ch_bed_for_sexdeterrmine

    output:
    file "SexDet.txt"
    file "*.json" into ch_sexdet_for_multiqc

    when:
    params.run_sexdeterrmine
    
    script:
    def filter = bed.getName() != 'nf-core_eager_dummy.txt' ? "-b $bed" : ''
    """
    ls *.bam >> bamlist.txt
    samtools depth -aa -q30 -Q30 $filter -f bamlist.txt | sexdeterrmine -f bamlist.txt > SexDet.txt
    """
}

// Human DNA nuclear contamination estimation

 process nuclear_contamination{
    label 'sc_small'
    tag "${samplename}"
    publishDir "${params.outdir}/nuclear_contamination", mode: params.publish_dir_mode

    when:
    params.run_nuclear_contamination

    input:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(input), path(bai) from ch_for_nuclear_contamination

    output:
    tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path('*.X.contamination.out') into ch_from_nuclear_contamination

    script:
    """
    samtools index ${input}
    angsd -i ${input} -r ${params.contamination_chrom_name}:5000000-154900000 -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 30 -out ${libraryid}.doCounts
    contamination -a ${libraryid}.doCounts.icnts.gz -h ${projectDir}/assets/angsd_resources/HapMapChrX.gz 2> ${libraryid}.X.contamination.out
    """
 }
 
// As we collect all files for a single print_nuclear_contamination run, we DO NOT use the normal input/output tuple
process print_nuclear_contamination{
    label 'sc_tiny'
    publishDir "${params.outdir}/nuclear_contamination", mode: params.publish_dir_mode

    when:
    params.run_nuclear_contamination

    input:
    path Contam from ch_from_nuclear_contamination.map { it[7] }.collect()

    output:
    file 'nuclear_contamination.txt'
    file 'nuclear_contamination_mqc.json' into ch_nuclear_contamination_for_multiqc

    script:
    """
    print_x_contamination.py ${Contam.join(' ')}
    """
 }

/////////////////////////////////////////////////////////
/* --    METAGENOMICS-SPECIFIC ADDITIONAL STEPS     -- */
/////////////////////////////////////////////////////////

// Low-entropy read filter: removes highly uninformative (low-complexity) reads to reduce runtime and false positives

process metagenomic_complexity_filter {
  label 'mc_small'
  tag "${samplename}"
  publishDir "${params.outdir}/metagenomic_complexity_filter/", mode: params.publish_dir_mode

  when:
  params.metagenomic_complexity_filter
  
  input:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(fastq) from ch_bam_filtering_for_metagenomic


  output:
  tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_lowcomplexityremoved.fq.gz") into ch_lowcomplexityfiltered_for_metagenomic
  path("*_bbduk.stats") into ch_metagenomic_complexity_filter_for_multiqc

  script:
  """
  bbduk.sh -Xmx${task.memory.toGiga()}g in=${fastq} threads=${task.cpus} entropymask=f entropy=${params.metagenomic_complexity_entropy} out=${fastq}_lowcomplexityremoved.fq.gz 2> ${fastq}_bbduk.stats
  """

}

// metagenomic complexity filter bypass

if ( params.metagenomic_complexity_filter ) {
  ch_lowcomplexityfiltered_for_metagenomic
    .set{ ch_filtered_for_metagenomic }
} else {
  ch_metagenomic_for_skipentropyfilter
    .set{ ch_filtered_for_metagenomic }
}

// MALT is a super-fast BLAST replacement typically used for pathogen detection or microbiome profiling against large databases, here using off-target reads from mapping

// As we collect all files for all metagenomic runs, we DO NOT use the normal input/output tuple!
if (params.metagenomic_tool == 'malt') {
  ch_filtered_for_metagenomic
    .set {ch_input_for_metagenomic_malt}

  ch_input_for_metagenomic_kraken = Channel.empty()
} else if (params.metagenomic_tool == 'kraken') {
  ch_filtered_for_metagenomic
    .set {ch_input_for_metagenomic_kraken}

  ch_input_for_metagenomic_malt = Channel.empty()
} else if ( !params.metagenomic_tool ) {
  ch_input_for_metagenomic_malt = Channel.empty()
  ch_input_for_metagenomic_kraken = Channel.empty()

}

// As we collect all files for a single MALT run, we DO NOT use the normal input/output tuple
process malt {
  label 'mc_small'
  publishDir "${params.outdir}/metagenomic_classification/malt", mode: params.publish_dir_mode

  when:
  params.run_metagenomic_screening && params.run_bam_filtering && params.bam_unmapped_type == 'fastq' && params.metagenomic_tool == 'malt'

  input:
  file fastqs from ch_input_for_metagenomic_malt.map { it[7] }.collect()
  file db from ch_db_for_malt

  output:
  path("*.rma6") into ch_rma_for_maltExtract
  path("*.sam.gz") optional true
  path("malt.log") into ch_malt_for_multiqc

  script:
  if ( "${params.malt_min_support_mode}" == "percent" ) {
    min_supp = "-supp ${params.malt_min_support_percent}" 
  } else if ( "${params.malt_min_support_mode}" == "reads" ) {
    min_supp = "-sup ${params.metagenomic_min_support_reads}"
  }
  def sam_out = params.malt_sam_output ? "-a . -f SAM" : ""
  """
  malt-run \
  -J-Xmx${task.memory.toGiga()}g \
  -t ${task.cpus} \
  -v \
  -o . \
  -d ${db} \
  ${sam_out} \
  -id ${params.percent_identity} \
  -m ${params.malt_mode} \
  -at ${params.malt_alignment_mode} \
  -top ${params.malt_top_percent} \
  ${min_supp} \
  -mq ${params.malt_max_queries} \
  --memoryMode ${params.malt_memory_mode} \
  -i ${fastqs.join(' ')} |&tee malt.log
  """
}

// MaltExtract performs aDNA evaluation from the output of MALT (damage patterns, read lengths etc.)

// As we collect all files for a single MaltExtract run, we DO NOT use the normal input/output tuple
process maltextract {
  label 'mc_medium'
  publishDir "${params.outdir}/maltextract/", mode: params.publish_dir_mode

  when: 
  params.run_maltextract && params.metagenomic_tool == 'malt'

  input:
  file rma6 from ch_rma_for_maltExtract.collect()
  file taxon_list from ch_taxonlist_for_maltextract
  file ncbifiles from ch_ncbifiles_for_maltextract
  
  output:
  path "results/" type('dir')
  file "results/*_Wevid.json" optional true into ch_hops_for_multiqc 

  script:
  def destack = params.maltextract_destackingoff ? "--destackingOff" : ""
  def downsam = params.maltextract_downsamplingoff ? "--downSampOff" : ""
  def dupremo = params.maltextract_duplicateremovaloff ? "--dupRemOff" : ""
  def matches = params.maltextract_matches ? "--matches" : ""
  def megsum = params.maltextract_megansummary ? "--meganSummary" : ""
  def topaln = params.maltextract_topalignment ?  "--useTopAlignment" : ""
  def ss = params.single_stranded ? "--singleStranded" : ""
  """
  MaltExtract \
  -Xmx${task.memory.toGiga()}g \
  -t ${taxon_list} \
  -i ${rma6.join(' ')} \
  -o results/ \
  -r ${ncbifiles} \
  -p ${task.cpus} \
  -f ${params.maltextract_filter} \
  -a ${params.maltextract_toppercent} \
  --minPI ${params.maltextract_percentidentity} \
  ${destack} \
  ${downsam} \
  ${dupremo} \
  ${matches} \
  ${megsum} \
  ${topaln} \
  ${ss}

  postprocessing.AMPS.r -r results/ -m ${params.maltextract_filter} -t ${task.cpus} -n ${taxon_list} -j
  """
}

// Kraken is offered as a replacement for MALT as MALT is _very_ resource hungry

if (params.run_metagenomic_screening && params.database && params.database.endsWith(".tar.gz") && params.metagenomic_tool == 'kraken'){
  comp_kraken = file(params.database)

  process decomp_kraken {
    input:
    path(ckdb) from comp_kraken
    
    output:
    path(dbname) into ch_krakendb
    
    script:
    dbname = ckdb.toString() - '.tar.gz'
    """
    tar xvzf $ckdb
    mkdir -p $dbname
    mv *.k2d $dbname || echo "nothing to do"
    """
  }

} else if (params.database && ! params.database.endsWith(".tar.gz") && params.run_metagenomic_screening && params.metagenomic_tool == 'kraken') {
    ch_krakendb = Channel.fromPath(params.database).first()
} else {
    ch_krakendb = Channel.empty()
}
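
// Illustrative note (not part of the pipeline logic): decomp_kraken derives the
// database directory name by stripping the '.tar.gz' suffix with Groovy's String
// minus operator, which removes the first occurrence of its argument. The file
// name below is hypothetical:
//
//   'my_kraken_db.tar.gz' - '.tar.gz'   // evaluates to 'my_kraken_db'
//
// The process then moves any extracted *.k2d files into that directory, so this
// sketch assumes the archive unpacks either to loose *.k2d files or to a
// directory already named after the archive.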

process kraken {
  tag "$prefix"
  label 'mc_huge'
  publishDir "${params.outdir}/metagenomic_classification/kraken", mode: params.publish_dir_mode

  when:
  params.run_metagenomic_screening && params.run_bam_filtering && params.bam_unmapped_type == 'fastq' && params.metagenomic_tool == 'kraken'

  input:
  path(fastq) from ch_input_for_metagenomic_kraken.map { it[7] }
  path(krakendb) from ch_krakendb

  output:
  file "*.kraken.out" optional true into ch_kraken_out
  tuple prefix, path("*.kraken2_report") optional true into ch_kraken_report, ch_kraken_for_multiqc

  script:
  prefix = fastq.baseName
  out = prefix+".kraken.out"
  kreport = prefix+".kraken2_report"
  kreport_old = prefix+".kreport"

  """
  kraken2 --db ${krakendb} --threads ${task.cpus} --output $out --report-minimizer-data --report $kreport $fastq
  cut -f1-3,6-8 $kreport > $kreport_old
  """
}

process kraken_parse {
  tag "$name"
  errorStrategy 'ignore'

  input:
  tuple val(name), path(kraken_r) from ch_kraken_report

  output:
  path('*_kraken_parsed.csv') into ch_kraken_parsed

  script:
  read_out = name+".read_kraken_parsed.csv"
  kmer_out =  name+".kmer_kraken_parsed.csv"
  """
  kraken_parse.py -c ${params.metagenomic_min_support_reads} -or $read_out -ok $kmer_out $kraken_r
  """    
}

process kraken_merge {
  publishDir "${params.outdir}/metagenomic_classification/kraken", mode: params.publish_dir_mode

  input:
  file csv_count from ch_kraken_parsed.collect()

  output:
  path('*.csv')

  script:
  read_out = "kraken_read_count.csv"
  kmer_out = "kraken_kmer_duplication.csv"
  """
  merge_kraken_res.py -or $read_out -ok $kmer_out
  """    
}

//////////////////////////////////////
/* --    PIPELINE COMPLETION     -- */
//////////////////////////////////////

// Pipeline documentation for on-server guidance

process output_documentation {
    label 'sc_tiny'
    publishDir "${params.outdir}/documentation", mode: params.publish_dir_mode

    input:
    file output_docs from ch_output_docs
    file images from ch_output_docs_images

    output:
    file "results_description.html"

    script:
    """
    markdown_to_html.py $output_docs -o results_description.html
    """
}

/*
 * Parse software version numbers
 */

process get_software_versions {
  label 'mc_small'
    publishDir "${params.outdir}/pipeline_info", mode: params.publish_dir_mode,
        saveAs: { filename ->
                      if (filename.indexOf(".csv") > 0) filename
                      else null
                }

    output:
    file 'software_versions_mqc.yaml' into software_versions_yaml
    file "software_versions.csv"

    script:
    """
    echo $workflow.manifest.version &> v_pipeline.txt
    echo $workflow.nextflow.version &> v_nextflow.txt
    
    fastqc -t ${task.cpus} --version &> v_fastqc.txt 2>&1 || true
    AdapterRemoval --version  &> v_adapterremoval.txt 2>&1 || true
    fastp --version &> v_fastp.txt 2>&1 || true
    bwa &> v_bwa.txt 2>&1 || true
    circulargenerator -Xmx${task.memory.toGiga()}g --help | head -n 1 &> v_circulargenerator.txt 2>&1 || true
    samtools --version &> v_samtools.txt 2>&1 || true
    dedup -Xmx${task.memory.toGiga()}g -v &> v_dedup.txt 2>&1 || true
    ## The bioconda picard recipe prints an extra warning to stderr; this redirection filters it out so that only the version is written
    ( exec 7>&1; picard -Xmx${task.memory.toMega()}M MarkDuplicates --version 2>&1 >&7 | grep -v '/' >&2 ) 2> v_markduplicates.txt || true
    qualimap --version --java-mem-size=${task.memory.toGiga()}G &> v_qualimap.txt 2>&1 || true
    preseq &> v_preseq.txt 2>&1 || true
    gatk --java-options "-Xmx${task.memory.toGiga()}G" --version 2>&1 | grep '(GATK)' > v_gatk.txt 2>&1 || true
    gatk3 -Xmx${task.memory.toGiga()}g  --version 2>&1 | head -n 1 > v_gatk3.txt 2>&1 || true
    freebayes --version &> v_freebayes.txt 2>&1 || true
    bedtools --version &> v_bedtools.txt 2>&1 || true
    damageprofiler -Xmx${task.memory.toGiga()}g --version &> v_damageprofiler.txt 2>&1 || true
    bam --version &> v_bamutil.txt 2>&1 || true
    pmdtools --version &> v_pmdtools.txt 2>&1 || true
    angsd -h |& head -n 1 | cut -d ' ' -f3-4 &> v_angsd.txt 2>&1 || true 
    multivcfanalyzer -Xmx${task.memory.toGiga()}g --help | head -n 1 &> v_multivcfanalyzer.txt 2>&1 || true
    malt-run -J-Xmx${task.memory.toGiga()}g --help |& tail -n 3 | head -n 1 | cut -f 2 -d'(' | cut -f 1 -d ',' &> v_malt.txt 2>&1 || true
    MaltExtract -Xmx${task.memory.toGiga()}g --help | head -n 2 | tail -n 1 &> v_maltextract.txt 2>&1 || true
    multiqc --version &> v_multiqc.txt 2>&1 || true
    vcf2genome -Xmx${task.memory.toGiga()}g -h |& head -n 1 &> v_vcf2genome.txt || true
    mtnucratio -Xmx${task.memory.toGiga()}g --help &> v_mtnucratiocalculator.txt || true
    sexdeterrmine --version &> v_sexdeterrmine.txt || true
    kraken2 --version | head -n 1 &> v_kraken.txt || true
    endorS.py --version &> v_endorSpy.txt || true
    pileupCaller --version &> v_sequencetools.txt 2>&1 || true
    bowtie2 --version | grep -a 'bowtie2-.* -fdebug' > v_bowtie2.txt || true
    eigenstrat_snp_coverage --version | cut -d ' ' -f2 >v_eigenstrat_snp_coverage.txt || true
    mapDamage --version > v_mapdamage.txt || true
    bbversion.sh > v_bbduk.txt || true
    bcftools --version | grep 'bcftools' | cut -d ' ' -f 2 > v_bcftools.txt || true
    scrape_software_versions.py &> software_versions_mqc.yaml
    """
}

// MultiQC file generation for pipeline report
//def workflow_summary = NfcoreSchema.params_summary_multiqc(workflow, summary_params)

//ch_workflow_summary = Channel.value(workflow_summary)

process multiqc {
    label 'sc_medium'

    publishDir "${params.outdir}/multiqc", mode: params.publish_dir_mode

    input:
    file multiqc_config from ch_multiqc_config
    file (mqc_custom_config) from ch_multiqc_custom_config.collect().ifEmpty([])
    file software_versions_mqc from software_versions_yaml.collect().ifEmpty([])
    file logo from ch_eager_logo
    file ('fastqc_raw/*') from ch_prefastqc_for_multiqc.collect().ifEmpty([])
    path('fastqc/*') from ch_fastqc_after_clipping.collect().ifEmpty([])
    file ('adapter_removal/*') from ch_adapterremoval_logs.collect().ifEmpty([])
    file ('mapping/bt2/*') from ch_bt2_for_multiqc.collect().ifEmpty([])
    file ('flagstat/*') from ch_flagstat_for_multiqc.collect().ifEmpty([])
    file ('flagstat_filtered/*') from ch_bam_filtered_flagstat_for_multiqc.collect().ifEmpty([])
    file ('preseq/*') from ch_preseq_for_multiqc.collect().ifEmpty([])
    file ('damageprofiler/dmgprof*/*') from ch_damageprofiler_results.collect().ifEmpty([])
    file ('mapdamage/*') from ch_mapdamage_for_multiqc.collect().ifEmpty([])
    file ('qualimap/qualimap*/*') from ch_qualimap_results.collect().ifEmpty([])
    file ('markdup/*') from ch_markdup_results_for_multiqc.collect().ifEmpty([])
    file ('dedup*/*') from ch_dedup_results_for_multiqc.collect().ifEmpty([])
    file ('fastp/*') from ch_fastp_for_multiqc.collect().ifEmpty([])
    file ('sexdeterrmine/*') from ch_sexdet_for_multiqc.collect().ifEmpty([])
    file ('mutnucratio/*') from ch_mtnucratio_for_multiqc.collect().ifEmpty([])
    file ('endorspy/*') from ch_endorspy_for_multiqc.collect().ifEmpty([])
    file ('multivcfanalyzer/*') from ch_multivcfanalyzer_for_multiqc.collect().ifEmpty([])
    file ('fastp_lowcomplexityfilter/*') from ch_metagenomic_complexity_filter_for_multiqc.collect().ifEmpty([])
    file ('malt/*') from ch_malt_for_multiqc.collect().ifEmpty([])
    file ('kraken/*') from ch_kraken_for_multiqc.collect().ifEmpty([])
    file ('hops/*') from ch_hops_for_multiqc.collect().ifEmpty([])
    file ('nuclear_contamination/*') from ch_nuclear_contamination_for_multiqc.collect().ifEmpty([])
    file ('genotyping/*') from ch_eigenstrat_snp_cov_for_multiqc.collect().ifEmpty([])
    file ('bcftools_stats') from ch_bcftools_stats_for_multiqc.collect().ifEmpty([])
    file workflow_summary from ch_workflow_summary.collectFile(name: "workflow_summary_mqc.yaml")

    output:
    file "*multiqc_report.html" into ch_multiqc_report
    file "*_data"

    script:
    rtitle = ''
    rfilename = ''
    if (!(workflow.runName ==~ /[a-z]+_[a-z]+/)) {
        rtitle = "--title \"${workflow.runName}\""
        rfilename = "--filename " + workflow.runName.replaceAll('\\W','_').replaceAll('_+','_') + "_multiqc_report"
    }
    
    def custom_config_file = params.multiqc_config ? "--config $mqc_custom_config" : ''
    """
    multiqc -f $rtitle $rfilename $multiqc_config $custom_config_file .
    """
}
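
// Illustrative note on the report naming above (run names here are hypothetical).
// Nextflow's auto-generated run names such as 'angry_pasteur' match
// /[a-z]+_[a-z]+/, so no --title or --filename is passed to MultiQC. A custom
// name that happens to fit that pattern (e.g. 'my_run') is treated the same way.
// Any other custom name is sanitised before use, e.g.:
//
//   'My Run #1'.replaceAll('\\W','_').replaceAll('_+','_') + "_multiqc_report"
//   // evaluates to 'My_Run_1_multiqc_report'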

// Send completion emails if requested, so user knows data is ready

workflow.onComplete {

    // Set up the e-mail variables
    def subject = "[nf-core/eager] Successful: $workflow.runName"
    if (!workflow.success) {
        subject = "[nf-core/eager] FAILED: $workflow.runName"
    }
    def email_fields = [:]
    email_fields['version'] = workflow.manifest.version
    email_fields['runName'] = workflow.runName
    email_fields['success'] = workflow.success
    email_fields['dateComplete'] = workflow.complete
    email_fields['duration'] = workflow.duration
    email_fields['exitStatus'] = workflow.exitStatus
    email_fields['errorMessage'] = (workflow.errorMessage ?: 'None')
    email_fields['errorReport'] = (workflow.errorReport ?: 'None')
    email_fields['commandLine'] = workflow.commandLine
    email_fields['projectDir'] = workflow.projectDir
    email_fields['summary'] = summary
    email_fields['summary']['Date Started'] = workflow.start
    email_fields['summary']['Date Completed'] = workflow.complete
    email_fields['summary']['Pipeline script file path'] = workflow.scriptFile
    email_fields['summary']['Pipeline script hash ID'] = workflow.scriptId
    if (workflow.repository) email_fields['summary']['Pipeline repository Git URL'] = workflow.repository
    if (workflow.commitId) email_fields['summary']['Pipeline repository Git Commit'] = workflow.commitId
    if (workflow.revision) email_fields['summary']['Pipeline Git branch/tag'] = workflow.revision
    email_fields['summary']['Nextflow Version'] = workflow.nextflow.version
    email_fields['summary']['Nextflow Build'] = workflow.nextflow.build
    email_fields['summary']['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp

    // On success, try to attach the MultiQC report
    def mqc_report = null
    try {
        if (workflow.success) {
            mqc_report = ch_multiqc_report.getVal()
            if (mqc_report.getClass() == ArrayList) {
                log.warn "[nf-core/eager] Found multiple reports from process 'multiqc', will use only one"
                mqc_report = mqc_report[0]
            }
        }
    } catch (all) {
        log.warn "[nf-core/eager] Could not attach MultiQC report to summary email"
    }

    // Check if we are only sending emails on failure
    email_address = params.email
    if (!params.email && params.email_on_fail && !workflow.success) {
        email_address = params.email_on_fail
    }

    // Render the TXT template
    def engine = new groovy.text.GStringTemplateEngine()
    def tf = new File("$projectDir/assets/email_template.txt")
    def txt_template = engine.createTemplate(tf).make(email_fields)
    def email_txt = txt_template.toString()

    // Render the HTML template
    def hf = new File("$projectDir/assets/email_template.html")
    def html_template = engine.createTemplate(hf).make(email_fields)
    def email_html = html_template.toString()

    // Render the sendmail template
    def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "$projectDir", mqcFile: mqc_report, mqcMaxSize: params.max_multiqc_email_size.toBytes() ]
    def sf = new File("$projectDir/assets/sendmail_template.txt")
    def sendmail_template = engine.createTemplate(sf).make(smail_fields)
    def sendmail_html = sendmail_template.toString()

    // Send the HTML e-mail
    if (email_address) {
        try {
            if (params.plaintext_email) { throw new GroovyException('Send plaintext e-mail, not HTML') }
            // Try to send HTML e-mail using sendmail
            [ 'sendmail', '-t' ].execute() << sendmail_html
            log.info "[nf-core/eager] Sent summary e-mail to $email_address (sendmail)"
        } catch (all) {
            // Catch failures and fall back to the `mail` command
            def mail_cmd = [ 'mail', '-s', subject, '--content-type=text/html', email_address ]
            if ( mqc_report != null && mqc_report.size() <= params.max_multiqc_email_size.toBytes() ) {
              mail_cmd += [ '-A', mqc_report ]
            }
            mail_cmd.execute() << email_html
            log.info "[nf-core/eager] Sent summary e-mail to $email_address (mail)"
        }
    }

    // Write summary e-mail HTML to a file
    def output_d = new File("${params.outdir}/pipeline_info/")
    if (!output_d.exists()) {
        output_d.mkdirs()
    }
    def output_hf = new File(output_d, "pipeline_report.html")
    output_hf.withWriter { w -> w << email_html }
    def output_tf = new File(output_d, "pipeline_report.txt")
    output_tf.withWriter { w -> w << email_txt }

    c_green = params.monochrome_logs ? '' : "\033[0;32m";
    c_purple = params.monochrome_logs ? '' : "\033[0;35m";
    c_red = params.monochrome_logs ? '' : "\033[0;31m";
    c_reset = params.monochrome_logs ? '' : "\033[0m";

    if (workflow.stats.ignoredCount > 0 && workflow.success) {
        log.info "-${c_purple}Warning, pipeline completed, but with errored process(es) ${c_reset}-"
        log.info "-${c_red}Number of ignored errored process(es) : ${workflow.stats.ignoredCount} ${c_reset}-"
        log.info "-${c_green}Number of successfully run process(es) : ${workflow.stats.succeedCount} ${c_reset}-"
    }

    if (workflow.success) {
        log.info "-${c_purple}[nf-core/eager]${c_green} Pipeline completed successfully${c_reset}-"
        log.info "-${c_purple}[nf-core/eager]${c_green} MultiQC run report can be found in ${params.outdir}/multiqc ${c_reset}-"
        log.info "-${c_purple}[nf-core/eager]${c_green} Further output documentation can be seen at https://nf-co.re/eager/output ${c_reset}-"
    } else {
        checkHostname()
        log.info "-${c_purple}[nf-core/eager]${c_red} Pipeline completed with errors${c_reset}-"
    }

}

workflow.onError {
    // Print unexpected parameters - easiest is to just rerun validation
    NfcoreSchema.validateParameters(params, json_schema, log)
}


/////////////////////////////////////
/* --    AUXILIARY FUNCTIONS    -- */
/////////////////////////////////////

// Build a channel from the input TSV file listing FASTQ or BAM files, validating each row
def extract_data(tsvFile) {
    Channel.fromPath(tsvFile)
        .splitCsv(header: true, sep: '\t')
        .map { row ->

            def expected_keys = ['Sample_Name', 'Library_ID', 'Lane', 'Colour_Chemistry', 'SeqType', 'Organism', 'Strandedness', 'UDG_Treatment', 'R1', 'R2', 'BAM']
            if ( !row.keySet().containsAll(expected_keys) ) exit 1, "[nf-core/eager] error: Invalid TSV input - malformed column names. Please check input TSV. Column names should be: ${expected_keys.join(", ")}"

            checkNumberOfItem(row, 11)

            if ( row.Sample_Name == null || row.Sample_Name.isEmpty() ) exit 1, "[nf-core/eager] error: the Sample_Name column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.Library_ID == null || row.Library_ID.isEmpty() ) exit 1, "[nf-core/eager] error: the Library_ID column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.Lane == null || row.Lane.isEmpty() ) exit 1, "[nf-core/eager] error: the Lane column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.Colour_Chemistry == null || row.Colour_Chemistry.isEmpty() ) exit 1, "[nf-core/eager] error: the Colour_Chemistry column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.SeqType == null || row.SeqType.isEmpty() ) exit 1, "[nf-core/eager] error: the SeqType column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.Organism == null || row.Organism.isEmpty() ) exit 1, "[nf-core/eager] error: the Organism column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.Strandedness == null || row.Strandedness.isEmpty() ) exit 1, "[nf-core/eager] error: the Strandedness column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.UDG_Treatment == null || row.UDG_Treatment.isEmpty() ) exit 1, "[nf-core/eager] error: the UDG_Treatment column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.R1 == null || row.R1.isEmpty() ) exit 1, "[nf-core/eager] error: the R1 column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.R2 == null || row.R2.isEmpty() ) exit 1, "[nf-core/eager] error: the R2 column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"
            if ( row.BAM == null || row.BAM.isEmpty() ) exit 1, "[nf-core/eager] error: the BAM column is empty. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"

            def samplename = row.Sample_Name
            def libraryid  = row.Library_ID
            def lane = row.Lane
            def colour = row.Colour_Chemistry
            def seqtype = row.SeqType
            def organism = row.Organism
            def strandedness = row.Strandedness
            def udg = row.UDG_Treatment
            def r1 = row.R1.matches('NA') ? 'NA' : return_file(row.R1)
            def r2 = row.R2.matches('NA') ? 'NA' : return_file(row.R2)
            def bam = row.BAM.matches('NA') ? 'NA' : return_file(row.BAM)

            // check no empty metadata fields
            if (samplename == '' || libraryid == '' || lane == '' || colour == '' || seqtype == '' || organism == '' || strandedness == '' || udg == '' || r1 == '' || r2 == '' || bam == '') exit 1, "[nf-core/eager] error: a field/column does not contain any information. Ensure all cells are filled or contain 'NA' for optional fields. Check row:\n ${row}"

            // Check no 'empty' rows
            if (r1.matches('NA') && r2.matches('NA') && bam.matches('NA')) exit 1, "[nf-core/eager] error: A row in your TSV appears to have all files defined as NA. See '--help' flag and documentation under 'running the pipeline' for more information. Check row for: ${samplename}"

            // Ensure BAMs aren't submitted with PE
            if (!bam.matches('NA') && seqtype.matches('PE')) exit 1, "[nf-core/eager] error: BAM input rows in TSV cannot be set as PE, only SE. See '--help' flag and documentation under 'running the pipeline' for more information. Check row for: ${samplename}"

            // Check valid UDG treatment
            if (!udg.matches('none') && !udg.matches('half') && !udg.matches('full')) exit 1, "[nf-core/eager] error: UDG treatment can only be 'none', 'half' or 'full'. See '--help' flag and documentation under 'running the pipeline' for more information. You have '${udg}'"

            // Check valid colour chemistry
            if (!colour.matches('2') && !colour.matches('4')) exit 1, "[nf-core/eager] error: Colour chemistry in TSV can either be 2 (e.g. NextSeq/NovaSeq) or 4 (e.g. HiSeq/MiSeq)"

            // Check valid sequencing type
            if (!seqtype.matches('PE') && !seqtype.matches('SE')) exit 1, "[nf-core/eager] error: SeqType for one or more rows in TSV is neither SE nor PE! See '--help' flag and documentation under 'running the pipeline' for more information. You have: '${seqtype}'"
            
            // Reject existing files that are in the wrong format, e.g. FASTA or SAM
            if ( !r1.matches('NA') && !has_extension(r1, "fastq.gz") && !has_extension(r1, "fq.gz") && !has_extension(r1, "fastq") && !has_extension(r1, "fq")) exit 1, "[nf-core/eager] error: A specified R1 file either has a non-recognizable FASTQ extension or is not NA. See '--help' flag and documentation under 'running the pipeline' for more information. Check: ${r1}"
            if ( !r2.matches('NA') && !has_extension(r2, "fastq.gz") && !has_extension(r2, "fq.gz") && !has_extension(r2, "fastq") && !has_extension(r2, "fq")) exit 1, "[nf-core/eager] error: A specified R2 file either has a non-recognizable FASTQ extension or is not NA. See '--help' flag and documentation under 'running the pipeline' for more information. Check: ${r2}"
            if ( !bam.matches('NA') && !has_extension(bam, "bam")) exit 1, "[nf-core/eager] error: A specified BAM file either has a non-recognizable BAM extension or is not NA. See '--help' flag and documentation under 'running the pipeline' for more information. Check: ${bam}"
            
            [ samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2, bam ]

        }

    }
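
// Illustrative sketch of an input TSV accepted by extract_data() (tab-separated;
// the sample names and file paths below are hypothetical, and any file listed
// must exist on disk). All eleven columns must be present, and unused file
// columns must be set to 'NA':
//
//   Sample_Name  Library_ID  Lane  Colour_Chemistry  SeqType  Organism  Strandedness  UDG_Treatment  R1  R2  BAM
//   SampleA      LibA1       1     4                 PE       Homo sapiens  double  none  /path/LibA1_R1.fastq.gz  /path/LibA1_R2.fastq.gz  NA
//   SampleB      LibB1       1     2                 SE       Homo sapiens  double  full  NA  NA  /path/LibB1.bam
//
// BAM rows must use SeqType 'SE', Colour_Chemistry must be 2 or 4, and
// UDG_Treatment must be one of 'none', 'half' or 'full'.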

// Check if a row has the expected number of items
def checkNumberOfItem(row, number) {
    if (row.size() != number) exit 1, "[nf-core/eager] error: Invalid TSV input - malformed row (e.g. missing column) in ${row}, see '--help' flag and documentation under 'running the pipeline' for more information"
    return true
}

// Return file if it exists
def return_file(it) {
    if (!file(it).exists()) exit 1, "[nf-core/eager] error: Cannot find supplied FASTQ or BAM input file. If using input method TSV set to NA if no file required. See '--help' flag and documentation under 'running the pipeline' for more information. Check file: ${it}" 
    return file(it)
}

// Check file extension
def has_extension(it, extension) {
    it.toString().toLowerCase().endsWith(extension.toLowerCase())
}
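
// Illustrative usage (file names are hypothetical); the comparison is
// case-insensitive and only looks at the end of the string:
//
//   has_extension('LibA1_R1.FASTQ.GZ', 'fastq.gz')   // true
//   has_extension('LibA1.bam', 'fq')                 // false
//   has_extension('LibA1.notfastq', 'fastq')         // true, as no leading '.' is required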

// Extract FastQs from Path
// Create a channel of FASTQs from a directory pattern: "my_samples/*/"
// All FASTQ files in subdirectories are collected and emitted;
// they must have _R1_ and/or _R2_ in their names.
def retrieve_input_paths(input, colour_chem, pe_se, ds_ss, udg_treat, bam_in) {

  if ( !bam_in ) {
        if( pe_se ) {
            log.info "Generating single-end FASTQ data TSV"
            Channel
                .fromFilePairs( input, size: 1 )
                .filter { it =~/.*.fastq.gz|.*.fq.gz|.*.fastq|.*.fq/ }
                .ifEmpty { exit 1, "[nf-core/eager] error: Your specified FASTQ read files did not end in: '.fastq.gz', '.fq.gz', '.fastq', or '.fq'. Did you forget --bam?" }
                .map { row -> [ row[0], [ row[1][0] ] ] }
                .ifEmpty { exit 1, "[nf-core/eager] error: --input was empty - no input files supplied!" }
                .into { ch_reads_for_faketsv; ch_reads_for_validate }

                // Check we don't have any duplicated sample names due to fromFilePairs behaviour of calculating sample name from anything before R1/R2 glob
                ch_reads_for_validate
                  .groupTuple()
                  .map{
                    if ( validate_size(it[1], 1) ) { null } else { exit 1, "[nf-core/eager] error: You have supplied non-unique sample names (text before R1/R2 indication). Did you accidentally supply paired-end data? See '--help' flag and documentation under 'running the pipeline' for more information. Check duplicates of: ${it[0]}" } 
                  }

        } else if (!pe_se ){
            log.info "Generating paired-end FASTQ data TSV"

            Channel
                .fromFilePairs( input )
                .filter { it =~/.*.fastq.gz|.*.fq.gz|.*.fastq|.*.fq/ }
                .ifEmpty { exit 1, "[nf-core/eager] error: Files could not be found. Do the specified FASTQ read files end in: '.fastq.gz', '.fq.gz', '.fastq', or '.fq'? Did you forget --single_end?" }
                .map { row -> [ row[0], [ row[1][0], row[1][1] ] ] }
                .ifEmpty { exit 1, "[nf-core/eager] error: --input was empty - no input files supplied!" }
                .into { ch_reads_for_faketsv; ch_reads_for_validate }

                // Check we don't have any duplicated sample names due to fromFilePairs behaviour of calculating sample name from anything before R1/R2 glob
                ch_reads_for_validate
                  .groupTuple()
                  .map{
                    if ( validate_size(it[1], 1) ) { null } else { exit 1, "[nf-core/eager] error: You have supplied non-unique sample names (text before R1/R2 indication). See '--help' flag and documentation under 'running the pipeline' for more information. Check duplicates of: ${it[0]}" } 
                  }

        } 

    } else if ( bam_in ) {
        log.info "Generating BAM data TSV"

        Channel
            .fromFilePairs( input, size: 1 )
            .filter { it =~/.*.bam/ }
            .map { row -> [ row[0], [ row[1][0] ] ] }
            .ifEmpty { exit 1, "[nf-core/eager] error: Cannot find any bam file matching: ${input}" }
            .set { ch_reads_for_faketsv }

    }

ch_reads_for_faketsv
  .map{

      def samplename = it[0]
      def libraryid  = it[0]
      def lane = 0
      def colour = "${colour_chem}"
      def seqtype = pe_se ? 'SE' : 'PE'
      def organism = 'NA'
      def strandedness = ds_ss ? 'single' : 'double'
      def udg = udg_treat
      def r1 = !bam_in ? return_file(it[1][0]) : 'NA'
      def r2 = !bam_in && !pe_se ? return_file(it[1][1]) : 'NA'
      def bam = bam_in && pe_se ? return_file(it[1][0]) : 'NA'

      [ samplename, libraryid, lane, colour, seqtype, organism, strandedness, udg, r1, r2, bam ]
  }
  .ifEmpty {exit 1, "[nf-core/eager] error: Invalid file paths with --input"}

}

// Function to check length of collection in a channel closure is as expected (e.g. with .map())
def validate_size(collection, size){
    return collection.size() == size
}
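
// Illustrative usage with hypothetical values, mirroring the duplicate-sample
// check after groupTuple() in retrieve_input_paths():
//
//   validate_size([ ['LibA1_R1.fastq.gz'] ], 1)                       // true: one entry for the sample name
//   validate_size([ ['LibA1_R1.fastq.gz'], ['LibA1_R1.fq.gz'] ], 1)   // false: the sample name is duplicated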

def checkHostname() {
    def c_reset = params.monochrome_logs ? '' : "\033[0m"
    def c_white = params.monochrome_logs ? '' : "\033[0;37m"
    def c_red = params.monochrome_logs ? '' : "\033[1;91m"
    def c_yellow_bold = params.monochrome_logs ? '' : "\033[1;93m"
    if (params.hostnames) {
        def hostname = 'hostname'.execute().text.trim()
        params.hostnames.each { prof, hnames ->
            hnames.each { hname ->
                if (hostname.contains(hname) && !workflow.profile.contains(prof)) {
                    log.error "${c_red}====================================================${c_reset}\n" +
                            "  ${c_red}WARNING!${c_reset} You are running with `-profile $workflow.profile`\n" +
                            "  but your machine hostname is ${c_white}'$hostname'${c_reset}\n" +
                            "  ${c_yellow_bold}It's highly recommended that you use `-profile $prof${c_reset}`\n" +
                            "${c_red}====================================================${c_reset}\n"
                }
            }
        }
    }
}


================================================
FILE: nextflow.config
================================================
/*
 * -------------------------------------------------
 *  nf-core/eager Nextflow config file
 * -------------------------------------------------
 * Default config options for all environments.
 */
// Global default params, used in configs
params {

  // Workflow flags
  genome = false
  input = null
  input_paths = null
  single_end = false
  outdir = './results'
  publish_dir_mode = 'copy'
  config_profile_name = null

  // aws
  awsqueue = null
  awsregion = 'eu-west-1'
  awscli = null

  //Pipeline options
  enable_conda               = false

  //Input reads
  udg_type = 'none'
  single_stranded = false
  colour_chemistry = 4
  bam = false
  
  // Optional input information
  snpcapture_bed = null
  run_convertinputbam = false

  //Input reference
  fasta = null
  bwa_index = null
  bt2_index = null
  fasta_index = null
  seq_dict = null
  large_ref = false
  save_reference = false
  
  // Set to false by default only to suppress the iGenomes WARN; expected to be overwritten by the optional config load below.
  genomes = false 


  //Skipping parts of the pipeline for impatient users
  skip_fastqc = false
  skip_adapterremoval = false 
  skip_preseq = false
  skip_deduplication = false
  skip_damage_calculation = false
  skip_qualimap = false

  //More defaults
  complexity_filter_poly_g = false
  complexity_filter_poly_g_min = 10

  //Read clipping and merging parameters
  clip_forward_adaptor = 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
  clip_reverse_adaptor = 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
  clip_adapters_list = null 
  clip_readlength = 30
  clip_min_read_quality = 20
  min_adap_overlap = 1
  skip_collapse = false
  skip_trim = false
  preserve5p = false
  mergedonly = false
  qualitymax = 41
  run_post_ar_trimming = false
  post_ar_trim_front = 7
  post_ar_trim_tail = 7
  post_ar_trim_front2 = 7
  post_ar_trim_tail2 = 7

  //Mapping algorithm
  mapper = 'bwaaln'
  bwaalnn = 0.01 // From Oliva et al. 2021 (10.1093/bib/bbab076)
  bwaalnk = 2
  bwaalnl = 1024 // From Oliva et al. 2021 (10.1093/bib/bbab076)
  bwaalno = 2 // From Oliva et al. 2021 (10.1093/bib/bbab076)
  circularextension = 500
  circulartarget = 'MT'
  circularfilter = false
  bt2_alignmode = 'local' // from Cahill 2018 (10.1093/molbev/msy018) and, Poullet and Orlando (10.3389/fevo.2020.00105)
  bt2_sensitivity = 'sensitive' // from Poullet and Orlando (10.3389/fevo.2020.00105)
  bt2n = 0 // Do not set Cahill 2018 recommendation of 1 here, so not to 'hide' overriding bowtie2 presets
  bt2l = 0
  bt2_trim5 = 0
  bt2_trim3 = 0
  bt2_maxins = 500

  //Mapped read removal from input FASTQ
  hostremoval_input_fastq = false
  hostremoval_mode = 'remove'

  //BAM Filtering steps (default = discard unmapped reads)
  run_bam_filtering = false
  bam_mapping_quality_threshold = 0
  bam_filter_minreadlength = 0
  bam_unmapped_type = 'discard'

  //DeDuplication settings
  dedupper = 'markduplicates'
  dedup_all_merged = false

  //Preseq settings
  preseq_step_size = 1000
  preseq_mode = 'c_curve'
  preseq_bootstrap = 100
  preseq_maxextrap = 10000000000
  preseq_cval = 0.95
  preseq_terms = 100

  //Damage estimation settings
  damage_calculation_tool = 'damageprofiler'
  damageprofiler_length = 100
  damageprofiler_threshold = 15
  damageprofiler_yaxis = 0.30
  mapdamage_downsample = 0
  mapdamage_yaxis = 0.30

  //PMDTools settings
  run_pmdtools = false
  pmdtools_range = 10
  pmdtools_threshold = 3
  pmdtools_reference_mask = null
  pmdtools_max_reads = 10000
  pmdtools_platypus = false

  // mapDamage
  run_mapdamage_rescaling = false
  rescale_length_5p = 0
  rescale_length_3p = 0
  rescale_seqlength = 12

  //Bedtools settings
  run_bedtools_coverage = false
  anno_file = null
  anno_file_is_unsorted = false

  //bamUtils trimbam settings
  run_trim_bam = false 
  bamutils_clip_double_stranded_half_udg_left = 0
  bamutils_clip_double_stranded_half_udg_right = 0
  bamutils_clip_double_stranded_none_udg_left = 0
  bamutils_clip_double_stranded_none_udg_right = 0
  bamutils_clip_single_stranded_half_udg_left = 0
  bamutils_clip_single_stranded_half_udg_right = 0
  bamutils_clip_single_stranded_none_udg_left = 0
  bamutils_clip_single_stranded_none_udg_right = 0
  bamutils_softclip = false

  //Genotyping options
  run_genotyping = false
  genotyping_tool = null
  genotyping_source = 'raw'
  // gatk options
  gatk_call_conf = 30
  gatk_ploidy = 2
  gatk_downsample = 250
  gatk_dbsnp = null
  gatk_hc_out_mode = 'EMIT_VARIANTS_ONLY'
  gatk_hc_emitrefconf = 'GVCF'
  gatk_ug_genotype_model = 'SNP'
  gatk_ug_out_mode = 'EMIT_VARIANTS_ONLY'
  gatk_ug_keep_realign_bam = false
  gatk_ug_defaultbasequalities = null
  // freebayes options
  freebayes_C = 1
  freebayes_g = 0
  freebayes_p = 2
  // Sequencetools pileupCaller
  pileupcaller_snpfile = null
  pileupcaller_bedfile = null
  pileupcaller_method = 'randomHaploid'
  pileupcaller_transitions_mode = 'AllSites'
  pileupcaller_min_map_quality = 30
  pileupcaller_min_base_quality = 30
  // ANGSD Genotype Likelihoods
  angsd_glmodel = 'samtools'
  angsd_glformat = 'binary'
  angsd_createfasta = false
  angsd_fastamethod = 'random'
  run_bcftools_stats = true

  //Consensus sequence generation
  run_vcf2genome = false
  vcf2genome_outfile = ''
  vcf2genome_header = ''
  vcf2genome_minc = 5
  vcf2genome_minq = 30
  vcf2genome_minfreq = 0.8

  //MultiVCFAnalyzer Options
  run_multivcfanalyzer = false
  write_allele_frequencies = false
  min_genotype_quality = 30
  min_base_coverage = 5
  min_allele_freq_hom = 0.9
  min_allele_freq_het = 0.9
  additional_vcf_files = null
  reference_gff_annotations = 'NA'
  reference_gff_exclude = 'NA'
  snp_eff_results = 'NA'

  //mtnucratio
  run_mtnucratio = false
  mtnucratio_header = 'MT'

  //Sex.DetERRmine settings
  run_sexdeterrmine = false
  sexdeterrmine_bedfile = null

  //Nuclear contamination based on chromosome X heterozygosity.
  run_nuclear_contamination = false
  contamination_chrom_name = 'X' // Default to using hs37d5 name

  // taxonomic classifier
  run_metagenomic_screening  = false
  
  metagenomic_complexity_filter = false
  metagenomic_complexity_entropy = 0.3

  metagenomic_tool = null
  database  = null
  metagenomic_min_support_reads = 1
  percent_identity = 85
  malt_mode = 'BlastN'
  malt_alignment_mode = 'SemiGlobal'
  malt_top_percent = 1
  malt_min_support_mode = 'percent'
  malt_min_support_percent = 0.01
  malt_max_queries = 100
  malt_memory_mode = 'load'
  malt_sam_output = false

  // MaltExtract: numeric parameters are only included here if a default
  // is documented or if they duplicate a MALT parameter
  run_maltextract = false
  maltextract_taxon_list = null
  maltextract_ncbifiles = null
  maltextract_filter = 'def_anc'
  maltextract_toppercent = 0.01
  maltextract_destackingoff = false
  maltextract_downsamplingoff = false
  maltextract_duplicateremovaloff = false
  maltextract_matches = false
  maltextract_megansummary = false
  maltextract_percentidentity = 85.0
  maltextract_topalignment =  false

  // Boilerplate options
  multiqc_config = false
  email = false
  email_on_fail = false
  max_multiqc_email_size = 25.MB
  plaintext_email = false
  monochrome_logs = false
  help = false
  igenomes_base = 's3://ngi-igenomes/igenomes'
  tracedir = "${params.outdir}/pipeline_info"
  igenomes_ignore = true
  custom_config_version = 'master'
  custom_config_base = "https://raw.githubusercontent.com/nf-core/configs/${params.custom_config_version}"
  hostnames = false
  config_profile_description = false
  config_profile_contact = false
  config_profile_url = false
  validate_params = true
  show_hidden_params = false
  schema_ignore_params = 'genomes,input_paths'

  // Defaults only, expecting to be overwritten
  max_memory = 128.GB
  max_cpus = 16
  max_time = 240.h

}
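
// Sketch only: the defaults above are expected to be overridden, either on the
// command line (e.g. `--max_cpus 8`) or in a custom config supplied with `-c`.
// The file name and values below are hypothetical:
//
//   // custom.config
//   params {
//     max_memory = 64.GB
//     max_cpus   = 8
//     max_time   = 48.h
//   }
//
//   nextflow run nf-core/eager -profile docker -c custom.config \
//     --input '/<path>/<to>/*_{R1,R2}_*.fq.gz' --fasta '/<path>/<to>/my_reference.fasta'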

// Container slug. Stable releases should specify release tag!
// Developmental code should specify :dev
process.container = 'nfcore/eager:2.5.3'

// Load base.config by default for all pipelines
includeConfig 'conf/base.config'

// Load nf-core custom profiles from different Institutions
try {
  includeConfig "${params.custom_config_base}/nfcore_custom.config"
} catch (Exception e) {
  System.err.println("WARNING: Could not load nf-core/config profiles: ${params.custom_config_base}/nfcore_custom.config")
}

// Load nf-core/eager custom profiles from different institutions
try {
  includeConfig "${params.custom_config_base}/pipeline/eager.config"
} catch (Exception e) {
  System.err.println("WARNING: Could not load nf-core/config/eager profiles: ${params.custom_config_base}/pipeline/eager.config")
}

profiles {
  conda {
    docker.enabled = false
    singularity.enabled = false
    podman.enabled = false
    shifter.enabled = false
    charliecloud.enabled = false
    process.conda = "$projectDir/environment.yml"
  }
  debug { process.beforeScript = 'echo $HOSTNAME' }
  docker {
    docker.enabled = true
    singularity.enabled = false
    podman.enabled = false
    shifter.enabled = false
    charliecloud.enabled = false
    // Avoid this error:
    //   WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
    // Testing this in nf-core after discussion here https://github.com/nf-core/tools/pull/351
    // once this is established and works well, nextflow might implement this behavior as new default.
    docker.runOptions = '-u \$(id -u):\$(id -g)'
  }
  singularity {
    docker.enabled = false
    singularity.enabled = true
    podman.enabled = false
    shifter.enabled = false
    charliecloud.enabled = false
    singularity.autoMounts = true
  }
  podman {
    singularity.enabled = false
    docker.enabled = false
    podman.enabled = true
    shifter.enabled = false
    charliecloud.enabled = false
  }
  shifter {
    singularity.enabled = false
    docker.enabled = false
    podman.enabled = false
    shifter.enabled = true
    charliecloud.enabled = false
  }
  charliecloud {
    singularity.enabled = false
    docker.enabled = false
    podman.enabled = false
    shifter.enabled = false
    charliecloud.enabled = true
  }
  test { includeConfig 'conf/test.config'}
  test_direct { includeConfig 'conf/test_direct.config' }
  test_full { includeConfig 'conf/test_full.config' }
  test_bam { includeConfig 'conf/test_bam.config'}
  test_fna { includeConfig 'conf/test_fna.config'}
  test_humanbam { includeConfig 'conf/test_humanbam.config' }
  test_pretrim { includeConfig 'conf/test_pretrim.config' }
  test_kraken { includeConfig 'conf/test_kraken.config' }
  test_tsv_bam { includeConfig 'conf/test_tsv_bam.config'}
  test_tsv_fna { includeConfig 'conf/test_tsv_fna.config'}
  test_tsv_humanbam { includeConfig 'conf/test_tsv_humanbam.config' }
  test_tsv_pretrim { includeConfig 'conf/test_tsv_pretrim.config' }
  test_tsv_kraken { includeConfig 'conf/test_tsv_kraken.config' }
  test_tsv_complex { includeConfig 'conf/test_tsv_complex.config' }
  test_stresstest_human { includeConfig 'conf/test_stresstest_human.config' }
  benchmarking_human { includeConfig 'conf/benchmarking_human.config' }
  benchmarking_vikingfish { includeConfig 'conf/benchmarking_vikingfish.config' }
}


// Load igenomes.config if required
if (!params.igenomes_ignore) {
  includeConfig 'conf/igenomes.config'
}

// Export these variables to prevent local Python/R libraries from conflicting with those in the container
env {
  PYTHONNOUSERSITE = 1
  R_PROFILE_USER = "/.Rprofile"
  R_ENVIRON_USER = "/.Renviron"
}

// Capture exit codes from upstream processes when piping
process.shell = ['/bin/bash', '-euo', 'pipefail']

def trace_timestamp = new java.util.Date().format( 'yyyy-MM-dd_HH-mm-ss')
timeline {
  enabled = true
  file = "${params.tracedir}/execution_timeline_${trace_timestamp}.html"
}
report {
  enabled = true
  file = "${params.tracedir}/execution_report_${trace_timestamp}.html"
}
trace {
  enabled = true
  file = "${params.tracedir}/execution_trace_${trace_timestamp}.txt"
}
dag {
  enabled = true
  file = "${params.tracedir}/pipeline_dag_${trace_timestamp}.svg"
}

manifest {
  name = 'nf-core/eager'
  author = 'The nf-core/eager community'
  homePage = 'https://github.com/nf-core/eager'
  description = 'A fully reproducible and state-of-the-art ancient DNA analysis pipeline'
  mainScript = 'main.nf'
  nextflowVersion = '>=20.07.1'
  version = '2.5.3'
}

// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
  if (type == 'memory') {
    try {
      if (obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
        return params.max_memory as nextflow.util.MemoryUnit
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max memory '${params.max_memory}' is not valid! Using default value: $obj"
      return obj
    }
  } else if (type == 'time') {
    try {
      if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
        return params.max_time as nextflow.util.Duration
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max time '${params.max_time}' is not valid! Using default value: $obj"
      return obj
    }
  } else if (type == 'cpus') {
    try {
      return Math.min( obj, params.max_cpus as int )
    } catch (all) {
      println "   ### ERROR ###   Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
      return obj
    }
  }
}
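
// Sketch of how check_max() is typically called from a process resource block
// (e.g. in conf/base.config). The multipliers below are illustrative assumptions,
// not values copied from this pipeline:
//
//   process {
//     cpus   = { check_max( 2 * task.attempt, 'cpus' ) }
//     memory = { check_max( 8.GB * task.attempt, 'memory' ) }
//     time   = { check_max( 4.h * task.attempt, 'time' ) }
//   }
//
// With the defaults above (max_cpus = 16, max_memory = 128.GB), a request for
// 32 CPUs would be capped at 16 and a request for 200.GB at 128.GB; requests
// below the limits are returned unchanged.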

================================================
FILE: nextflow_schema.json
================================================
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "$id": "https://raw.githubusercontent.com/nf-core/eager/master/nextflow_schema.json",
    "title": "nf-core/eager pipeline parameters",
    "description": "A fully reproducible and state-of-the-art ancient DNA analysis pipeline",
    "type": "object",
    "definitions": {
        "input_output_options": {
            "title": "Input/output options",
            "type": "object",
            "fa_icon": "fas fa-terminal",
            "description": "Define where the pipeline should find input data, and additional metadata.",
            "required": [
                "input"
            ],
            "properties": {
                "input": {
                    "type": "string",
                    "description": "Either paths or URLs to FASTQ/BAM data (must be surrounded with quotes). For paired end data, the path must use '{1,2}' notation to specify read pairs. Alternatively, a path to a TSV file (ending .tsv) containing file paths and sequencing/sample metadata. Allows for merging of multiple lanes/libraries/samples. Please see documentation for template.",
                    "fa_icon": "fas fa-dna",
                    "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager. The more efficient but simpler way is to supply direct paths (with wildcards) to your FASTQ or BAM files, with each file or pair being considered a single library and each one run independently (e.g. for paired-end data: `--input '/<path>/<to>/*_{R1,R2}_*.fq.gz'`). TSV input requires creation of an extra file by the user (`--input '/<path>/<to>/eager_data.tsv'`) and extra metadata, but allows more powerful lane and library merging. Please see [usage docs](https://nf-co.re/eager/docs/usage#input-specifications) for detailed instructions and specifications."
                },
                "udg_type": {
                    "type": "string",
                    "default": "none",
                    "description": "Specifies whether you have UDG treated libraries. Set to 'half' for partial UDG treatment, or 'full' for complete UDG treatment. If not set, libraries are assumed to have no UDG treatment ('none'). Not required for TSV input.",
                    "fa_icon": "fas fa-vial",
                    "help_text": "Defines whether Uracil-DNA glycosylase (UDG) treatment was used to remove DNA\ndamage on the sequencing libraries.\n\nSpecify `'none'` if no treatment was performed. If you have partial UDG treated\ndata ([Rohland et al 2016](http://dx.doi.org/10.1098/rstb.2013.0624)), specify\n`'half'`. If you have complete UDG treated data ([Briggs et al.\n2010](https://doi.org/10.1093/nar/gkp1163)), specify `'full'`. \n\nWhen also using PMDtools specifying `'half'` will use a different model for DNA\ndamage assessment in PMDTools (PMDtools: `--UDGhalf`). Specify `'full'` and the\nPMDtools DNA damage assessment will use CpG context only (PMDtools: `--CpG`).\nDefault: `'none'`.\n\n> **Tip**: You should provide a small decoy reference genome with pre-made indices, e.g.\n> the human mtDNA genome, for the mandatory parameter `--fasta` in order to\n> avoid long computational time for generating the index files of the reference\n> genome, even if you do not actually need a reference genome for any downstream\n> analyses.",
                    "enum": [
                        "none",
                        "half",
                        "full"
                    ]
                },
                "single_stranded": {
                    "type": "boolean",
                    "description": "Specifies that libraries are single stranded. Always affects MALTExtract but will be ignored by pileupCaller with TSV input. Not required for TSV input.",
                    "fa_icon": "fas fa-minus",
                    "help_text": "Indicates libraries are single stranded.\n\nCurrently only affects MALTExtract where it will switch on damage patterns\ncalculation mode to single-stranded, (MaltExtract: `--singleStranded`) and\ngenotyping with pileupCaller where a different method is used (pileupCaller:\n`--singleStrandMode`). Default: false\n\nOnly required when using the 'Path' method of `--input`"
                },
                "single_end": {
                    "type": "boolean",
                    "description": "Specifies that the input is single end reads. Not required for TSV input.",
                    "fa_icon": "fas fa-align-left",
                    "help_text": "By default, the pipeline expects paired-end data. If you have single-end data, specify this parameter on the command line when you launch the pipeline. It is not possible to run a mixture of single-end and paired-end files in one run.\n\nOnly required when using the 'Path' method of `--input`"
                },
                "colour_chemistry": {
                    "type": "integer",
                    "default": 4,
                    "description": "Specifies which Illumina sequencing chemistry was used. Used to inform whether to poly-G trim if turned on (see below). Not required for TSV input. Options: 2, 4.",
                    "fa_icon": "fas fa-palette",
                    "help_text": "Specifies which Illumina colour chemistry a library was sequenced with. This informs whether to perform poly-G trimming (if `--complexity_filter_poly_g` is also supplied). Only 2 colour chemistry sequencers (e.g. NextSeq or NovaSeq) can generate uncertain poly-G tails (due to 'G' being indicated via a no-colour detection). Default is '4' to indicate e.g. HiSeq or MiSeq platforms, which do not require poly-G trimming. Options: 2, 4. Default: 4\n\nOnly required when using the 'Path' method of input."
                },
                "bam": {
                    "type": "boolean",
                    "description": "Specifies that the input is in BAM format. Not required for TSV input.",
                    "fa_icon": "fas fa-align-justify",
                    "help_text": "Specifies that the input supplied to `--input` is in BAM format. This will automatically also apply `--single_end`.\n\nOnly required when using the 'Path' method of `--input`.\n"
                }
            },
            "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager.\nThe more efficient but simpler way is to supply direct paths (with\nwildcards) to your FASTQ or BAM files, with each file or pair being considered a\nsingle library and each one run independently. TSV input requires creation of an\nextra file by the user and extra metadata, but allows more powerful lane and\nlibrary merging."
        },
        "input_data_additional_options": {
            "title": "Input Data Additional Options",
            "type": "object",
            "description": "Additional options regarding input data.",
            "default": "",
            "properties": {
                "snpcapture_bed": {
                    "type": "string",
                    "fa_icon": "fas fa-magnet",
                    "description": "If the library is the result of SNP capture, path to a BED file containing SNP positions on the reference genome. SNP statistics are reported in the qualimap results directory only, not in MultiQC.",
                    "help_text": "Can be used to set a path to a BED file (3/6 column format) of SNP positions of a reference genome, to calculate SNP captured libraries on-target efficiency. This should be used for array or in-solution SNP capture protocols such as 390K, 1240K, etc. If supplied, some on-target metrics are automatically generated for you by qualimap in the 'Globals inside' section of the 'genome_results.txt' file in the qualimap results directory. These statistics are currently NOT displayed in MultiQC!"
                },
                "run_convertinputbam": {
                    "type": "boolean",
                    "description": "Turns on conversion of an input BAM file into FASTQ format to allow re-preprocessing (e.g. AdapterRemoval etc.).",
                    "fa_icon": "fas fa-undo-alt",
                    "help_text": "Allows you to convert an input BAM file back to FASTQ for downstream processing. Note this is required if you need to perform AdapterRemoval and/or polyG clipping.\n\nIf not turned on, BAMs will automatically be sent to post-mapping steps."
                }
            },
            "fa_icon": "far fa-plus-square"
        },
        "reference_genome_options": {
            "title": "Reference genome options",
            "type": "object",
            "fa_icon": "fas fa-dna",
            "properties": {
                "fasta": {
                    "type": "string",
                    "fa_icon": "fas fa-font",
                    "description": "Path or URL to a FASTA reference file (required if not iGenome reference). File suffixes can be: '.fa', '.fn', '.fna', '.fasta'.",
                    "help_text": "You specify the full path to your reference genome here. The FASTA file can have any file suffix, such as `.fasta`, `.fna`, `.fa`, `.FastA` etc. You may also supply a gzipped reference file, which will be unzipped automatically for you.\n\nFor example:\n\n```bash\n--fasta '/<path>/<to>/my_reference.fasta'\n```\n\n> If you don't specify appropriate `--bwa_index`, `--fasta_index` parameters, the pipeline will create these indices for you automatically. Note that you can save the indices created for you for later by giving the `--save_reference` flag.\n> You must select either a `--fasta` or `--genome`\n"
                },
                "genome": {
                    "type": "string",
                    "description": "Name of iGenomes reference (required if not FASTA reference). Requires argument `--igenomes_ignore false`, as iGenomes is ignored by default in nf-core/eager",
                    "fa_icon": "fas fa-book",
                    "help_text": "Alternatively to `--fasta`, the pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with docker or AWS, the configuration is set up to use the [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) resource.\n\nThere are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the `--genome` flag.\n\nYou can find the keys to specify the genomes in the [iGenomes config file](../conf/igenomes.config). Common genomes that are supported are:\n\n- Human\n  - `--genome GRCh37`\n  - `--genome GRCh38`\n- Mouse *\n  - `--genome GRCm38`\n- _Drosophila_ *\n  - `--genome BDGP6`\n- _S. cerevisiae_ *\n  - `--genome 'R64-1-1'`\n\n> \\* Not bundled with nf-core eager by default.\n\nNote that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource. See the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for instructions on where to save such a file.\n\nThe syntax for this reference configuration is as follows:\n\n```nextflow\nparams {\n  genomes {\n    'GRCh37' {\n      fasta   = '<path to the iGenomes genome fasta file>'\n    }\n    // Any number of additional genomes, key is used with --genome\n  }\n}\n```\n\n**NB** Requires argument `--igenomes_ignore false`, as iGenomes is ignored by default in nf-core/eager."
                },
                "igenomes_base": {
                    "type": "string",
                    "description": "Directory / URL base for iGenomes references.",
                    "default": "s3://ngi-igenomes/igenomes",
                    "fa_icon": "fas fa-cloud-download-alt",
                    "hidden": true
                },
                "igenomes_ignore": {
                    "type": "boolean",
                    "description": "Do not load the iGenomes reference config.",
                    "fa_icon": "fas fa-ban",
                    "hidden": true,
                    "help_text": "Do not load `igenomes.config` when running the pipeline. You may choose this option if you observe clashes between custom parameters and those supplied in `igenomes.config`."
                },
                "bwa_index": {
                    "type": "string",
                    "description": "Path to directory containing pre-made BWA indices (i.e. the directory before the files ending in '.amb' '.ann' '.bwt'. Do not include the files themselves. Most likely the same directory of the file provided with --fasta). If not supplied will be made for you.",
                    "fa_icon": "fas fa-address-book",
                    "help_text": "If you want to use pre-existing `bwa index` indices, please supply the **directory** containing the FASTA you also specified in `--fasta`. nf-core/eager will automagically detect the index files by searching for the FASTA filename with the corresponding `bwa` index file suffixes.\n\nFor example:\n\n```bash\nnextflow run nf-core/eager \\\n-profile test,docker \\\n--input '*{R1,R2}*.fq.gz' \\\n--fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \\\n--bwa_index 'results/reference_genome/bwa_index/BWAIndex/'\n```\n\n> `bwa index` does not give you an option to supply alternative suffixes/names for these indices. Thus, the file names generated by this command _must not_ be changed, otherwise nf-core/eager will not be able to find them."
                },
                "bt2_index": {
                    "type": "string",
                    "description": "Path to directory containing pre-made Bowtie2 indices (i.e. everything before the endings e.g. '.1.bt2', '.2.bt2', '.rev.1.bt2'. Most likely the same value as --fasta). If not supplied will be made for you.",
                    "fa_icon": "far fa-address-book",
                    "help_text": "If you want to use pre-existing `bowtie2-build` indices, please supply the **directory** containing the FASTA you also specified in `--fasta`. nf-core/eager will automagically detect the index files by searching for the FASTA filename with the corresponding `bt2` index file suffixes.\n\nFor example:\n\n```bash\nnextflow run nf-core/eager \\\n-profile test,docker \\\n--input '*{R1,R2}*.fq.gz' \\\n--fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \\\n--bt2_index 'results/reference_genome/bt2_index/BT2Index/'\n```\n\n> `bowtie2-build` does not give you an option to supply alternative suffixes/names for these indices. Thus, the file names generated by this command _must not_ be changed, otherwise nf-core/eager will not be able to find them."
                },
                "fasta_index": {
                    "type": "string",
                    "description": "Path to samtools FASTA index (typically ending in '.fai'). If not supplied will be made for you.",
                    "fa_icon": "far fa-bookmark",
                    "help_text": "If you want to use a pre-existing `samtools faidx` index, use this to specify the required FASTA index file for the selected reference genome. This should be generated by `samtools faidx` and has a file suffix of `.fai`\n\nFor example:\n\n```bash\n--fasta_index 'Mammoth_MT_Krause.fasta.fai'\n```"
                },
                "seq_dict": {
                    "type": "string",
                    "description": "Path to picard sequence dictionary file (typically ending in '.dict'). If not supplied will be made for you.",
                    "fa_icon": "fas fa-spell-check",
                    "help_text": "If you want to use a pre-existing `picard CreateSequenceDictionary` dictionary file, use this to specify the required `.dict` file for the selected reference genome.\n\nFor example:\n\n```bash\n--seq_dict 'Mammoth_MT_Krause.dict'\n```"
                },
                "large_ref": {
                    "type": "boolean",
                    "description": "Specify to generate more recent '.csi' BAM indices. If your reference genome is larger than 3.5GB, this is recommended due to more efficient data handling with the '.csi' format over the older '.bai'.",
                    "fa_icon": "fas fa-mountain",
                    "help_text": "This parameter is required to be set for large reference genomes. If your\nreference genome is larger than 3.5GB, the `samtools index` calls in the\npipeline need to generate `CSI` indices instead of `BAI` indices to compensate\nfor the size of the reference genome (with samtools: `-c`). This parameter is\nnot required for smaller references (including the human `hg19` or\n`grch37`/`grch38` references), but `>4GB` genomes have been shown to need `CSI`\nindices. Default: off"
                },
                "save_reference": {
                    "type": "boolean",
                    "description": "If not already supplied by user, turns on saving of generated reference genome indices for later re-usage.",
                    "fa_icon": "far fa-save",
                    "help_text": "Use this if you do not have pre-made reference FASTA indices for `bwa`, `samtools` and `picard`. If you turn this on, the indices nf-core/eager generates for you will be saved in `<your_output_dir>/results/reference_genomes`. If not supplied, nf-core/eager generated index references will be deleted."
                }
            },
            "description": "Specify locations of references and optionally, additional pre-made indices",
            "help_text": "All nf-core/eager runs require a reference genome in FASTA format to map reads\nagainst to.\n\nIn addition we provide various options for indexing of different types of\nreference genomes (based on the tools used in the pipeline). nf-core/eager can\nindex reference genomes for you (with options to save these for other analysis),\nbut you can also supply your pre-made indices.\n\nSupplying pre-made indices saves time in pipeline execution and is especially\nadvised when running multiple times on the same cluster system for example. You\ncan even add a resource [specific profile](#profile) that sets paths to\npre-computed reference genomes, saving time when specifying these.\n\n> :warning: you must always supply a reference file. If you want to use\n  functionality that does not require one, supply a small decoy genome such as\n  phiX or the human mtDNA genome."
        },
        "output_options": {
            "title": "Output options",
            "type": "object",
            "description": "Specify where to put output files and optional saving of intermediate files",
            "default": "",
            "properties": {
                "outdir": {
                    "type": "string",
                    "description": "The output directory where the results will be saved.",
                    "default": "./results",
                    "fa_icon": "fas fa-folder-open",
                    "help_text": "The output directory where the results will be saved. By default will be made in the directory you run the command in under `./results`."
                },
                "publish_dir_mode": {
                    "type": "string",
                    "default": "copy",
                    "hidden": true,
                    "description": "Method used to save pipeline results to output directory.",
                    "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.",
                    "fa_icon": "fas fa-copy",
                    "enum": [
                        "symlink",
                        "rellink",
                        "link",
                        "copy",
                        "copyNoFollow",
                        "move"
                    ]
                }
            },
            "fa_icon": "fas fa-cloud-download-alt"
        },
        "generic_options": {
            "title": "Generic options",
            "type": "object",
            "properties": {
                "help": {
                    "type": "boolean",
                    "description": "Display help text.",
                    "hidden": true,
                    "fa_icon": "fas fa-question-circle"
                },
                "validate_params": {
                    "type": "boolean",
                    "description": "Boolean whether to validate parameters against the schema at runtime",
                    "default": true,
                    "fa_icon": "fas fa-check-square",
                    "hidden": true
                },
                "email": {
                    "type": "string",
                    "description": "Email address for completion summary.",
                    "fa_icon": "fas fa-envelope",
                    "help_text": "An email address to send a summary email to when the pipeline is completed.",
                    "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$"
                },
                "email_on_fail": {
                    "type": "string",
                    "description": "Email address for completion summary, only when pipeline fails.",
                    "fa_icon": "fas fa-exclamation-triangle",
                    "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$",
                    "hidden": true,
                    "help_text": "Set this parameter to your e-mail address to get a summary e-mail with details of the run if it **fails**. Normally would be the same as in `--email` but can be different. If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run.\n\n> Note that this functionality requires either `mail` or `sendmail` to be installed on your system."
                },
                "plaintext_email": {
                    "type": "boolean",
                    "description": "Send plain-text email instead of HTML.",
                    "fa_icon": "fas fa-remove-format",
                    "hidden": true,
                    "help_text": "Set to receive plain-text e-mails instead of HTML formatted."
                },
                "max_multiqc_email_size": {
                    "type": "string",
                    "description": "File size limit when attaching MultiQC reports to summary emails.",
                    "default": "25.MB",
                    "fa_icon": "fas fa-file-upload",
                    "hidden": true,
                    "help_text": "If file generated by pipeline exceeds the threshold, it will not be attached."
                },
                "monochrome_logs": {
                    "type": "boolean",
                    "description": "Do not use coloured log outputs.",
                    "fa_icon": "fas fa-palette",
                    "hidden": true,
                    "help_text": "Set to disable colourful command line output and live life in monochrome."
                },
                "multiqc_config": {
                    "type": "string",
                    "description": "Custom config file to supply to MultiQC.",
                    "fa_icon": "fas fa-cog",
                    "hidden": true
                },
                "tracedir": {
                    "type": "string",
                    "description": "Directory to keep pipeline Nextflow logs and reports.",
                    "default": "${params.outdir}/pipeline_info",
                    "fa_icon": "fas fa-cogs",
                    "hidden": true
                },
                "show_hidden_params": {
                    "type": "boolean",
                    "fa_icon": "far fa-eye-slash",
                    "description": "Show all params when using `--help`",
                    "hidden": true,
                    "help_text": "By default, parameters set as _hidden_ in the schema are not shown on the command line when a user runs with `--help`. Specifying this option will tell the pipeline to show all parameters."
                },
                "enable_conda": {
                    "type": "boolean",
                    "hidden": true,
                    "description": "Parameter used for checking conda channels to be set correctly."
                },
                "schema_ignore_params": {
                    "type": "string",
                    "fa_icon": "fas fa-not-equal",
                    "description": "String to specify ignored parameters for parameter validation",
                    "hidden": true,
                    "default": "genomes"
                }
            },
            "fa_icon": "fas fa-file-import",
            "description": "Less common options for the pipeline, typically set in a config file.",
            "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs.\n\nTypically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`."
        },
        "max_job_request_options": {
            "title": "Max job request options",
            "type": "object",
            "fa_icon": "fab fa-acquisitions-incorporated",
            "description": "Set the top limit for requested resources for any single job.",
            "help_text": "If you are running on a smaller system, a pipeline step requesting more resources than are available may cause the Nextflow to stop the run with an error. These options allow you to cap the maximum resources requested by any single job so that the pipeline will run on your system.\n\nNote that you can not _increase_ the resources requested by any job using these options. For that you will need your own configuration file. See [the nf-core website](https://nf-co.re/usage/configuration) for details.",
            "properties": {
                "max_cpus": {
                    "type": "integer",
                    "description": "Maximum number of CPUs that can be requested    for any single job.",
                    "default": 16,
                    "fa_icon": "fas fa-microchip",
                    "hidden": true,
                    "help_text": "Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. `--max_cpus 1`"
                },
                "max_memory": {
                    "type": "string",
                    "description": "Maximum amount of memory that can be requested for any single job.",
                    "default": "128.GB",
                    "fa_icon": "fas fa-memory",
                    "pattern": "^\\d+(\\.\\d+)?\\.?\\s*(K|M|G|T)?B$",
                    "hidden": true,
                    "help_text": "Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. `--max_memory '8.GB'`"
                },
                "max_time": {
                    "type": "string",
                    "description": "Maximum amount of time that can be requested for any single job.",
                    "default": "240.h",
                    "fa_icon": "far fa-clock",
                    "pattern": "^(\\d+\\.?\\s*(s|m|h|day)\\s*)+$",
                    "hidden": true,
                    "help_text": "Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. `--max_time '2.h'`"
                }
            }
        },
        "institutional_config_options": {
            "title": "Institutional config options",
            "type": "object",
            "fa_icon": "fas fa-university",
            "description": "Parameters used to describe centralised config profiles. These generally should not be edited.",
            "help_text": "The centralised nf-core configuration profiles use a handful of pipeline parameters to describe themselves. This information is then printed to the Nextflow log when you run a pipeline. You should not need to change these values when you run a pipeline.",
            "properties": {
                "custom_config_version": {
                    "type": "string",
                    "description": "Git commit id for Institutional configs.",
                    "default": "master",
                    "hidden": true,
                    "fa_icon": "fas fa-users-cog",
                    "help_text": "Provide git commit id for custom Institutional configs hosted at `nf-core/configs`. This was implemented for reproducibility purposes. Default: `master`.\n\n```bash\n## Download and use config file with following git commit id\n--custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96\n```"
                },
                "custom_config_base": {
                    "type": "string",
                    "description": "Base directory for Institutional configs.",
                    "default": "https://raw.githubusercontent.com/nf-core/configs/master",
                    "hidden": true,
                    "help_text": "If you're running offline, nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell nextflow where to find them with the `custom_config_base` option. For example:\n\n```bash\n## Download and unzip the config files\ncd /path/to/my/configs\nwget https://github.com/nf-core/configs/archive/master.zip\nunzip master.zip\n\n## Run the pipeline\ncd /path/to/my/data\nnextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/\n```\n\n> Note that the nf-core/tools helper package has a `download` command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier.",
                    "fa_icon": "fas fa-users-cog"
                },
                "hostnames": {
                    "type": "string",
                    "description": "Institutional configs hostname.",
                    "hidden": true,
                    "fa_icon": "fas fa-users-cog"
                },
                "config_profile_name": {
                    "type": "string",
                    "description": "Institutional config name.",
                    "hidden": true,
                    "fa_icon": "fas fa-users-cog"
                },
                "config_profile_description": {
                    "type": "string",
                    "description": "Institutional config description.",
                    "hidden": true,
                    "fa_icon": "fas fa-users-cog"
                },
                "config_profile_contact": {
                    "type": "string",
                    "description": "Institutional config contact information.",
                    "hidden": true,
                    "fa_icon": "fas fa-users-cog"
                },
                "config_profile_url": {
                    "type": "string",
                    "description": "Institutional config URL link.",
                    "hidden": true,
                    "fa_icon": "fas fa-users-cog"
                },
                "awsqueue": {
                    "type": "string",
                    "description": "The AWSBatch JobQueue that needs to be set when running on AWSBatch",
                    "fa_icon": "fab fa-aws"
                },
                "awsregion": {
                    "type": "string",
                    "default": "eu-west-1",
                    "description": "The AWS Region for your AWS Batch job to run on",
                    "fa_icon": "fab fa-aws"
                },
                "awscli": {
                    "type": "string",
                    "description": "Path to the AWS CLI tool",
                    "fa_icon": "fab fa-aws"
                }
            }
        },
        "skip_steps": {
            "title": "Skip steps",
            "type": "object",
            "description": "Skip any of the mentioned steps.",
            "default": "",
            "properties": {
                "skip_fastqc": {
                    "type": "boolean",
                    "fa_icon": "fas fa-fast-forward",
                    "help_text": "Turns off FastQC pre- and post-Adapter Removal, to speed up the pipeline. Use of this flag is most common when data has been previously pre-processed and the post-Adapter Removal mapped reads are being re-mapped to a new reference genome."
                },
                "skip_adapterremoval": {
                    "type": "boolean",
                    "fa_icon": "fas fa-fast-forward",
                    "help_text": "Turns off adapter trimming and paired-end read merging. Equivalent to setting both `--skip_collapse` and `--skip_trim`."
                },
                "skip_preseq": {
                    "type": "boolean",
                    "fa_icon": "fas fa-fast-forward",
                    "help_text": "Turns off the computation of library complexity estimation."
                },
                "skip_deduplication": {
                    "type": "boolean",
                    "fa_icon": "fas fa-fast-forward",
                    "help_text": "Turns off duplicate removal methods DeDup and MarkDuplicates respectively. No duplicates will be removed on any data in the pipeline.\n"
                },
                "skip_damage_calculation": {
                    "type": "boolean",
                    "fa_icon": "fas fa-fast-forward",
                    "help_text": "Turns off the DamageProfiler module to compute DNA damage profiles.\n"
                },
                "skip_qualimap": {
                    "type": "boolean",
                    "fa_icon": "fas fa-fast-forward",
                    "help_text": "Turns off QualiMap and thus does not compute coverage and other mapping metrics.\n"
                }
            },
            "fa_icon": "fas fa-fast-forward",
            "help_text": "Some of the steps in the pipeline can be executed optionally. If you specify\nspecific steps to be skipped, there won't be any output related to these\nmodules."
        },
        "complexity_filtering": {
            "title": "Complexity filtering",
            "type": "object",
            "description": "Processing of Illumina two-colour chemistry data.",
            "default": "",
            "properties": {
                "complexity_filter_poly_g": {
                    "type": "boolean",
                    "description": "Turn on running poly-G removal on FASTQ files. Will only be performed on 2 colour chemistry machine sequenced libraries.",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Performs a poly-G tail removal step in the beginning of the pipeline using `fastp`, if turned on. This can be useful for trimming ploy-G tails from short-fragments sequenced on two-colour Illumina chemistry such as NextSeqs (where no-fluorescence is read as a G on two-colour chemistry), which can inflate reported GC content values.\n"
                },
                "complexity_filter_poly_g_min": {
                    "type": "integer",
                    "default": 10,
                    "description": "Specify length of poly-g min for clipping to be performed.",
                    "fa_icon": "fas fa-ruler-horizontal",
                    "help_text": "This option can be used to define the minimum length of a poly-G tail to begin low complexity trimming. By default, this is set to a value of `10` unless the user has chosen something specifically using this option.\n\n> Modifies fastp parameter: `--poly_g_min_len`"
                }
            },
            "fa_icon": "fas fa-filter",
            "help_text": "More details can be seen in the [fastp\ndocumentation](https://github.com/OpenGene/fastp)\n\nIf using TSV input, this is performed per lane separately"
        },
        "read_merging_and_adapter_removal": {
            "title": "Read merging and adapter removal",
            "type": "object",
            "description": "Options for adapter clipping and paired-end merging.",
            "default": "",
            "properties": {
                "clip_forward_adaptor": {
                    "type": "string",
                    "default": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
                    "description": "Specify adapter sequence to be clipped off (forward strand).",
                    "fa_icon": "fas fa-cut",
                    "help_text": "Defines the adapter sequence to be used for the forward read. By default, this is set to `'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'`.\n\n> Modifies AdapterRemoval parameter: `--adapter1`"
                },
                "clip_reverse_adaptor": {
                    "type": "string",
                    "default": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA",
                    "description": "Specify adapter sequence to be clipped off (reverse strand).",
                    "fa_icon": "fas fa-cut",
                    "help_text": "Defines the adapter sequence to be used for the reverse read in paired end sequencing projects. This is set to `'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'` by default.\n\n> Modifies AdapterRemoval parameter: `--adapter2`"
                },
                "clip_adapters_list": {
                    "type": "string",
                    "description": "Path to AdapterRemoval adapter list file. Overrides `--clip_*_adaptor` parameters",
                    "fa_icon": "fas fa-cut",
                    "help_text": "Allows to supply a file with a list of adapter (combinations) to remove from all files. **Overrides** the `--clip_*_adaptor` parameters . First column represents forward strand, second column for reverse strand. You must supply all possibly combinations, one per line, and this list is applied to all files. See [AdapterRemoval documentation](https://adapterremoval.readthedocs.io/en/latest/manpage.html) for more information.\n\n> Modifies AdapterRemoval parameter: `--adapter-list`"
                },
                "clip_readlength": {
                    "type": "integer",
                    "default": 30,
                    "description": "Specify read minimum length to be kept for downstream analysis.",
                    "fa_icon": "fas fa-ruler",
                    "help_text": "Defines the minimum read length that is required for reads after merging to be considered for downstream analysis after read merging. Default is `30`.\n\nNote that when you have a large percentage of very short reads in your library (< 20 bp) - such as retrieved in single-stranded library protocols - that performing read length filtering at this step is not _always_ reliable for correct endogenous DNA calculation.  When you have very few reads passing this length filter, it will artificially inflate your 'endogenous DNA' value by creating a very small denominator. \n\nIf you notice you have ultra short reads (< 20 bp), it is recommended to set this parameter to 0, and use `--bam_filter_minreadlength` instead, to filter out 'un-usable' short reads after mapping. A caveat, however, is that this will cause a very large increase in computational run time, due to all reads in the library will be being mapped.\n\n> Modifies AdapterRemoval parameter: `--minlength`\n"
                },
                "clip_min_read_quality": {
                    "type": "integer",
                    "default": 20,
                    "description": "Specify minimum base quality for trimming off bases.",
                    "fa_icon": "fas fa-medal",
                    "help_text": "Defines the minimum read quality per base that is required for a base to be kept. Individual bases at the ends of reads falling below this threshold will be clipped off. Default is set to `20`.\n\n> Modifies AdapterRemoval parameter: `--minquality`"
                },
                "min_adap_overlap": {
                    "type": "integer",
                    "default": 1,
                    "description": "Specify minimum adapter overlap required for clipping.",
                    "fa_icon": "fas fa-hands-helping",
                    "help_text": "Specifies a minimum number of bases that overlap with the adapter sequence before adapters are trimmed from reads. Default is set to `1` base overlap.\n\n> Modifies AdapterRemoval parameter: `--minadapteroverlap`"
                },
                "skip_collapse": {
                    "type": "boolean",
                    "description": "Skip of merging forward and reverse reads together and turns on paired-end alignment for downstream mapping. Only applicable for paired-end libraries.",
                    "fa_icon": "fas fa-fast-forward",
                    "help_text": "Turns off the paired-end read merging.\n\nFor example\n\n```bash\n--skip_collapse  --input '*_{R1,R2}_*.fastq'\n```\n\nIt is important to use the paired-end wildcard globbing as `--skip_collapse` can only be used on paired-end data!\n\n:warning: If you run this and also with `--clip_readlength` set to something (as is by default), you may end up removing single reads from either the pair1 or pair2 file. These will be NOT be mapped when aligning with either `bwa` or `bowtie`, as both can only accept one (forward) or two (forward and reverse) FASTQs as input.\n\nAlso note that supplying this flag will then also cause downstream mapping steps to run in paired-end mode. This may be more suitable for modern data, or when you want to utilise mate-pair spatial information.\n\n> Modifies AdapterRemoval parameter: `--collapse`"
                },
                "skip_trim": {
                    "type": "boolean",
                    "description": "Skip adapter and quality trimming.",
                    "fa_icon": "fas fa-fast-forward",
                    "help_text": "Turns off adapter AND quality trimming.\n\nFor example:\n\n```bash\n--skip_trim  --input '*.fastq'\n```\n\n:warning: it is not possible to keep quality trimming (n or base quality) on,\n_and_ skip adapter trimming.\n\n:warning: it is not possible to turn off one or the other of quality\ntrimming or n trimming. i.e. --trimns --trimqualities are both given\nor neither. However setting quality in `--clip_min_read_quality` to 0 would\ntheoretically turn off base quality trimming.\n\n> Modifies AdapterRemoval parameters: `--trimns --trimqualities --adapter1 --adapter2`"
                },
                "preserve5p": {
                    "type": "boolean",
                    "description": "Skip quality base trimming (n, score, window) of 5 prime end.",
                    "fa_icon": "fas fa-life-ring",
                    "help_text": "Turns off quality based trimming at the 5p end of reads when any of the --trimns, --trimqualities, or --trimwindows options are used. Only 3p end of reads will be removed.\n\nThis also entirely disables quality based trimming of collapsed reads, since both ends of these are informative for PCR duplicate filtering. Described [here](https://github.com/MikkelSchubert/adapterremoval/issues/32#issuecomment-504758137).\n\n> Modifies AdapterRemoval parameters: `--preserve5p`"
                },
                "mergedonly": {
                    "type": "boolean",
                    "description": "Only use merged reads downstream (un-merged reads and singletons are discarded).",
                    "fa_icon": "fas fa-handshake",
                    "help_text": "Specify that only merged reads are sent downstream for analysis.\n\nSingletons (i.e. reads missing a pair), or un-merged reads (where there wasn't sufficient overlap) are discarded.\n\nYou may want to use this if you want ensure only the best quality reads for your analysis, but with the penalty of potentially losing still valid data (even if some reads have slightly lower quality). It is highly recommended when using `--dedupper 'dedup'` (see below)."
                },
                "qualitymax": {
                    "type": "integer",
                    "description": "Specify the maximum Phred score used in input FASTQ files",
                    "help_text": "Specify maximum Phred score of the quality field of FASTQ files. The quality-score range can vary depending on the machine and version (e.g. see diagram [here](https://en.wikipedia.org/wiki/FASTQ_format#Encoding), and this allows you to increase from the default AdapterRemoval value of `41`.\n\n> Modifies AdapterRemoval parameters: `--qualitymax`",
                    "default": 41,
                    "fa_icon": "fas fa-arrow-up"
                },
                "run_post_ar_trimming": {
                    "type": "boolean",
                    "description": "Turn on trimming of inline barcodes (i.e. internal barcodes after adapter removal)",
                    "help_text": "In some cases, you may want to additionally trim reads in a FASTQ file after adapter removal.\n\nThis could be to remove short 'inline' or 'internal' barcodes that are ligated directly onto DNA molecules prior ligation of adapters and indicies (the former of which allow ultra-multiplexing and/or checks for barcode hopping).\n\nIn other cases, you may wish to already remove known high-frequency damage bases to allow stricter mapping.\n\nTurning on this module uses `fastp` to trim one, or both ends of a merged read, or in cases where you have not collapsed your read, R1 and R2.\n"
                },
                "post_ar_trim_front": {
                    "type": "integer",
                    "default": 7,
                    "description": "Specify the number of bases to trim off the front of a merged read or R1",
                    "help_text": "Specify the number of bases to trim off the start of a read in a merged- or forward read FASTQ file.\n\n> Modifies fastp parameters: `--trim_front1`"
                },
                "post_ar_trim_tail": {
                    "type": "integer",
                    "default": 7,
                    "description": "Specify the number of bases to trim off the tail of of a merged read or R1",
                    "help_text": "Specify the number of bases to trim off the end of a read in a merged- or forward read FASTQ file.\n\n> Modifies fastp parameters: `--trim_tail1`"
                },
                "post_ar_trim_front2": {
                    "type": "integer",
                    "default": 7,
                    "description": "Specify the number of bases to trim off the front of R2",
                    "help_text": "Specify the number of bases to trim off the start of a read in an unmerged forward read (R1) FASTQ file.\n\n> Modifies fastp parameters: `--trim_front2`"
                },
                "post_ar_trim_tail2": {
                    "type": "integer",
                    "default": 7,
                    "description": "Specify the number of bases to trim off the tail of R2",
                    "help_text": "Specify the number of bases to trim off the end of a read in an unmerged reverse read (R2) FASTQ file.\n\n> Modifies fastp parameters: `--trim_tail2`"
                }
            },
            "fa_icon": "fas fa-cut",
            "help_text": "These options handle various parts of adapter clipping and read merging steps.\n\nMore details can be seen in the [AdapterRemoval\ndocumentation](https://adapterremoval.readthedocs.io/en/latest/)\n\nIf using TSV input, this is performed per lane separately.\n\n> :warning: `--skip_trim` will skip adapter clipping AND quality trimming\n> (n, base quality). It is currently not possible skip one or the other."
        },
        "mapping": {
            "title": "Read mapping to reference genome",
            "type": "object",
            "description": "Options for reference-genome mapping",
            "default": "",
            "properties": {
                "mapper": {
                    "title": "Mapper",
                    "type": "string",
                    "description": "Specify which mapper to use. Options: 'bwaaln', 'bwamem', 'circularmapper', 'bowtie2'.",
                    "default": "bwaaln",
                    "fa_icon": "fas fa-layer-group",
                    "help_text": "Specify which mapping tool to use. Options are BWA aln (`'bwaaln'`), BWA mem (`'bwamem'`), circularmapper (`'circularmapper'`), or bowtie2 (`bowtie2`). BWA aln is the default and highly suited for short-read ancient DNA. BWA mem can be quite useful for modern DNA, but is rarely used in projects for ancient DNA. CircularMapper enhances  the mapping procedure to circular references, using the BWA algorithm but utilizing a extend-remap procedure (see Peltzer et al 2016, Genome Biology for details). Bowtie2 is similar to BWA aln, and has recently been suggested to provide slightly better results under certain conditions ([Poullet and Orlando 2020](https://doi.org/10.3389/fevo.2020.00105)), as well as providing extra functionality (such as FASTQ trimming). Default is 'bwaaln'\n\nMore documentation can be seen for each tool under:\n\n- [BWA aln](http://bio-bwa.sourceforge.net/bwa.shtml#3)\n- [BWA mem](http://bio-bwa.sourceforge.net/bwa.shtml#3)\n- [CircularMapper](https://circularmapper.readthedocs.io/en/latest/contents/userguide.html)\n- [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line)\n",
                    "enum": [
                        "bwaaln",
                        "bwamem",
                        "circularmapper",
                        "bowtie2"
                    ]
                },
                "bwaalnn": {
                    "type": "number",
                    "default": 0.01,
                    "description": "Specify the -n parameter for BWA aln, i.e. amount of allowed mismatches in the alignment.",
                    "fa_icon": "fas fa-sort-numeric-down",
                    "help_text": "Configures the `bwa aln -n` parameter, defining how many mismatches are allowed in a read. By default set to `0.04` (following recommendations of [Schubert et al. (2012 _BMC Genomics_)](https://doi.org/10.1186/1471-2164-13-178)), if you're uncertain what to set check out [this](https://apeltzer.shinyapps.io/bwa-mismatches/) Shiny App for more information on how to set this parameter efficiently.\n\n> Modifies bwa aln parameter: `-n`"
                },
                "bwaalnk": {
                    "type": "integer",
                    "default": 2,
                    "description": "Specify the -k parameter for BWA aln, i.e. maximum edit distance allowed in a seed.",
                    "fa_icon": "fas fa-drafting-compass",
                    "help_text": "Configures the `bwa aln -k` parameter for the seeding phase in the mapping algorithm. Default is set to `2`.\n\n> Modifies BWA aln parameter: `-k`"
                },
                "bwaalnl": {
                    "type": "integer",
                    "default": 1024,
                    "description": "Specify the -l parameter for BWA aln i.e. the length of seeds to be used.",
                    "fa_icon": "fas fa-ruler-horizontal",
                    "help_text": "Configures the length of the seed used in `bwa aln -l`. Default is set to be 'turned off' at the recommendation of Schubert et al. ([2012 _BMC Genomics_](https://doi.org/10.1186/1471-2164-13-178)) for ancient DNA with `1024`.\n\nNote: Despite being recommended, turning off seeding can result in long runtimes!\n\n> Modifies BWA aln parameter: `-l`\n"
                },
                "bwaalno": {
                    "type": "integer",
                    "default": 2,
                    "fa_icon": "fas fa-people-arrows",
                    "description": "Specify the -o parameter for BWA aln i.e. the number of gaps allowed.",
                    "help_text": "Configures the number of gaps used in `bwa aln`. Default is set to `bwa` default.\n\n> Modifies BWA aln parameter: `-o`\n"
                },
                "circularextension": {
                    "type": "integer",
                    "default": 500,
                    "description": "Specify the number of bases to extend reference by (circularmapper only).",
                    "fa_icon": "fas fa-external-link-alt",
                    "help_text": "The number of bases to extend the reference genome with. By default this is set to `500` if not specified otherwise.\n\n> Modifies circulargenerator and realignsamfile parameter: `-e`"
                },
                "circulartarget": {
                    "type": "string",
                    "default": "MT",
                    "description": "Specify the FASTA header of the target chromosome to extend (circularmapper only).",
                    "fa_icon": "fas fa-bullseye",
                    "help_text": "The chromosome in your FASTA reference that you'd like to be treated as circular. By default this is set to `MT` but can be configured to match any other chromosome.\n\n> Modifies circulargenerator parameter: `-s`"
                },
                "circularfilter": {
                    "type": "boolean",
                    "description": "Turn on to remove reads that did not map to the circularised genome (circularmapper only).",
                    "fa_icon": "fas fa-filter",
                    "help_text": "If you want to filter out reads that don't map to a circular chromosome (and also non-circular chromosome headers) from the resulting BAM file, turn this on. By default this option is turned off.\n> Modifies -f and -x parameters of CircularMapper's realignsamfile\n"
                },
                "bt2_alignmode": {
                    "type": "string",
                    "default": "local",
                    "description": "Specify the bowtie2 alignment mode. Options:  'local', 'end-to-end'.",
                    "fa_icon": "fas fa-arrows-alt-h",
                    "help_text": "The type of read alignment to use. Options are 'local' or 'end-to-end'. Local allows only partial alignment of read, with ends of reads possibly 'soft-clipped' (i.e. remain unaligned/ignored), if the soft-clipped alignment provides best alignment score. End-to-end requires all nucleotides to be aligned. Default is 'local', following [Cahill et al (2018)](https://doi.org/10.1093/molbev/msy018) and [Poullet and Orlando 2020](https://doi.org/10.3389/fevo.2020.00105).\n\n> Modifies Bowtie2 parameters: `--very-fast --fast --sensitive --very-sensitive --very-fast-local --fast-local --sensitive-local --very-sensitive-local`",
                    "enum": [
                        "local",
                        "end-to-end"
                    ]
                },
                "bt2_sensitivity": {
                    "type": "string",
                    "default": "sensitive",
                    "description": "Specify the level of sensitivity for the bowtie2 alignment mode. Options: 'no-preset', 'very-fast', 'fast', 'sensitive', 'very-sensitive'.",
                    "fa_icon": "fas fa-microscope",
                    "help_text": "The Bowtie2 'preset' to use. Options: 'no-preset' 'very-fast', 'fast', 'sensitive', or 'very-sensitive'. These strings apply to both `--bt2_alignmode` options. See the Bowtie2 [manual](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line) for actual settings. Default is 'sensitive' (following [Poullet and Orlando (2020)](https://doi.org/10.3389/fevo.2020.00105), when running damaged-data _without_ UDG treatment)\n\n> Modifies Bowtie2 parameters: `--very-fast --fast --sensitive --very-sensitive --very-fast-local --fast-local --sensitive-local --very-sensitive-local`",
                    "enum": [
                        "no-preset",
                        "very-fast",
                        "fast",
                        "sensitive",
                        "very-sensitive"
                    ]
                },
                "bt2n": {
                    "type": "integer",
                    "description": "Specify the -N parameter for bowtie2 (mismatches in seed). This will override defaults from alignmode/sensitivity.",
                    "fa_icon": "fas fa-sort-numeric-down",
                    "help_text": "The number of mismatches allowed in the seed during seed-and-extend procedure of Bowtie2. This will override any values set with `--bt2_sensitivity`. Can either be 0 or 1. Default: 0 (i.e. use`--bt2_sensitivity` defaults).\n\n> Modifies Bowtie2 parameters: `-N`",
                    "default": 0
                },
                "bt2l": {
                    "type": "integer",
                    "description": "Specify the -L parameter for bowtie2 (length of seed substrings). This will override defaults from alignmode/sensitivity.",
                    "fa_icon": "fas fa-ruler-horizontal",
                    "help_text": "The length of the seed sub-string to use during seeding. This will override any values set with `--bt2_sensitivity`. Default: 0 (i.e. use`--bt2_sensitivity` defaults: [20 for local and 22 for end-to-end](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line).\n\n> Modifies Bowtie2 parameters: `-L`",
                    "default": 0
                },
                "bt2_trim5": {
                    "type": "integer",
                    "description": "Specify number of bases to trim off from 5' (left) end of read before alignment.",
                    "fa_icon": "fas fa-cut",
                    "help_text": "Number of bases to trim at the 5' (left) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0\n\n> Modifies Bowtie2 parameters: `-bt2_trim5`",
                    "default": 0
                },
                "bt2_trim3": {
                    "type": "integer",
                    "description": "Specify number of bases to trim off from 3' (right) end of read before alignment.",
                    "fa_icon": "fas fa-cut",
                    "help_text": "Number of bases to trim at the 3' (right) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0.\n\n> Modifies Bowtie2 parameters: `-bt2_trim3`",
                    "default": 0
                },
                "bt2_maxins": {
                    "type": "integer",
                    "default": 500,
                    "fa_icon": "fas fa-exchange-alt",
                    "description": "Specify the maximum fragment length for Bowtie2 paired-end mapping mode only.",
                    "help_text": "The maximum fragment for valid paired-end alignments. Only for paired-end mapping (i.e. unmerged), and therefore typically only useful for modern data.\n\n See [Bowtie2 documentation](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) for more information.\n\n>  Modifies Bowtie2 parameters: `--maxins`"
                }
            },
            "fa_icon": "fas fa-layer-group",
            "help_text": "If using TSV input, mapping is performed at the library level, i.e. after lane merging.\n"
        },
        "host_removal": {
            "title": "Removal of Host-Mapped Reads",
            "type": "object",
            "description": "Options for production of host-read removed FASTQ files for privacy reasons.",
            "default": "",
            "properties": {
                "hostremoval_input_fastq": {
                    "type": "boolean",
                    "description": "Turn on per-library creation pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Create pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)\n"
                },
                "hostremoval_mode": {
                    "type": "string",
                    "default": "remove",
                    "description": "Host removal mode. Remove mapped reads completely from FASTQ (remove) or just mask mapped reads sequence by N (replace).",
                    "fa_icon": "fas fa-mask",
                    "help_text": "Read removal mode. Remove mapped reads completely (`'remove'`) or just replace mapped reads sequence by N (`'replace'`)\n\n> Modifies extract_map_reads.py parameter: `-m`",
                    "enum": [
                        "strip",
                        "replace",
                        "remove"
                    ]
                }
            },
            "fa_icon": "fas fa-user-shield",
            "help_text": "These parameters are used for removing mapped reads from the original input\nFASTQ files, usually in the context of uploading the original FASTQ files to a\npublic read archive (NCBI SRA/EBI ENA/DDBJ SRA).\n\nThese flags will produce FASTQ files almost identical to your input files,\nexcept that reads with the same read ID as one found in the mapped bam file, are\neither removed or 'masked' (every base replaced with Ns).\n\nThis functionality allows you to provide other researchers who wish to re-use\nyour data to apply their own adapter removal/read merging procedures, while\nmaintaining anonymity for sample donors - for example with microbiome\nresearch.\n\nIf using TSV input, stripping is performed library, i.e. after lane merging."
        },
        "bam_filtering": {
            "title": "BAM Filtering",
            "type": "object",
            "description": "Options for quality filtering and how to deal with off-target unmapped reads.",
            "default": "",
            "properties": {
                "run_bam_filtering": {
                    "type": "boolean",
                    "description": "Turn on filtering of mapping quality, read lengths, or unmapped reads of BAM files.",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Turns on the bam filtering module for either mapping quality filtering or unmapped read treatment.\n"
                },
                "bam_mapping_quality_threshold": {
                    "type": "integer",
                    "description": "Minimum mapping quality for reads filter.",
                    "fa_icon": "fas fa-greater-than-equal",
                    "help_text": "Specify a mapping quality threshold for mapped reads to be kept for downstream analysis. By default keeps all reads and is therefore set to `0` (basically doesn't filter anything).\n\n> Modifies samtools view parameter: `-q`",
                    "default": 0
                },
                "bam_filter_minreadlength": {
                    "type": "integer",
                    "fa_icon": "fas fa-ruler-horizontal",
                    "description": "Specify minimum read length to be kept after mapping.",
                    "help_text": "Specify minimum length of mapped reads. This filtering will apply at the same time as mapping quality filtering.\n\nIf used _instead_ of minimum length read filtering at AdapterRemoval, this can be useful to get more realistic endogenous DNA percentages, when most of your reads are very short (e.g. in single-stranded libraries) and would otherwise be discarded by AdapterRemoval (thus making an artificially small denominator for a typical endogenous DNA calculation). Note in this context you should not perform mapping quality filtering nor discarding of unmapped reads to ensure a correct denominator of all reads, for the endogenous DNA calculation.\n\n> Modifies filter_bam_fragment_length.py parameter: `-l`",
                    "default": 0
                },
                "bam_unmapped_type": {
                    "type": "string",
                    "default": "discard",
                    "description": "Defines whether to discard all unmapped reads, keep only bam and/or keep only fastq format Options: 'discard', 'bam', 'fastq', 'both'.",
                    "fa_icon": "fas fa-trash-alt",
                    "help_text": "Defines how to proceed with unmapped reads: `'discard'` removes all unmapped reads, `keep` keeps both unmapped and mapped reads in the same BAM file, `'bam'` keeps unmapped reads as BAM file, `'fastq'` keeps unmapped reads as FastQ file, `both` keeps both BAM and FASTQ files. Default is `discard`.  `keep` is what would happen if `--run_bam_filtering` was _not_ supplied.\n\nNote that in all cases, if `--bam_mapping_quality_threshold` is also supplied, mapping quality filtering will still occur on the mapped reads.\n\n> Modifies samtools view parameter: `-f4 -F4`",
                    "enum": [
                        "discard",
                        "keep",
                        "bam",
                        "fastq",
                        "both"
                    ]
                }
            },
            "fa_icon": "fas fa-sort-amount-down",
            "help_text": "Users can configure to keep/discard/extract certain groups of reads efficiently\nin the nf-core/eager pipeline.\n\nIf using TSV input, filtering is performed library, i.e. after lane merging.\n\nThis module utilises `samtools view` and `filter_bam_fragment_length.py`"
        },
        "deduplication": {
            "title": "DeDuplication",
            "type": "object",
            "description": "Options for removal of PCR amplicon duplicates that can artificially inflate coverage.",
            "default": "",
            "properties": {
                "dedupper": {
                    "type": "string",
                    "default": "markduplicates",
                    "description": "Deduplication method to use. Options: 'markduplicates',  'dedup'.",
                    "fa_icon": "fas fa-object-group",
                    "help_text": "Sets the duplicate read removal tool. By default uses `markduplicates` from Picard. Alternatively an ancient DNA specific read deduplication tool `dedup` ([Peltzer et al. 2016](http://dx.doi.org/10.1186/s13059-016-0918-z)) is offered.\n\nThis utilises both ends of paired-end data to remove duplicates (i.e. true exact duplicates, as markduplicates will over-zealously deduplicate anything with the same starting position even if the ends are different). DeDup should generally only be used solely on paired-end data otherwise suboptimal deduplication can occur if applied to either single-end or a mix of single-end/paired-end data.\n",
                    "enum": [
                        "markduplicates",
                        "dedup"
                    ]
                },
                "dedup_all_merged": {
                    "type": "boolean",
                    "description": "Turn on treating all reads as merged reads.",
                    "fa_icon": "fas fa-handshake",
                    "help_text": "Sets DeDup to treat all reads as merged reads. This is useful if reads are for example not prefixed with `M_` in all cases. Therefore, this can be used as a workaround when also using a mixture of paired-end and single-end data, however this is not recommended (see above).\n\n> Modifies dedup parameter: `-m`"
                }
            },
            "fa_icon": "fas fa-clone",
            "help_text": "If using TSV input, deduplication is performed per library, i.e. after lane merging."
        },
        "library_complexity_analysis": {
            "title": "Library Complexity Analysis",
            "type": "object",
            "description": "Options for calculating library complexity (i.e. how many unique reads are present).",
            "default": "",
            "properties": {
                "preseq_mode": {
                    "type": "string",
                    "default": "c_curve",
                    "description": "Specify which mode of preseq to run.",
                    "fa_icon": "fas fa-toggle-on",
                    "help_text": "Specify which mode of preseq to run.\n\nFrom the [PreSeq documentation](http://smithlabresearch.org/wp-content/uploads/manual.pdf): \n\n`c curve` is used to compute the expected complexity curve of a mapped read file with a hypergeometric\nformula\n\n`lc extrap` is used to generate the expected yield for theoretical larger experiments and bounds on the\nnumber of distinct reads in the library and the associated confidence intervals, which is computed by\nbootstrapping the observed duplicate counts histogram",
                    "enum": [
                        "c_curve",
                        "lc_extrap"
                    ]
                },
                "preseq_step_size": {
                    "type": "integer",
                    "default": 1000,
                    "description": "Specify the step size of Preseq.",
                    "fa_icon": "fas fa-shoe-prints",
                    "help_text": "Can be used to configure the step size of Preseq's `c_curve` and `lc_extrap` method. Can be useful when only few and thus shallow sequencing results are used for extrapolation.\n\n> Modifies preseq c_curve and lc_extrap parameter: `-s`"
                },
                "preseq_maxextrap": {
                    "type": "integer",
                    "default": 10000000000,
                    "description": "Specify the maximum extrapolation (lc_extrap mode only)",
                    "fa_icon": "fas fa-ban",
                    "help_text": "Specify the maximum extrapolation that `lc_extrap` mode will perform.\n\n> Modifies preseq lc_extrap parameter: `-e`"
                },
                "preseq_terms": {
                    "type": "integer",
                    "default": 100,
                    "description": "Specify the maximum number of terms for extrapolation (lc_extrap mode only)",
                    "fa_icon": "fas fa-sort-numeric-up-alt",
                    "help_text": "Specify the maximum number of terms that `lc_extrap` mode will use.\n\n> Modifies preseq lc_extrap parameter: `-x`"
                },
                "preseq_bootstrap": {
                    "type": "integer",
                    "default": 100,
                    "description": "Specify number of bootstraps to perform (lc_extrap mode only)",
                    "fa_icon": "fab fa-bootstrap",
                    "help_text": "Specify the number of bootstraps `lc_extrap` mode will perform to calculate confidence intervals.\n\n> Modifies preseq lc_extrap parameter: `-n`"
                },
                "preseq_cval": {
                    "type": "number",
                    "default": 0.95,
                    "description": "Specify confidence interval level (lc_extrap mode only)",
                    "fa_icon": "fas fa-check-circle",
                    "help_text": "Specify the allowed level of confidence intervals used for `lc_extrap` mode.\n\n> Modifies preseq lc_extrap parameter: `-c`"
                }
            },
            "fa_icon": "fas fa-bezier-curve",
            "help_text": "nf-core/eager uses Preseq on mapped reads as one method to calculate library\ncomplexity. If DeDup is used, Preseq uses the histogram output of DeDup,\notherwise the sorted non-duplicated BAM file is supplied. Furthermore, if\npaired-end read collapsing is not performed, the `-P` flag is used."
        },
        "adna_damage_analysis": {
            "title": "(aDNA) Damage Analysis",
            "type": "object",
            "description": "Options for calculating and filtering for characteristic ancient DNA damage patterns.",
            "default": "",
            "properties": {
                "damage_calculation_tool": {
                    "type": "string",
                    "default": "damageprofiler",
                    "description": "Specify the tool to use for damage calculation.",
                    "fa_icon": "fas fa-tools",
                    "help_text": "Specify the tool to be used for damage calculation. DamageProfiler is generally faster than mapDamage2, but the latter has an option to limit the number of reads used. This can significantly speed up the processing of very large files, where the damage estimates are already accurate after processing only a fraction of the input. Options: `damageprofiler`, `mapdamage`. By default, DamageProfiler is used.",
                    "enum": [
                        "damageprofiler",
                        "mapdamage"
                    ]
                },
                "damageprofiler_length": {
                    "type": "integer",
                    "default": 100,
                    "description": "Specify length filter for DamageProfiler.",
                    "fa_icon": "fas fa-sort-amount-up",
                    "help_text": "Specifies the length filter for DamageProfiler. By default set to `100`.\n\n> Modifies DamageProfile parameter: `-l`"
                },
                "damageprofiler_threshold": {
                    "type": "integer",
                    "default": 15,
                    "description": "Specify number of bases of each read to consider for DamageProfiler calculations.",
                    "fa_icon": "fas fa-ruler-horizontal",
                    "help_text": "Specifies the length of the read start and end to be considered for profile generation in DamageProfiler. By default set to `15` bases.\n\n> Modifies DamageProfile parameter: `-t`"
                },
                "damageprofiler_yaxis": {
                    "type": "number",
                    "default": 0.3,
                    "description": "Specify the maximum misincorporation frequency that should be displayed on the damage plot. Set to 0 to 'autoscale'.",
                    "fa_icon": "fas fa-ruler-vertical",
                    "help_text": "Specifies what the maximum misincorporation frequency should be displayed as, in the DamageProfiler damage plot. This is set to `0.30` (i.e. 30%) by default as this matches the popular [mapDamage2.0](https://ginolhac.github.io/mapDamage) program. However, the default behaviour of DamageProfiler is to 'autoscale' the y-axis maximum to zoom in on any _possible_ damage that may occur (e.g. if the damage is about 10%, the highest value on the y-axis would be set to 0.12). This 'autoscale' behaviour can be turned on by specifying the number to `0`. Default: `0.30`.\n\n> Modifies DamageProfile parameter: `-yaxis_damageplot`"
                },
                "mapdamage_downsample": {
                    "type": "integer",
                    "default": 0,
                    "description": "Specify the maximum number of reads to consider for damage calculation. Defaults value is `0` (i.e. no downsampling is performed).",
                    "fa_icon": "fas fa-greater-than-equal",
                    "help_text": "The maximum number of reads used for damage calculation in mapDamage2. Can be used to significantly reduce the amount of time required for damage assessment. Note that a too low value can also obtain incorrect results.\n\n> Modifies mapDamage2 parameter: `-n`"
                },
                "mapdamage_yaxis": {
                    "type": "number",
                    "default": 0.3,
                    "description": "Specify the maximum misincorporation frequency that should be displayed on the damage plot.",
                    "fa_icon": "fas fa-ruler-vertical",
                    "help_text": "Specifies what the maximum misincorporation frequency should be displayed as, in the mapDamage2 damage plot. This defaults to `0.30` (i.e. 30%).\n\n> Modifies mapDamage2 parameter: `-y`"
                },
                "run_pmdtools": {
                    "type": "boolean",
                    "description": "Turn on PMDtools",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Specifies to run PMDTools for damage based read filtering and assessment of DNA damage in sequencing libraries. By default turned off.\n"
                },
                "pmdtools_range": {
                    "type": "integer",
                    "default": 10,
                    "description": "Specify range of bases for PMDTools to scan for damage.",
                    "fa_icon": "fas fa-arrows-alt-h",
                    "help_text": "Specifies the range in which to consider DNA damage from the ends of reads. By default set to `10`.\n\n> Modifies PMDTools parameter: `--range`"
                },
                "pmdtools_threshold": {
                    "type": "integer",
                    "default": 3,
                    "description": "Specify PMDScore threshold for PMDTools.",
                    "fa_icon": "fas fa-chart-bar",
                    "help_text": "Specifies the PMDScore threshold to use in the pipeline when filtering BAM files for DNA damage. Only reads which surpass this damage score are considered for downstream DNA analysis. By default set to `3` if not set specifically by the user.\n\n> Modifies PMDTools parameter: `--threshold`"
                },
                "pmdtools_reference_mask": {
                    "type": "string",
                    "description": "Specify a bedfile to be used to mask the reference fasta prior to running pmdtools.",
                    "fa_icon": "fas fa-mask",
                    "help_text": "Activates masking of the reference fasta prior to running pmdtools. Positions that are in the provided bedfile will be replaced by Ns in the reference genome. This is useful for capture data, where you might not want the allele of a SNP to be counted as damage when it is a transition. Masking of the reference is done using `bedtools maskfasta`."
                },
                "pmdtools_max_reads": {
                    "type": "integer",
                    "default": 10000,
                    "description": "Specify the maximum number of reads to consider for metrics generation.",
                    "fa_icon": "fas fa-greater-than-equal",
                    "help_text": "The maximum number of reads used for damage assessment in PMDtools. Can be used to significantly reduce the amount of time required for damage assessment in PMDTools. Note that a too low value can also obtain incorrect results.\n\n> Modifies PMDTools parameter: `-n`"
                },
                "pmdtools_platypus": {
                    "type": "boolean",
                    "description": "Append big list of base frequencies for platypus to output.",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Enables the printing of a wider list of base frequencies used by platypus as an addition to the output base misincorporation frequency table. By default turned off.\n"
                },
                "run_mapdamage_rescaling": {
                    "type": "boolean",
                    "fa_icon": "fas fa-map",
                    "description": "Turn on damage rescaling of BAM files using mapDamage2 to probabilistically remove damage.",
                    "help_text": "Turns on mapDamage2's BAM rescaling functionality. This probablistically replaces Ts back to Cs depending on the likelihood this reference-mismatch was originally caused by damage. If the library is specified to be single stranded, this will automatically use the `--single-stranded` mode.\n\nThis functionality does not have any MultiQC output.\n\n:warning: rescaled libraries will not be merged with non-scaled libraries of the same sample for downstream genotyping, as the model may be different for each library. If you wish to merge these, please do this manually and re-run nf-core/eager using the merged BAMs as input. \n\n> Modifies the `--rescale` parameter of mapDamage2"
                },
                "rescale_seqlength": {
                    "type": "integer",
                    "default": 12,
                    "fa_icon": "fas fa-ruler-horizontal",
                    "description": "Length of read sequence to use from each side for rescaling. Can be overridden by --rescale_length_*p.",
                    "help_text": "Specify the length from the end of the read that mapDamage should rescale at both ends.\n\n> Modifies the `--seq-length` parameter of mapDamage2."
                },
                "rescale_length_5p": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-balance-scale-right",
                    "description": "Length of read for mapDamage2 to rescale from 5p end.  Only used if not 0 otherwise --rescale_seqlength used.",
                    "help_text": "Specify the length from the end of the read that mapDamage should rescale. Overrides `--rescale_seqlength`.\n\n> Modifies the `--rescale-length-5p` parameter of mapDamage2."
                },
                "rescale_length_3p": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-balance-scale-left",
                    "description": "Length of read for mapDamage2 to rescale from 3p end. Only used if not 0 otherwise --rescale_seqlength used..",
                    "help_text": "Specify the length from the end of the read that mapDamage should rescale.\n\n> Modifies the `--rescale-length-3p` parameter of mapDamage2."
                }
            },
            "fa_icon": "fas fa-chart-line",
            "help_text": "More documentation can be seen in the follow links for:\n\n- [DamageProfiler](https://github.com/Integrative-Transcriptomics/DamageProfiler)\n- [PMDTools documentation](https://github.com/pontussk/PMDtools)\n\nIf using TSV input, DamageProfiler is performed per library, i.e. after lane\nmerging. PMDtools and  BAM Trimming is run after library merging of same-named\nlibrary BAMs that have the same type of UDG treatment. BAM Trimming is only\nperformed on non-UDG and half-UDG treated data.\n"
        },
        "feature_annotation_statistics": {
            "title": "Feature Annotation Statistics",
            "type": "object",
            "description": "Options for getting reference annotation statistics (e.g. gene coverages)",
            "default": "",
            "properties": {
                "run_bedtools_coverage": {
                    "type": "boolean",
                    "description": "Turn on ability to calculate no. reads, depth and breadth coverage of features in reference.",
                    "fa_icon": "fas fa-chart-area",
                    "help_text": "Specifies to turn on the bedtools module, producing statistics for breadth (or percent coverage), and depth (or X fold) coverages.\n"
                },
                "anno_file": {
                    "type": "string",
                    "description": "Path to GFF or BED file containing positions of features in reference file (--fasta). Path should be enclosed in quotes.",
                    "fa_icon": "fas fa-file-signature",
                    "help_text": "Specify the path to a GFF/BED containing the feature coordinates (or any acceptable input for [`bedtools coverage`](https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html)). Must be in quotes.\n"
                },
                "anno_file_is_unsorted": {
                    "type": "boolean",
                    "fa_icon": "fas fa-random",
                    "description": "Specify if the annotation file provided to --anno_file is not sorted in the same way as the reference fasta file.",
                    "help_text": "In cases where the annotation file is NOT sorted the same way as the reference fasta, this option should be specified. This will significantly increase the memory usage of bedtools!\n\n> Modifies bedtools parameter: `-sorted`"
                }
            },
            "fa_icon": "fas fa-scroll",
            "help_text": "If you're interested in looking at coverage stats for certain features on your\nreference such as genes, SNPs etc., you can use the following bedtools module\nfor this purpose.\n\nMore documentation on bedtools can be seen in the [bedtools\ndocumentation](https://bedtools.readthedocs.io/en/latest/)\n\nIf using TSV input, bedtools is run after library merging of same-named library\nBAMs that have the same type of UDG treatment.\n"
        },
        "bam_trimming": {
            "title": "BAM Trimming",
            "type": "object",
            "description": "Options for trimming of aligned reads (e.g. to remove damage prior genotyping).",
            "default": "",
            "properties": {
                "run_trim_bam": {
                    "type": "boolean",
                    "description": "Turn on BAM trimming. Will only run on non-UDG or half-UDG libraries",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Turns on the BAM trimming method. Trims off `[n]` bases from reads in the deduplicated BAM file. Damage assessment in PMDTools or DamageProfiler remains untouched, as data is routed through this independently. BAM trimming is typically performed to reduce errors during genotyping that can be caused by aDNA damage.\n\nBAM trimming will only be performed on libraries indicated as `--udg_type 'none'` or `--udg_type 'half'`. Complete UDG treatment ('full') should have removed all damage. The amount of bases that will be trimmed off can be set separately for libraries with `--udg_type` `'none'` and `'half'`  (see `--bamutils_clip_half_udg_left` / `--bamutils_clip_half_udg_right` / `--bamutils_clip_none_udg_left` / `--bamutils_clip_none_udg_right`).\n\n> Note: additional artefacts such as bar-codes or adapters that could potentially also be trimmed should be removed prior mapping."
                },
                "bamutils_clip_double_stranded_half_udg_left": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-ruler-combined",
                    "description": "Specify the number of bases to clip off reads from 'left' end of read for double-stranded half-UDG libraries.",
                    "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from double_stranded libraries whose UDG treatment is set to `half`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`"
                },
                "bamutils_clip_double_stranded_half_udg_right": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-ruler",
                    "description": "Specify the number of bases to clip off reads from 'right' end of read for double-stranded half-UDG libraries.",
                    "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from double_stranded libraries whose UDG treatment is set to `half`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`"
                },
                "bamutils_clip_double_stranded_none_udg_left": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-ruler-combined",
                    "description": "Specify the number of bases to clip off reads from 'left' end of read for double-stranded non-UDG libraries.",
                    "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from double_stranded libraries whose UDG treatment is set to `none`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`"
                },
                "bamutils_clip_double_stranded_none_udg_right": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-ruler",
                    "description": "Specify the number of bases to clip off reads from 'right' end of read for double-stranded non-UDG libraries.",
                    "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from double_stranded libraries whose UDG treatment is set to `none`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`"
                },
                "bamutils_clip_single_stranded_half_udg_left": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-ruler-combined",
                    "description": "Specify the number of bases to clip off reads from 'left' end of read for single-stranded half-UDG libraries.",
                    "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from single-stranded libraries whose UDG treatment is set to `half`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`"
                },
                "bamutils_clip_single_stranded_half_udg_right": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-ruler",
                    "description": "Specify the number of bases to clip off reads from 'right' end of read for single-stranded half-UDG libraries.",
                    "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from single-stranded libraries whose UDG treatment is set to `half`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`"
                },
                "bamutils_clip_single_stranded_none_udg_left": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-ruler-combined",
                    "description": "Specify the number of bases to clip off reads from 'left' end of read for single-stranded non-UDG libraries.",
                    "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from single-stranded libraries whose UDG treatment is set to `none`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`"
                },
                "bamutils_clip_single_stranded_none_udg_right": {
                    "type": "integer",
                    "default": 0,
                    "fa_icon": "fas fa-ruler",
                    "description": "Specify the number of bases to clip off reads from 'right' end of read for single-stranded non-UDG libraries.",
                    "help_text": "Default set to `0` and clips off no bases on the left or right side of reads from single-stranded libraries whose UDG treatment is set to `none`. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).\n\n> Modifies bamUtil's trimBam parameter: `-L -R`"
                },
                "bamutils_softclip": {
                    "type": "boolean",
                    "description": "Turn on using softclip instead of hard masking.",
                    "fa_icon": "fas fa-paint-roller",
                    "help_text": "By default, nf-core/eager uses hard clipping and sets clipped bases to `N` with quality `!` in the BAM output. Turn this on to use soft-clipping instead, masking reads at the read ends respectively using the CIGAR string.\n\n> Modifies bam trimBam parameter: `-c`"
                }
            },
            "fa_icon": "fas fa-eraser",
            "help_text": "For some library preparation protocols, users might want to clip off damaged\nbases before applying genotyping methods. This can be done in nf-core/eager\nautomatically by turning on the `--run_trim_bam` parameter.\n\nMore documentation can be seen in the [bamUtil\ndocumentation](https://genome.sph.umich.edu/wiki/BamUtil:_trimBam)\n"
        },
        "genotyping": {
            "title": "Genotyping",
            "type": "object",
            "description": "Options for variant calling.",
            "default": "",
            "properties": {
                "run_genotyping": {
                    "type": "boolean",
                    "description": "Turn on genotyping of BAM files.",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Turns on genotyping to run on all post-dedup and downstream BAMs. For example if `--run_pmdtools` and `--trim_bam` are both supplied, the genotyper will be run on all three BAM files i.e. post-deduplication, post-pmd and post-trimmed BAM files."
                },
                "genotyping_tool": {
                    "type": "string",
                    "description": "Specify which genotyper to use either GATK UnifiedGenotyper, GATK HaplotypeCaller, Freebayes, or pileupCaller. Options: 'ug', 'hc', 'freebayes', 'pileupcaller', 'angsd'.",
                    "fa_icon": "fas fa-tools",
                    "help_text": "Specifies which genotyper to use. Current options are: GATK (v3.5) UnifiedGenotyper or GATK Haplotype Caller (v4); and the FreeBayes Caller. Specify 'ug', 'hc', 'freebayes', 'pileupcaller' and 'angsd' respectively.\n\n> > Note that while UnifiedGenotyper is more suitable for low-coverage ancient DNA (HaplotypeCaller does _de novo_ assembly around each variant site), be aware GATK 3.5 it is officially deprecated by the Broad Institute.",
                    "enum": [
                        "ug",
                        "hc",
                        "freebayes",
                        "pileupcaller",
                        "angsd"
                    ]
                },
                "genotyping_source": {
                    "type": "string",
                    "default": "raw",
                    "description": "Specify which input BAM to use for genotyping. Options: 'raw', 'trimmed', 'pmd' or 'rescaled'.",
                    "fa_icon": "fas fa-faucet",
                    "help_text": "Indicates which BAM file to use for genotyping, depending on what BAM processing modules you have turned on. Options are: `'raw'` for mapped only, filtered, or DeDup BAMs (with priority right to left); `'trimmed'` (for base clipped BAMs); `'pmd'` (for pmdtools output); `'rescaled'` (for mapDamage2 rescaling output). Default is: `'raw'`.\n",
                    "enum": [
                        "raw",
                        "pmd",
                        "trimmed",
                        "rescaled"
                    ]
                },
                "gatk_call_conf": {
                    "type": "integer",
                    "default": 30,
                    "description": "Specify GATK phred-scaled confidence threshold.",
                    "fa_icon": "fas fa-balance-scale-right",
                    "help_text": "If selected, specify a GATK genotyper phred-scaled confidence threshold of a given SNP/INDEL call. Default: `30`\n\n> Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter: `-stand_call_conf`"
                },
                "gatk_ploidy": {
                    "type": "integer",
                    "default": 2,
                    "description": "Specify GATK organism ploidy.",
                    "fa_icon": "fas fa-pastafarianism",
                    "help_text": "If selected, specify a GATK genotyper ploidy value of your reference organism. E.g. if you want to allow heterozygous calls from >= diploid organisms. Default: `2`\n\n> Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter: `--sample-ploidy`"
                },
                "gatk_downsample": {
                    "type": "integer",
                    "default": 250,
                    "description": "Maximum depth coverage allowed for genotyping before down-sampling is turned on.",
                    "fa_icon": "fas fa-icicles",
                    "help_text": "Maximum depth coverage allowed for genotyping before down-sampling is turned on. Any position with a coverage higher than this value will be randomly down-sampled to 250 reads. Default: `250`\n\n> Modifies GATK UnifiedGenotyper parameter: `-dcov`"
                },
                "gatk_dbsnp": {
                    "type": "string",
                    "description": "Specify VCF file for SNP annotation of output VCF files. Optional. Gzip not accepted.",
                    "fa_icon": "fas fa-marker",
                    "help_text": "(Optional) Specify VCF file for output VCF SNP annotation e.g. if you want to annotate your VCF file with 'rs' SNP IDs. Check GATK documentation for more information. Gzip not accepted.\n"
                },
                "gatk_hc_out_mode": {
                    "type": "string",
                    "default": "EMIT_VARIANTS_ONLY",
                    "description": "Specify GATK output mode. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_ACTIVE_SITES'.",
                    "fa_icon": "fas fa-bullhorn",
                    "help_text": "If the GATK genotyper HaplotypeCaller is selected, what type of VCF to create, i.e. produce calls for every site or just confidence sites. Options: `'EMIT_VARIANTS_ONLY'`, `'EMIT_ALL_CONFIDENT_SITES'`, `'EMIT_ALL_ACTIVE_SITES'`. Default: `'EMIT_VARIANTS_ONLY'`\n\n> Modifies GATK HaplotypeCaller parameter: `-output_mode`",
                    "enum": [
                        "EMIT_ALL_ACTIVE_SITES",
                        "EMIT_ALL_CONFIDENT_SITES",
                        "EMIT_VARIANTS_ONLY"
                    ]
                },
                "gatk_hc_emitrefconf": {
                    "type": "string",
                    "default": "GVCF",
                    "description": "Specify HaplotypeCaller mode for emitting reference confidence calls . Options: 'NONE', 'BP_RESOLUTION', 'GVCF'.",
                    "fa_icon": "fas fa-bullhorn",
                    "help_text": "If the GATK HaplotypeCaller is selected, mode for emitting reference confidence calls. Options: `'NONE'`, `'BP_RESOLUTION'`, `'GVCF'`. Default: `'GVCF'`\n\n> Modifies GATK HaplotypeCaller parameter: `--emit-ref-confidence`\n",
                    "enum": [
                        "NONE",
                        "GVCF",
                        "BP_RESOLUTION"
                    ]
                },
                "gatk_ug_out_mode": {
                    "type": "string",
                    "default": "EMIT_VARIANTS_ONLY",
                    "description": "Specify GATK output mode. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_SITES'.",
                    "fa_icon": "fas fa-bullhorn",
                    "help_text": "If the GATK UnifiedGenotyper is selected, what type of VCF to create, i.e. produce calls for every site or just confidence sites. Options: `'EMIT_VARIANTS_ONLY'`, `'EMIT_ALL_CONFIDENT_SITES'`, `'EMIT_ALL_SITES'`. Default: `'EMIT_VARIANTS_ONLY'`\n\n> Modifies GATK UnifiedGenotyper parameter: `--output_mode`",
                    "enum": [
                        "EMIT_ALL_SITES",
                        "EMIT_ALL_CONFIDENT_SITES",
                        "EMIT_VARIANTS_ONLY"
                    ]
                },
                "gatk_ug_genotype_model": {
                    "type": "string",
                    "default": "SNP",
                    "description": "Specify UnifiedGenotyper likelihood model. Options: 'SNP', 'INDEL', 'BOTH', 'GENERALPLOIDYSNP', 'GENERALPLOIDYINDEL'.",
                    "fa_icon": "fas fa-project-diagram",
                    "help_text": "If the GATK UnifiedGenotyper is selected, which likelihood model to follow, i.e. whether to call use SNPs or INDELS etc. Options: `'SNP'`, `'INDEL'`, `'BOTH'`, `'GENERALPLOIDYSNP'`, `'GENERALPLOIDYINDEL`'. Default: `'SNP'`\n\n> Modifies GATK UnifiedGenotyper parameter: `--genotype_likelihoods_model`",
                    "enum": [
                        "SNP",
                        "INDEL",
                        "BOTH",
                        "GENERALPLOIDYSNP",
                        "GENERALPLOIDYINDEL"
                    ]
                },
                "gatk_ug_keep_realign_bam": {
                    "type": "boolean",
                    "description": "Specify to keep the BAM output of re-alignment around variants from GATK UnifiedGenotyper.",
                    "fa_icon": "fas fa-align-left",
                    "help_text": "If provided when running GATK's UnifiedGenotyper, this will put into the output folder the BAMs that have realigned reads (with GATK's (v3) IndelRealigner) around possible variants for improved genotyping.\n\nThese BAMs will be stored in the same folder as the corresponding VCF files."
                },
                "gatk_ug_defaultbasequalities": {
                    "type": "string",
                    "description": "Supply a default base quality if a read is missing a base quality score. Setting to -1 turns this off.",
                    "fa_icon": "fas fa-undo-alt",
                    "help_text": "When running GATK's UnifiedGenotyper,  specify a value to set base quality scores, if reads are missing this information. Might be useful if you have 'synthetically' generated reads (e.g. chopping up a reference genome). Default is set to -1  which is to not set any default quality (turned off). Default: `-1`\n\n> Modifies GATK UnifiedGenotyper parameter: `--defaultBaseQualities`"
                },
                "freebayes_C": {
                    "type": "integer",
                    "default": 1,
                    "description": "Specify minimum required supporting observations to consider a variant.",
                    "fa_icon": "fas fa-align-center",
                    "help_text": "Specify minimum required supporting observations to consider a variant. Default: `1`\n\n> Modifies freebayes parameter: `-C`"
                },
                "freebayes_g": {
                    "type": "integer",
                    "description": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified in --freebayes_C.",
                    "fa_icon": "fab fa-think-peaks",
                    "help_text": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified C. Not set by default.\n\n> Modifies freebayes parameter: `-g`",
                    "default": 0
                },
                "freebayes_p": {
                    "type": "integer",
                    "default": 2,
                    "description": "Specify ploidy of sample in FreeBayes.",
                    "fa_icon": "fas fa-pastafarianism",
                    "help_text": "Specify ploidy of sample in FreeBayes. Default is diploid. Default: `2`\n\n> Modifies freebayes parameter: `-p`"
                },
                "pileupcaller_bedfile": {
                    "type": "string",
                    "description": "Specify path to SNP panel in bed format for pileupCaller.",
                    "fa_icon": "fas fa-bed",
                    "help_text": "Specify a SNP panel in the form of a bed file of sites at which to generate pileup for pileupCaller.\n"
                },
                "pileupcaller_snpfile": {
                    "type": "string",
                    "description": "Specify path to SNP panel in EIGENSTRAT format for pileupCaller.",
                    "fa_icon": "fas fa-sliders-h",
                    "help_text": "Specify a SNP panel in [EIGENSTRAT](https://github.com/DReichLab/EIG/tree/master/CONVERTF) format, pileupCaller will call these sites.\n"
                },
                "pileupcaller_method": {
                    "type": "string",
                    "default": "randomHaploid",
                    "description": "Specify calling method to use. Options: 'randomHaploid', 'randomDiploid', 'majorityCall'.",
                    "fa_icon": "fas fa-toolbox",
                    "help_text": "Specify calling method to use. Options: randomHaploid, randomDiploid, majorityCall. Default: `'randomHaploid'`\n\n> Modifies pileupCaller parameter: `--randomHaploid --randomDiploid --majorityCall`",
                    "enum": [
                        "randomHaploid",
                        "randomDiploid",
                        "majorityCall"
                    ]
                },
                "pileupcaller_transitions_mode": {
                    "type": "string",
                    "default": "AllSites",
                    "description": "Specify the calling mode for transitions. Options: 'AllSites', 'TransitionsMissing', 'SkipTransitions'.",
                    "fa_icon": "fas fa-toggle-on",
                    "help_text": "Specify if genotypes of transition SNPs should be called, set to missing, or excluded from the genotypes respectively. Options: `'AllSites'`, `'TransitionsMissing'`, `'SkipTransitions'`. Default: `'AllSites'`\n\n> Modifies pileupCaller parameter: `--skipTransitions --transitionsMissing`",
                    "enum": [
                        "AllSites",
                        "TransitionsMissing",
                        "SkipTransitions"
                    ]
                },
                "pileupcaller_min_map_quality": {
                    "type": "integer",
                    "default": 30,
                    "description": "The minimum mapping quality to be used for genotyping.",
                    "fa_icon": "fas fa-filter",
                    "help_text": "The minimum mapping quality to be used for genotyping. Affects the `samtools pileup` output that is used by pileupcaller. Affects `-q` parameter of samtools mpileup."
                },
                "pileupcaller_min_base_quality": {
                    "type": "integer",
                    "default": 30,
                    "description": "The minimum base quality to be used for genotyping.",
                    "fa_icon": "fas fa-filter",
                    "help_text": "The minimum base quality to be used for genotyping. Affects the `samtools pileup` output that is used by pileupcaller. Affects `-Q` parameter of samtools mpileup."
                },
                "angsd_glmodel": {
                    "type": "string",
                    "default": "samtools",
                    "description": "Specify which ANGSD genotyping likelihood model to use. Options: 'samtools', 'gatk', 'soapsnp', 'syk'.",
                    "fa_icon": "fas fa-project-diagram",
                    "help_text": "Specify which genotype likelihood model to use. Options: `'samtools`, `'gatk'`, `'soapsnp'`, `'syk'`. Default: `'samtools'`\n\n> Modifies ANGSD parameter: `-GL`",
                    "enum": [
                        "samtools",
                        "gatk",
                        "soapsnp",
                        "syk"
                    ]
                },
                "angsd_glformat": {
                    "type": "string",
                    "default": "binary",
                    "description": "Specify which output type to output ANGSD genotyping likelihood results: Options: 'text', 'binary', 'binary_three', 'beagle'.",
                    "fa_icon": "fas fa-text-height",
                    "help_text": "Specifies what type of genotyping likelihood file format will be output. Options: `'text'`, `'binary'`, `'binary_three'`, `'beagle_binary'`. Default: `'text'`.\n\nThe options refer to the following descriptions respectively:\n\n- `text`: textoutput of all 10 log genotype likelihoods.\n- `binary`: binary all 10 log genotype likelihood\n- `binary_three`: binary 3 times likelihood\n- `beagle_binary`: beagle likelihood file\n\nSee the [ANGSD documentation](http://www.popgen.dk/angsd/) for more information on which to select for your downstream applications.\n\n> Modifies ANGSD parameter: `-doGlF`",
                    "enum": [
                        "text",
                        "binary",
                        "binary_three",
                        "beagle"
                    ]
                },
                "angsd_createfasta": {
                    "type": "boolean",
                    "description": "Turn on creation of FASTA from ANGSD genotyping likelihood.",
                    "fa_icon": "fas fa-align-justify",
                    "help_text": "Turns on the ANGSD creation of a FASTA file from the BAM file.\n"
                },
                "angsd_fastamethod": {
                    "type": "string",
                    "default": "random",
                    "description": "Specify which genotype type of 'base calling' to use for ANGSD FASTA generation. Options: 'random', 'common'.",
                    "fa_icon": "fas fa-toolbox",
                    "help_text": "The type of base calling to be performed when creating the ANGSD FASTA file. Options: `'random'` or `'common'`. Will output the most common non-N base at each given position, whereas 'random' will pick one at random. Default: `'random'`.\n\n> Modifies ANGSD parameter: `-doFasta -doCounts`",
                    "enum": [
                        "random",
                        "common"
                    ]
                },
                "run_bcftools_stats": {
                    "type": "boolean",
                    "default": true,
                    "description": "Turn on bcftools stats generation for VCF based variant calling statistics",
                    "help_text": "Runs `bcftools stats` against VCF files from GATK and FreeBayes genotypers.\n\nIt will automatically include the FASTA reference for INDEL-related statistics.",
                    "fa_icon": "far fa-chart-bar"
                }
            },
            "fa_icon": "fas fa-sliders-h",
            "help_text": "There are options for different genotypers (or genotype likelihood calculators)\nto be used. We suggest you read the documentation of each tool to find the ones that\nsuit your needs.\n\nDocumentation for each tool:\n\n- [GATK\n  UnifiedGenotyper](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.5-0/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php)\n- [GATK\n  HaplotypeCaller](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php)\n- [FreeBayes](https://github.com/ekg/freebayes)\n- [ANGSD](http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods)\n- [sequenceTools pileupCaller](https://github.com/stschiff/sequenceTools)\n\nIf using TSV input, genotyping is performed per sample (i.e. after all types of\nlibraries are merged), except for pileupCaller which gathers all double-stranded and\nsingle-stranded (same-type merged) libraries respectively."
        },
        "consensus_sequence_generation": {
            "title": "Consensus Sequence Generation",
            "type": "object",
            "description": "Options for creation of a per-sample FASTA sequence useful for downstream analysis (e.g. multi sequence alignment)",
            "default": "",
            "properties": {
                "run_vcf2genome": {
                    "type": "boolean",
                    "description": "Turns on ability to create a consensus sequence FASTA file based on a UnifiedGenotyper VCF file and the original reference (only considers SNPs).",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Turn on consensus sequence genome creation via VCF2Genome. Only accepts GATK UnifiedGenotyper VCF files with the `--gatk_ug_out_mode 'EMIT_ALL_SITES'` and `--gatk_ug_genotype_model 'SNP` flags. Typically useful for small genomes such as mitochondria.\n"
                },
                "vcf2genome_outfile": {
                    "type": "string",
                    "description": "Specify the name of the output FASTA file containing the consensus sequence.",
                    "fa_icon": "fas fa-file-alt",
                    "help_text": "The output FASTA file will be named `<sample_name>_<vcf2genome_outfile>.fasta`.\n"
                },
                "vcf2genome_header": {
                    "type": "string",
                    "description": "Specify the header name of the consensus sequence entry within the FASTA file.",
                    "fa_icon": "fas fa-heading",
                    "help_text": "The name of the FASTA entry you would like in your FASTA file.\n"
                },
                "vcf2genome_minc": {
                    "type": "integer",
                    "default": 5,
                    "description": "Minimum depth coverage required for a call to be included (else N will be called).",
                    "fa_icon": "fas fa-sort-amount-up",
                    "help_text": "Minimum depth coverage for a SNP to be made. Else, a SNP will be called as N. Default: `5`\n\n> Modifies VCF2Genome parameter: `-minc`"
                },
                "vcf2genome_minq": {
                    "type": "integer",
                    "default": 30,
                    "description": "Minimum genotyping quality of a call to be called. Else N will be called.",
                    "fa_icon": "fas fa-medal",
                    "help_text": "Minimum genotyping quality of a call to be made. Else N will be called. Default: `30`\n\n> Modifies VCF2Genome parameter: `-minq`"
                },
                "vcf2genome_minfreq": {
                    "type": "number",
                    "default": 0.8,
                    "description": "Minimum fraction of reads supporting a call to be included. Else N will be called.",
                    "fa_icon": "fas fa-percent",
                    "help_text": "In the case of two possible alleles, the frequency of the majority allele required for a call to be made. Else, a SNP will be called as N. Default: `0.8`\n\n> Modifies VCF2Genome parameter: `-minfreq`"
                }
            },
            "fa_icon": "fas fa-handshake",
            "help_text": "If using TSV input, consensus generation is performed per sample (i.e. after all\ntypes of libraries are merged)."
        },
        "snp_table_generation": {
            "title": "SNP Table Generation",
            "type": "object",
            "description": "Options for creation of a SNP table useful for downstream analysis (e.g. estimation of cross-mapping of different species and multi-sequence alignment)",
            "default": "",
            "properties": {
                "run_multivcfanalyzer": {
                    "type": "boolean",
                    "description": "Turn on MultiVCFAnalyzer. Note: This currently only supports diploid GATK UnifiedGenotyper input.",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Turns on MultiVCFAnalyzer. Will only work when in combination with UnifiedGenotyper genotyping module.\n"
                },
                "write_allele_frequencies": {
                    "type": "boolean",
                    "description": "Turn on writing write allele frequencies in the SNP table.",
                    "fa_icon": "fas fa-pen",
                    "help_text": "Specify whether to tell MultiVCFAnalyzer to write within the SNP table the frequencies of the allele at that position e.g. A (70%).\n"
                },
                "min_genotype_quality": {
                    "type": "integer",
                    "default": 30,
                    "description": "Specify the minimum genotyping quality threshold for a SNP to be called.",
                    "fa_icon": "fas fa-medal",
                    "help_text": "The minimal genotyping quality for a SNP to be considered for processing by MultiVCFAnalyzer. The default threshold is `30`.\n"
                },
                "min_base_coverage": {
                    "type": "integer",
                    "default": 5,
                    "description": "Specify the minimum number of reads a position needs to be covered to be considered for base calling.",
                    "fa_icon": "fas fa-sort-amount-up",
                    "help_text": "The minimal number of reads covering a base for a SNP at that position to be considered for processing by MultiVCFAnalyzer. The default depth is `5`.\n"
                },
                "min_allele_freq_hom": {
                    "type": "number",
                    "default": 0.9,
                    "description": "Specify the minimum allele frequency that a base requires to be considered a 'homozygous' call.",
                    "fa_icon": "fas fa-percent",
                    "help_text": "The minimal frequency of a nucleotide for a 'homozygous' SNP to be called. In other words, e.g. 90% of the reads covering that position must have that SNP to be called. If the threshold is not reached, and the previous two parameters are matched, a reference call is made (displayed as . in the SNP table). If the above two parameters are not met, an 'N' is called. The default allele frequency is `0.9`.\n"
                },
                "min_allele_freq_het": {
                    "type": "number",
                    "default": 0.9,
                    "description": "Specify the minimum allele frequency that a base requires to be considered a 'heterozygous' call.",
                    "fa_icon": "fas fa-percent",
                    "help_text": "The minimum frequency of a nucleotide for a 'heterozygous' SNP to be called. If\nthis parameter is set to the same as `--min_allele_freq_hom`, then only\nhomozygous calls are made. If this value is less than the previous parameter,\nthen a SNP call will be made. If it is between this and the previous parameter,\nit will be displayed as a IUPAC uncertainty call. Default is `0.9`."
                },
                "additional_vcf_files": {
                    "type": "string",
                    "description": "Specify paths to additional pre-made VCF files to be included in the SNP table generation. Use wildcard(s) for multiple files.",
                    "fa_icon": "fas fa-copy",
                    "help_text": "If you wish to add to the table previously created VCF files, specify here a path with wildcards (in quotes). These VCF files must be created the same way as your settings for [GATK UnifiedGenotyping](#genotyping-parameters) module above."
                },
                "reference_gff_annotations": {
                    "type": "string",
                    "default": "NA",
                    "description": "Specify path to the reference genome annotations in '.gff' format. Optional.",
                    "fa_icon": "fas fa-file-signature",
                    "help_text": "If you wish to report in the SNP table annotation information for the regions\nSNPs fall in, provide a file in GFF format (the path must be in quotes).\n"
                },
                "reference_gff_exclude": {
                    "type": "string",
                    "default": "NA",
                    "description": "Specify path to the positions to be excluded in '.gff' format. Optional.",
                    "fa_icon": "fas fa-times",
                    "help_text": "If you wish to exclude SNP regions from consideration by MultiVCFAnalyzer (such as for problematic regions), provide a file in GFF format (the path must be in quotes).\n"
                },
                "snp_eff_results": {
                    "type": "string",
                    "default": "NA",
                    "description": "Specify path to the output file from SNP effect analysis in '.txt' format. Optional.",
                    "fa_icon": "fas fa-magic",
                    "help_text": "If you wish to include results from SNPEff effect analysis, supply the output\nfrom SNPEff in txt format (the path must be in quotes)."
                }
            },
            "fa_icon": "fas fa-table",
            "help_text": "SNP Table Generation here is performed by MultiVCFAnalyzer. The current version\nof MultiVCFAnalyzer version only accepts GATK UnifiedGenotyper 3.5 VCF files,\nand when the ploidy was set to 2 (this allows MultiVCFAnalyzer to report\nfrequencies of polymorphic positions). A description of how the tool works can\nbe seen in the Supplementary Information of [Bos et al.\n(2014)](https://doi.org/10.1038/nature13591) under \"SNP Calling and Phylogenetic\nAnalysis\".\n\nMore can be seen in the [MultiVCFAnalyzer\ndocumentation](https://github.com/alexherbig/MultiVCFAnalyzer).\n\nIf using TSV input, MultiVCFAnalyzer is performed on all samples gathered\ntogether."
        },
        "mitochondrial_to_nuclear_ratio": {
            "title": "Mitochondrial to Nuclear Ratio",
            "type": "object",
            "description": "Options for the calculation of ratio of reads to one chromosome/FASTA entry against all others.",
            "default": "",
            "properties": {
                "run_mtnucratio": {
                    "type": "boolean",
                    "description": "Turn on mitochondrial to nuclear ratio calculation.",
                    "fa_icon": "fas fa-balance-scale-left",
                    "help_text": "Turn on the module to estimate the ratio of mitochondrial to nuclear reads.\n"
                },
                "mtnucratio_header": {
                    "type": "string",
                    "default": "MT",
                    "description": "Specify the name of the reference FASTA entry corresponding to the mitochondrial genome (up to the first space).",
                    "fa_icon": "fas fa-heading",
                    "help_text": "Specify the FASTA entry in the reference file specified as `--fasta`, which acts\nas the mitochondrial 'chromosome' to base the ratio calculation on. The tool\nonly accepts the first section of the header before the first space. The default\nchromosome name is based on hs37d5/GrCH37 human reference genome. Default: 'MT'"
                }
            },
            "fa_icon": "fas fa-balance-scale-left",
            "help_text": "If using TSV input, Mitochondrial to Nuclear Ratio calculation is calculated per\ndeduplicated library (after lane merging)"
        },
        "human_sex_determination": {
            "title": "Human Sex Determination",
            "type": "object",
            "description": "Options for the calculation of biological sex of human individuals.",
            "default": "",
            "properties": {
                "run_sexdeterrmine": {
                    "type": "boolean",
                    "description": "Turn on sex determination for human reference genomes. This will run on single- and double-stranded variants of a library separately.",
                    "fa_icon": "fas fa-transgender-alt",
                    "help_text": "Specify to run the optional process of sex determination.\n"
                },
                "sexdeterrmine_bedfile": {
                    "type": "string",
                    "description": "Specify path to SNP panel in bed format for error bar calculation. Optional (see documentation).",
                    "fa_icon": "fas fa-bed",
                    "help_text": "Specify an optional bedfile of the list of SNPs to be used for X-/Y-rate calculation. Running without this parameter will considerably increase runtime, and render the resulting error bars untrustworthy. Theoretically, any set of SNPs that are distant enough that two SNPs are unlikely to be covered by the same read can be used here. The programme was coded with the 1240K panel in mind. The path must be in quotes."
                }
            },
            "fa_icon": "fas fa-transgender",
            "help_text": "An optional process for human DNA. It can be used to calculate the relative\ncoverage of X and Y chromosomes compared to the autosomes (X-/Y-rate). Standard\nerrors for these measurements are also calculated, assuming a binomial\ndistribution of reads across the SNPs.\n\nIf using TSV input, SexDetERRmine is performed on all samples gathered together."
        },
        "nuclear_contamination_for_human_dna": {
            "title": "Nuclear Contamination for Human DNA",
            "type": "object",
            "description": "Options for the estimation of contamination of human DNA.",
            "default": "",
            "properties": {
                "run_nuclear_contamination": {
                    "type": "boolean",
                    "description": "Turn on nuclear contamination estimation for human reference genomes.",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Specify to run the optional processes for (human) nuclear DNA contamination estimation.\n"
                },
                "contamination_chrom_name": {
                    "type": "string",
                    "default": "X",
                    "description": "The name of the X chromosome in your bam/FASTA header. 'X' for hs37d5, 'chrX' for HG19.",
                    "fa_icon": "fas fa-address-card",
                    "help_text": "The name of the human chromosome X in your bam. `'X'` for hs37d5, `'chrX'` for HG19. Defaults to `'X'`."
                }
            },
            "fa_icon": "fas fa-radiation-alt"
        },
        "metagenomic_screening": {
            "title": "Metagenomic Screening",
            "type": "object",
            "description": "Options for metagenomic screening of off-target reads.",
            "default": "",
            "properties": {
                "metagenomic_complexity_filter": {
                    "type": "boolean",
                    "description": "Turn on removal of low-sequence complexity reads for metagenomic screening with bbduk",
                    "help_text": "Turns on low-sequence complexity filtering of off-target reads using `bbduk`.\n\nThis is typically performed to reduce the number of uninformative reads or potential false-positive reads, typically for input for metagenomic screening. This thus reduces false positive species IDs and also run-time and resource requirements.\n\nSee `--metagenomic_complexity_entropy` for how complexity is calculated. **Important** There are no MultiQC output results for this module, you must check the number of reads removed with the `_bbduk.stats` output file.\n\nDefault: off\n",
                    "fa_icon": "fas fa-filter"
                },
                "metagenomic_complexity_entropy": {
                    "type": "number",
                    "default": 0.3,
                    "description": "Specify the entropy threshold that under which a sequencing read will be complexity filtered out. This should be between 0-1.",
                    "minimum": 0,
                    "maximum": 1,
                    "help_text": "Specify a minimum entropy threshold that under which it will be _removed_ from the FASTQ file that goes into metagenomic screening. \n\nA mono-nucleotide read such as GGGGGG will have an entropy of 0, a completely random sequence has an entropy of almost 1.\n\nSee the `bbduk` [documentation](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/-filter) on entropy for more information.\n\n> Modifies`bbduk` parameter `entropy=`",
                    "fa_icon": "fas fa-percent"
                },
                "run_metagenomic_screening": {
                    "type": "boolean",
                    "description": "Turn on metagenomic screening module for reference-unmapped reads.",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Turn on the metagenomic screening module.\n"
                },
                "metagenomic_tool": {
                    "type": "string",
                    "description": "Specify which classifier to use. Options: 'malt', 'kraken'.",
                    "fa_icon": "fas fa-tools",
                    "help_text": "Specify which taxonomic classifier to use. There are two options available:\n\n- `kraken` for [Kraken2](https://ccb.jhu.edu/software/kraken2)\n- `malt` for [MALT](https://software-ab.informatik.uni-tuebingen.de/download/malt/welcome.html)\n\n:warning: **Important** It is very important to run `nextflow clean -f` on your\nNextflow run directory once completed. RMA6 files are VERY large and are\n_copied_ from a `work/` directory into the results folder. You should clean the\nwork directory with the command to ensure non-redundancy and large HDD\nfootprints!"
                },
                "database": {
                    "type": "string",
                    "description": "Specify path to classifier database directory. For Kraken2 this can also be a `.tar.gz` of the directory.",
                    "fa_icon": "fas fa-database",
                    "help_text": "Specify the path to the _directory_ containing your taxonomic classifier's database (malt or kraken).\n\nFor Kraken2, it can be either the path to the _directory_ or the path to the `.tar.gz` compressed directory of the Kraken2 database."
                },
                "metagenomic_min_support_reads": {
                    "type": "integer",
                    "default": 1,
                    "description": "Specify a minimum number of reads a taxon of sample total is required to have to be retained. Not compatible with --malt_min_support_mode 'percent'.",
                    "fa_icon": "fas fa-sort-numeric-up-alt",
                    "help_text": "Specify the minimum number of reads a given taxon is required to have to be retained as a positive 'hit'.  \nFor malt, this only applies when `--malt_min_support_mode` is set to 'reads'. Default: 1.\n\n> Modifies MALT or kraken_parse.py parameter: `-sup` and `-c` respectively\n"
                },
                "percent_identity": {
                    "type": "integer",
                    "default": 85,
                    "description": "Percent identity value threshold for MALT.",
                    "fa_icon": "fas fa-id-card",
                    "help_text": "Specify the minimum percent identity (or similarity) a sequence must have to the reference for it to be retained. Default is `85`\n\nOnly used when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-id`"
                },
                "malt_mode": {
                    "type": "string",
                    "default": "BlastN",
                    "description": "Specify which alignment mode to use for MALT. Options: 'Unknown', 'BlastN', 'BlastP', 'BlastX', 'Classifier'.",
                    "fa_icon": "fas fa-align-left",
                    "help_text": "Use this to run the program in 'BlastN', 'BlastP', 'BlastX' modes to align DNA\nand DNA, protein and protein, or DNA reads against protein references\nrespectively. Ensure your database matches the mode. Check the\n[MALT\nmanual](http://ab.inf.uni-tuebingen.de/data/software/malt/download/manual.pdf)\nfor more details. Default: `'BlastN'`\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-m`\n",
                    "enum": [
                        "BlastN",
                        "BlastP",
                        "BlastX"
                    ]
                },
                "malt_alignment_mode": {
                    "type": "string",
                    "default": "SemiGlobal",
                    "description": "Specify alignment method for MALT. Options: 'Local', 'SemiGlobal'.",
                    "fa_icon": "fas fa-align-center",
                    "help_text": "Specify what alignment algorithm to use. Options are 'Local' or 'SemiGlobal'. Local is a BLAST like alignment, but is much slower. Semi-global alignment aligns reads end-to-end. Default: `'SemiGlobal'`\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-at`",
                    "enum": [
                        "Local",
                        "SemiGlobal"
                    ]
                },
                "malt_top_percent": {
                    "type": "integer",
                    "default": 1,
                    "description": "Specify the percent for LCA algorithm for MALT (see MEGAN6 CE manual).",
                    "fa_icon": "fas fa-percent",
                    "help_text": "Specify the top percent value of the LCA algorithm. From the [MALT manual](http://ab.inf.uni-tuebingen.de/data/software/malt/download/manual.pdf): \"For each\nread, only those matches are used for taxonomic placement whose bit disjointScore is within\n10% of the best disjointScore for that read.\". Default: `1`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-top`"
                },
                "malt_min_support_mode": {
                    "type": "string",
                    "default": "percent",
                    "description": "Specify whether to use percent or raw number of reads for minimum support required for taxon to be retained for MALT. Options: 'percent', 'reads'.",
                    "fa_icon": "fas fa-drumstick-bite",
                    "help_text": "Specify whether to use a percentage, or raw number of reads as the value used to decide the minimum support a taxon requires to be retained.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-sup -supp`",
                    "enum": [
                        "percent",
                        "reads"
                    ]
                },
                "malt_min_support_percent": {
                    "type": "number",
                    "default": 0.01,
                    "description": "Specify the minimum percentage of reads a taxon of sample total is required to have to be retained for MALT.",
                    "fa_icon": "fas fa-percentage",
                    "help_text": "Specify the minimum number of reads (as a percentage of all assigned reads) a given taxon is required to have to be retained as a positive 'hit' in the RMA6 file. This only applies when `--malt_min_support_mode` is set to 'percent'. Default 0.01.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-supp`"
                },
                "malt_max_queries": {
                    "type": "integer",
                    "default": 100,
                    "description": "Specify the maximum number of queries a read can have for MALT.",
                    "fa_icon": "fas fa-phone",
                    "help_text": "Specify the maximum number of alignments a read can have. All further alignments are discarded. Default: `100`\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `-mq`"
                },
                "malt_memory_mode": {
                    "type": "string",
                    "default": "load",
                    "description": "Specify the memory load method. Do not use 'map' with GPFS file systems for MALT as can be very slow. Options: 'load', 'page', 'map'.",
                    "fa_icon": "fas fa-memory",
                    "help_text": "\nHow to load the database into memory. Options are `'load'`, `'page'` or `'map'`.\n'load' directly loads the entire database into memory prior seed look up, this\nis slow but compatible with all servers/file systems. `'page'` and `'map'`\nperform a sort of 'chunked' database loading, allowing seed look up prior entire\ndatabase loading. Note that Page and Map modes do not work properly not with\nmany remote file-systems such as GPFS. Default is `'load'`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MALT parameter: `--memoryMode`",
                    "enum": [
                        "load",
                        "page",
                        "map"
                    ]
                },
                "malt_sam_output": {
                    "type": "boolean",
                    "description": "Specify to also produce SAM alignment files. Note this includes both aligned and unaligned reads, and are gzipped. Note this will result in very large file sizes.",
                    "fa_icon": "fas fa-file-alt",
                    "help_text": "Specify to _also_ produce gzipped SAM files of all alignments and un-aligned reads in addition to RMA6 files. These are **not** soft-clipped or in 'sparse' format. Can be useful for downstream analyses due to more common file format. \n\n:warning: can result in very large run output directories as this is essentially duplication of the RMA6 files.\n\n> Modifies MALT parameter `-a -f`"
                }
            },
            "fa_icon": "fas fa-search",
            "help_text": "\nAn increasingly common line of analysis in high-throughput aDNA analysis today\nis simultaneously screening off target reads of the host for endogenous\nmicrobial signals - particularly of pathogens. Metagenomic screening is\ncurrently offered via MALT with aDNA specific verification via MaltExtract, or\nKraken2.\n\nPlease note the following:\n\n- :warning: Metagenomic screening is only performed on _unmapped_ reads from a\n  mapping step.\n  - You _must_ supply the `--run_bam_filtering` flag with unmapped reads in\n    FASTQ format.\n  - If you wish to run solely MALT (i.e. the HOPS pipeline), you must still\n    supply a small decoy genome like phiX or human mtDNA `--fasta`.\n- MALT database construction functionality is _not_ included within the pipeline\n  - this should be done independently, **prior** the nf-core/eager run.\n  - To use `malt-build` from the same version as `malt-run`, load either the\n    Docker, Singularity or Conda environment.\n- MALT can often require very large computing resources depending on your\n  database. We set a absolute minimum of 16 cores and 128GB of memory (which is\n  1/4 of the recommendation from the developer). Please leave an issue on the\n  [nf-core github](https://github.com/nf-core/eager/issues) if you would like to\n  see this changed.\n\n> :warning: Running MALT on a server with less than 128GB of memory should be\n> performed at your own risk.\n\nIf using TSV input, metagenomic screening is performed on all samples gathered\ntogether."
        },
        "metagenomic_authentication": {
            "title": "Metagenomic Authentication",
            "type": "object",
            "description": "Options for authentication of metagenomic screening performed by MALT.",
            "default": "",
            "properties": {
                "run_maltextract": {
                    "type": "boolean",
                    "description": "Turn on MaltExtract for MALT aDNA characteristics authentication.",
                    "fa_icon": "fas fa-power-off",
                    "help_text": "Turn on MaltExtract for MALT aDNA characteristics authentication of metagenomic output from MALT.\n\nMore can be seen in the [MaltExtract documentation](https://github.com/rhuebler/)\n\nOnly when `--metagenomic_tool malt` is also supplied"
                },
                "maltextract_taxon_list": {
                    "type": "string",
                    "description": "Path to a text file with taxa of interest (one taxon per row, NCBI taxonomy name format)",
                    "fa_icon": "fas fa-list-ul",
                    "help_text": "\nPath to a `.txt` file with taxa of interest you wish to assess for aDNA characteristics. In `.txt` file should be one taxon per row, and the taxon should be in a valid [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) name format.\n\nOnly when `--metagenomic_tool malt` is also supplied."
                },
                "maltextract_ncbifiles": {
                    "type": "string",
                    "description": "Path to directory containing containing NCBI resource files (ncbi.tre and ncbi.map; available: https://github.com/rhuebler/HOPS/)",
                    "fa_icon": "fas fa-database",
                    "help_text": "Path to directory containing containing the NCBI resource tree and taxonomy table files (ncbi.tre and ncbi.map; available at the [HOPS repository](https://github.com/rhuebler/HOPS/Resources)).\n\nOnly when `--metagenomic_tool malt` is also supplied."
                },
                "maltextract_filter": {
                    "type": "string",
                    "default": "def_anc",
                    "description": "Specify which MaltExtract filter to use. Options: 'def_anc', 'ancient', 'default', 'crawl', 'scan', 'srna', 'assignment'.",
                    "fa_icon": "fas fa-filter",
                    "help_text": "Specify which MaltExtract filter to use. This is used to specify what types of characteristics to scan for. The default will output statistics on all alignments, and then a second set with just reads with one C to T mismatch in the first 5 bases. Further details on other parameters can be seen in the [HOPS documentation](https://github.com/rhuebler/HOPS/#maltextract-parameters). Options: `'def_anc'`, `'ancient'`, `'default'`, `'crawl'`, `'scan'`, `'srna'`, 'assignment'. Default: `'def_anc'`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `-f`",
                    "enum": [
                        "def_anc",
                        "default",
                        "ancient",
                        "scan",
                        "crawl",
                        "srna"
                    ]
                },
                "maltextract_toppercent": {
                    "type": "number",
                    "default": 0.01,
                    "description": "Specify percent of top alignments to use.",
                    "fa_icon": "fas fa-percent",
                    "help_text": "Specify frequency of top alignments for each read to be considered for each node.\nDefault is 0.01, i.e. 1% of all reads (where 1 would correspond to 100%).\n\n> :warning: this parameter follows the same concept as `--malt_top_percent` but\n> uses a different notation i.e. integer (MALT) versus float (MALTExtract)\n\nDefault: `0.01`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `-a`"
                },
                "maltextract_destackingoff": {
                    "type": "boolean",
                    "description": "Turn off destacking.",
                    "fa_icon": "fas fa-align-center",
                    "help_text": "Turn off destacking. If left on, a read that overlaps with another read will be\nremoved (leaving a depth coverage of 1).\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--destackingOff`"
                },
                "maltextract_downsamplingoff": {
                    "type": "boolean",
                    "description": "Turn off downsampling.",
                    "fa_icon": "fab fa-creative-commons-sampling",
                    "help_text": "Turn off downsampling. By default, downsampling is on and will randomly select 10,000 reads if the number of reads on a node exceeds this number. This is to speed up processing, under the assumption at 10,000 reads the species is a 'true positive'.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--downSampOff`"
                },
                "maltextract_duplicateremovaloff": {
                    "type": "boolean",
                    "description": "Turn off duplicate removal.",
                    "fa_icon": "fas fa-align-left",
                    "help_text": "\nTurn off duplicate removal. By default, reads that are an exact copy (i.e. same start, stop coordinate and exact sequence match) will be removed as it is considered a PCR duplicate.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--dupRemOff`"
                },
                "maltextract_matches": {
                    "type": "boolean",
                    "description": "Turn on exporting alignments of hits in BLAST format.",
                    "fa_icon": "fas fa-equals",
                    "help_text": "\nExport alignments of hits for each node in BLAST format. By default turned off.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--matches`"
                },
                "maltextract_megansummary": {
                    "type": "boolean",
                    "description": "Turn on export of MEGAN summary files.",
                    "fa_icon": "fas fa-download",
                    "help_text": "Export 'minimal' summary files (i.e. without alignments) that can be loaded into [MEGAN6](https://doi.org/10.1371/journal.pcbi.1004957). By default turned off.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--meganSummary`"
                },
                "maltextract_percentidentity": {
                    "type": "number",
                    "description": "Minimum percent identity alignments are required to have to be reported. Recommended to set same as MALT parameter.",
                    "default": 85,
                    "fa_icon": "fas fa-id-card",
                    "help_text": "Minimum percent identity alignments are required to have to be reported. Higher values allows fewer mismatches between read and reference sequence, but therefore will provide greater confidence in the hit. Lower values allow more mismatches, which can account for damage and divergence of a related strain/species to the reference. Recommended to set same as MALT parameter or higher. Default: `85`.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--minPI`"
                },
                "maltextract_topalignment": {
                    "type": "boolean",
                    "description": "Turn on using top alignments per read after filtering.",
                    "fa_icon": "fas fa-star-half-alt",
                    "help_text": "Use the best alignment of each read for every statistic, except for those concerning read distribution and coverage. Default: off.\n\nOnly when `--metagenomic_tool malt` is also supplied.\n\n> Modifies MaltExtract parameter: `--useTopAlignment`"
                }
            },
            "fa_icon": "fas fa-tasks",
            "help_text": "Turn on MaltExtract for MALT aDNA characteristics authentication of metagenomic\noutput from MALT.\n\nMore can be seen in the [MaltExtract\ndocumentation](https://github.com/rhuebler/)\n\nOnly when `--metagenomic_tool malt` is also supplied"
        }
    },
    "allOf": [
        {
            "$ref": "#/definitions/input_output_options"
        },
        {
            "$ref": "#/definitions/input_data_additional_options"
        },
        {
            "$ref": "#/definitions/reference_genome_options"
        },
        {
            "$ref": "#/definitions/output_options"
        },
        {
            "$ref": "#/definitions/generic_options"
        },
        {
            "$ref": "#/definitions/max_job_request_options"
        },
        {
            "$ref": "#/definitions/institutional_config_options"
        },
        {
            "$ref": "#/definitions/skip_steps"
        },
        {
            "$ref": "#/definitions/complexity_filtering"
        },
        {
            "$ref": "#/definitions/read_merging_and_adapter_removal"
        },
        {
            "$ref": "#/definitions/mapping"
        },
        {
            "$ref": "#/definitions/host_removal"
        },
        {
            "$ref": "#/definitions/bam_filtering"
        },
        {
            "$ref": "#/definitions/deduplication"
        },
        {
            "$ref": "#/definitions/library_complexity_analysis"
        },
        {
            "$ref": "#/definitions/adna_damage_analysis"
        },
        {
            "$ref": "#/definitions/feature_annotation_statistics"
        },
        {
            "$ref": "#/definitions/bam_trimming"
        },
        {
            "$ref": "#/definitions/genotyping"
        },
        {
            "$ref": "#/definitions/consensus_sequence_generation"
        },
        {
            "$ref": "#/definitions/snp_table_generation"
        },
        {
            "$ref": "#/definitions/mitochondrial_to_nuclear_ratio"
        },
        {
            "$ref": "#/definitions/human_sex_determination"
        },
        {
            "$ref": "#/definitions/nuclear_contamination_for_human_dna"
        },
        {
            "$ref": "#/definitions/metagenomic_screening"
        },
        {
            "$ref": "#/definitions/metagenomic_authentication"
        }
    ]
}