Copy disabled (too large)
Download .txt
Showing preview only (11,938K chars total). Download the full file to get everything.
Repository: TobyBaril/EarlGrey
Branch: main
Commit: d6d8654acc2b
Files: 89
Total size: 11.4 MB
Directory structure:
gitextract_2xvj2bh0/
├── .gitpod.yml
├── CITATION.cff
├── CODE_OF_CONDUCT.md
├── Docker/
│ ├── Dockerfile
│ ├── README.md
│ ├── configInstructions.txt
│ └── getFiles.sh
├── LICENSE
├── README.md
├── Singularity/
│ └── README.md
├── conda/
│ ├── build.sh
│ ├── developmentNotes.md
│ └── meta.yaml
├── earlGrey
├── earlGreyAnnotationOnly
├── earlGreyLibConstruct
├── legacyDocs.md
├── modules/
│ └── trf409.linux64
└── scripts/
├── LTR_FINDER_parallel
├── TEstrainer/
│ ├── README.md
│ ├── TEstrainer
│ ├── TEstrainer.Rproj
│ ├── TEstrainer_for_earlGrey.sh
│ ├── data/
│ │ ├── acceptable_domains.tsv
│ │ ├── additional_domains.tsv
│ │ ├── exceptional_domains.tsv
│ │ ├── old_acceptable_domains.tsv
│ │ └── unacceptable_domains.tsv
│ └── scripts/
│ ├── Dfam_extractor.py
│ ├── TEtrim.py
│ ├── domain_plotter.R
│ ├── final_sorter.R
│ ├── indexer.py
│ ├── initial_domain_finder.R
│ ├── initial_mafft_setup.py
│ ├── mreps_parser.sh
│ ├── simple_repeat_filter_trim.R
│ ├── splitter.py
│ ├── strainer.R
│ └── trf_parser.py
├── autoLand.R
├── autoPie.R
├── autoPie.sh
├── backSwap.py
├── backSwapGFF.py
├── bin/
│ ├── LTR_FINDER.x86_64-1.0.7/
│ │ ├── ltr_finder
│ │ └── tRNA.list.txt
│ └── cut.pl
├── divergenceCalc/
│ ├── divergence_calc.py
│ └── divergence_plot.R
├── extract_align.py
├── faswap.py
├── filteringOverlappingRepeats.R
├── headSwap.sh
├── install_r_packages.R
├── makeGff.R
├── mergeRepeats.R
├── rcMergeRepeats
├── rcMergeRepeatsLoose
├── repeatCraft/
│ ├── Dockerfile
│ ├── LICENSE
│ ├── README.md
│ ├── example/
│ │ ├── example.rclabel.gff
│ │ ├── example.rmerge.gff
│ │ ├── example.summary.txt
│ │ ├── example_input.gff
│ │ ├── example_input.out
│ │ ├── example_ltrfinder.gff
│ │ ├── mapfile.tsv
│ │ ├── repeatcraft.cfg
│ │ └── repeatcraft_strict.cfg
│ ├── helper/
│ │ ├── combineGFFoverlapm.py
│ │ ├── extraFuseTEm.py
│ │ ├── extraTrueMergeTEm.py
│ │ ├── filtershortm.py
│ │ ├── fuseltr.py
│ │ ├── fusetem.py
│ │ ├── rcStatm.py
│ │ ├── reformatm.py
│ │ ├── repeatcraftHelper.py
│ │ ├── truemergeltrm.py
│ │ ├── truemergeltrmErrorManagement.py
│ │ └── truemergetem.py
│ ├── repeatcraft.py
│ ├── repeatcraftErrorManagement.py
│ └── test/
│ ├── ci.sh
│ └── requirements.txt
├── repeatcraft.py
└── repeatcraftErrorManagement.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitpod.yml
================================================
image: gitpod/workspace-base
tasks:
- name: install mamba and earlgrey
init: |
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh -b -p /workspace/conda && rm Mambaforge-$(uname)-$(uname -m).sh
/workspace/conda/bin/mamba init bash
source ${HOME}/.bashrc
mamba create -n earlgrey -c conda-forge -c bioconda \
earlgrey -y
mamba clean --all -y
mamba activate earlGrey
cd /workspace
wget https://sk13.cog.sanger.ac.uk/NC_045808_EarlWorkshop.fasta
command: |
/workspace/conda/bin/mamba init bash
source ~/.bashrc
mamba activate earlgrey
clear
github:
prebuilds:
# enable for the master/default branch (defaults to true)
master: true
# enable for all branches in this repo (defaults to false)
branches: true
# enable for pull requests coming from this repo (defaults to true)
pullRequests: true
# enable for pull requests coming from forks (defaults to false)
pullRequestsFromForks: true
# add a "Review in Gitpod" button as a comment to pull requests (defaults to true)
addComment: true
# add a "Review in Gitpod" button to pull requests (defaults to false)
addBadge: false
# add a label once the prebuild is ready to pull requests (defaults to false)
addLabel: prebuilt-in-gitpod
vscode:
extensions:
- anwar.papyrus-pdf
workspaceLocation: "/workspace"
================================================
FILE: CITATION.cff
================================================
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Earl Grey
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
version: "7.2.4"
authors:
- given-names: Tobias
family-names: Baril
email: toby.baril@icloud.com
affiliation: Université de Neuchâtel
orcid: 'https://orcid.org/0000-0002-5936-7531'
- given-names: James
family-names: Galbraith
affiliation: University of Exeter
orcid: 'https://orcid.org/0000-0002-1871-2108'
- given-names: Alex
family-names: Hayward
affiliation: University of Exeter
orcid: 'https://orcid.org/0000-0001-7413-718X'
================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.
================================================
FILE: Docker/Dockerfile
================================================
# Based on Dockerfile from Dfam TETools https://github.com/Dfam-consortium/TETools, with added database configuration, conda configuration, and Earl Grey installation
# and configuration
FROM debian:latest AS builder
RUN apt-get -y update && apt-get -y install \
curl gcc g++ make zlib1g-dev libgomp1 \
perl \
python3-h5py \
libfile-which-perl \
libtext-soundex-perl \
libjson-perl liburi-perl libwww-perl
# Install dependencies and some basic utilities
RUN apt-get -y update \
&& apt-get -y install \
aptitude \
libgomp1 \
perl \
python3-h5py \
libfile-which-perl \
libtext-soundex-perl \
libjson-perl liburi-perl libwww-perl \
libdevel-size-perl \
&& aptitude install -y ~pstandard ~prequired \
curl wget \
vim nano \
procps strace \
libpam-systemd-
COPY src/* /opt/src/
WORKDIR /opt/src
# Extract RMBlast
RUN cd /opt \
&& mkdir rmblast \
&& tar --strip-components=1 -x -f src/rmblast-2.13.0+-x64-linux.tar.gz -C rmblast \
&& rm src/rmblast-2.13.0+-x64-linux.tar.gz
# Compile HMMER
RUN tar -x -f hmmer-3.3.2.tar.gz \
&& cd hmmer-3.3.2 \
&& ./configure --prefix=/opt/hmmer && make && make install \
&& make clean
# Compile TRF
RUN tar -x -f trf-4.09.1.tar.gz \
&& cd TRF-4.09.1 \
&& mkdir build && cd build \
&& ../configure && make && cp ./src/trf /opt/trf \
&& cd .. && rm -r build
# Compile RepeatScout
RUN tar -x -f RepeatScout-1.0.6.tar.gz \
&& cd RepeatScout-1.0.6 \
&& sed -i 's#^INSTDIR =.*#INSTDIR = /opt/RepeatScout#' Makefile \
&& make && make install
# Compile and configure RECON
RUN tar -x -f RECON-1.08.tar.gz \
&& mv RECON-1.08 ../RECON \
&& cd ../RECON \
&& make -C src && make -C src install \
&& cp 00README bin/ \
&& sed -i 's#^\$path =.*#$path = "/opt/RECON/bin";#' scripts/recon.pl
# Compile cd-hit
RUN tar -x -f cd-hit-v4.8.1-2019-0228.tar.gz \
&& cd cd-hit-v4.8.1-2019-0228 \
&& make && mkdir /opt/cd-hit && PREFIX=/opt/cd-hit make install
# Compile genometools (for ltrharvest)
RUN tar -x -f gt-1.6.2.tar.gz \
&& cd genometools-1.6.2 \
&& make -i -j4 cairo=no && make -i cairo=no prefix=/opt/genometools install \
&& make cleanup
# Configure LTR_retriever
RUN cd /opt \
&& tar -x -f src/LTR_retriever-2.9.0.tar.gz \
&& mv LTR_retriever-2.9.0 LTR_retriever \
&& cd LTR_retriever \
&& sed -i \
-e 's#BLAST+=#BLAST+=/opt/rmblast/bin#' \
-e 's#RepeatMasker=#RepeatMasker=/opt/RepeatMasker#' \
-e 's#HMMER=#HMMER=/opt/hmmer/bin#' \
-e 's#CDHIT=#CDHIT=/opt/cd-hit#' \
paths
# Compile MAFFT
RUN tar -x -f mafft-7.505-without-extensions-src.tgz \
&& cd mafft-7.505-without-extensions/core \
&& sed -i 's#^PREFIX =.*#PREFIX = /opt/mafft#' Makefile \
&& make clean && make && make install \
&& make clean
# Compile NINJA
RUN cd /opt \
&& mkdir NINJA \
&& tar --strip-components=1 -x -f src/NINJA-cluster.tar.gz -C NINJA \
&& cd NINJA/NINJA \
&& make clean && make all
# Move UCSC tools
RUN mkdir /opt/ucsc_tools \
&& mv faToTwoBit twoBitInfo twoBitToFa /opt/ucsc_tools \
&& chmod +x /opt/ucsc_tools/*
#COPY LICENSE.ucsc /opt/ucsc_tools/LICENSE
# Compile and configure coseg
RUN cd /opt \
&& tar -x -f src/coseg-0.2.2.tar.gz \
&& cd coseg \
&& sed -i 's@#!.*perl@#!/usr/bin/perl@' preprocessAlignments.pl runcoseg.pl refineConsSeqs.pl \
&& sed -i 's#use lib "/usr/local/RepeatMasker";#use lib "/opt/RepeatMasker";#' preprocessAlignments.pl \
&& make -i
# Configure RepeatMasker
RUN cd /opt \
&& tar -x -f src/RepeatMasker-4.1.4.tar.gz \
&& chmod a+w RepeatMasker/Libraries \
&& cd RepeatMasker/Libraries \
&& cp /opt/src/Dfam.h5.gz . \
&& gunzip -f Dfam.h5.gz \
&& cd /opt/RepeatMasker \
#&& cp /opt/src/RepBaseRepeatMaskerEdition-20181026.tar.gz . \
#&& tar -zxf RepBaseRepeatMaskerEdition-20181026.tar.gz \
#&& wget https://github.com/rmhubley/RepeatMasker/blob/development/Libraries/RMRBMeta.embl \
#&& cd /opt/RepeatMasker \
&& mv famdb.py famdb.py.bak \
&& cp /opt/src/famdb.py . \
&& chmod 755 famdb.py \
&& perl configure \
-hmmer_dir=/opt/hmmer/bin \
-rmblast_dir=/opt/rmblast/bin \
-libdir=/opt/RepeatMasker/Libraries \
-trf_prgm=/opt/trf \
-default_search_engine=rmblast \
&& echo "RepeatMasker configured..." && cd .. && rm src/RepeatMasker-4.1.4.tar.gz
# Configure RepeatModeler
RUN cd /opt \
&& tar -x -f src/RepeatModeler-2.0.4.tar.gz \
&& mv RepeatModeler-2.0.4 RepeatModeler \
&& cd RepeatModeler \
&& perl configure \
-cdhit_dir=/opt/cd-hit -genometools_dir=/opt/genometools/bin \
-ltr_retriever_dir=/opt/LTR_retriever -mafft_dir=/opt/mafft/bin \
-ninja_dir=/opt/NINJA/NINJA -recon_dir=/opt/RECON/bin \
-repeatmasker_dir=/opt/RepeatMasker \
-rmblast_dir=/opt/rmblast/bin -rscout_dir=/opt/RepeatScout \
-trf_dir=/opt/ \
-ucsctools_dir=/opt/ucsc_tools
FROM debian:latest
# Install dependencies and some basic utilities
RUN apt update && apt-get -y install git
RUN apt-get -y update \
&& apt-get -y install \
aptitude \
libgomp1 \
perl \
python3-h5py \
libfile-which-perl \
libtext-soundex-perl \
libjson-perl liburi-perl libwww-perl \
libdevel-size-perl \
&& aptitude install -y ~pstandard ~prequired \
curl wget \
vim nano \
procps strace \
libpam-systemd-
RUN cd /opt/ \
&& curl https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh --output anaconda.sh \
&& bash ./anaconda.sh -b -p /anaconda3 \
&& eval "$(/anaconda3/bin/conda shell.bash hook)"
ENV PATH=$PATH:/opt/RepeatMasker:/opt/RepeatMasker/util:/opt/RepeatModeler:/opt/RepeatModeler/util:/opt/coseg:/opt/ucsc_tools:/opt:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/cd-hit/
COPY --from=builder /opt /opt
RUN echo "PS1='(earlgrey \$(pwd))\\\$ '" >> /etc/bash.bashrc
ENV LANG=C
ENV PYTHONIOENCODING=utf8
RUN apt update && apt install build-essential -y --no-install-recommends
RUN apt update && apt install libcurl4-openssl-dev -y && apt-get install libbz2-dev liblzma-dev libtiff-dev libfreetype6-dev pkgconf libfontconfig1-dev libharfbuzz-dev libfribidi-dev bc -y
ARG USER_ID
ARG GROUP_ID
RUN addgroup --gid $GROUP_ID user
RUN adduser --disabled-password --gecos '' --uid $USER_ID --gid $GROUP_ID --home /home/user user
RUN chmod a+rwx /opt/*
RUN chmod a+rwx /home/user
RUN chown -R user /home/user
RUN cd /home/user \
&& git clone https://github.com/ridgelab/SA-SSR \
&& cd SA-SSR/ \
&& sed -i "s|PREFIX=/usr/local/bin|PREFIX=/home/user/SA-SSR/|g" Makefile && make && make install
RUN chmod a+rwx /home/user/SA-SSR/bin/* /home/user/SA-SSR/*
RUN chown -R user /home/user/SA-SSR/
USER user
WORKDIR /home/user
RUN git clone https://github.com/TobyBaril/EarlGrey \
&& cd EarlGrey \
&& chmod +x ./configureForDocker \
&& eval "$(/anaconda3/bin/conda shell.bash hook)" \
&& ./configureForDocker
ENV PATH=$PATH:/home/user/EarlGrey/:/home/user/SA-SSR/bin/
#RUN echo "#!/bin/bash" > /work/start.sh \
# && echo 'eval "$(/anaconda3/bin/conda shell.bash hook)"' >> /work/start.sh \
# && echo "conda activate earlGrey" >> /work/start.sh \
# && chmod +x /work/start.sh
================================================
FILE: Docker/README.md
================================================
# Docker Container
A Docker container has been generated with none of Dfam 3.9, but with script generation to source required partitions
I try to keep an up-to-date container in docker hub, but this might not always be the case depending on if I have had time to build and upload a new image. Currently, the recommended image ready for use is `-nodfam` version. Upon running the container interactively and running the command `earlGrey`, instructions will print to `stdout` and a script that you can use will be placed in `/usr/local/share/RepeatMasker/Libraries/famdb/` when the container is running.
```
# Interactive mode
# Version 7.2.4 with no preconfigured partitions (RECOMMENDED!) - bind a directory, in my case the current directory using pwd
docker run -it -v 'pwd':/data/ tobybaril/earlgrey:latest-nodfam
# change to library directory
cd /data/
# run earlGrey to make the configuration script
earlGrey
# then alter script with required partitions and run the configuration script
# change 0-16 to whichever you require, but at least 0. This relates to the partitions of Dfam 3.9 (https://www.dfam.org/releases/Dfam_3.9/families/FamDB/)
##### e.g. for 0-5:
sed -i '/^curl/ s/0-16/0-5/g' configure_dfam39.sh
##### e.g for 1,3,5:
sed -i '/^curl/ s/0-16/1,3,5/g' configure_dfam39.sh
# run the configuration script
bash configure_dfam39.sh
# return to your data directory
cd /data/
```
================================================
FILE: Docker/configInstructions.txt
================================================
# Get Earl Grey to run in Docker
# Download dependencies
./getFiles.sh
# NOTE: please make sure that Dfam.h5.gz and RepBase libraries tar.gz are inside the ./src directory after running ./getsrc.sh .
# If you need to download these,
# Dfam.h5.gz can be downloaded by running:
wget https://www.dfam.org/releases/Dfam_3.7/families/Dfam.h5.gz # THIS IS A BIG FILE!
# Repbase is now behind a paywall, if you do not have access, please comment out the following lines in the Dockerfile (lines 105-107)
# && cd /opt/RepeatMasker \
# && cp /opt/src/RepBaseRepeatMaskerEdition-20181026.tar.gz . \
# && tar -zxf RepBaseRepeatMaskerEdition-20181026.tar.gz \
# Build a docker container (run from inside the directory where the Dockerfile for EarlGrey is stored)
docker build . -t earlgrey
# start the docker container
docker run -it --rm --init --mount type=bind,source="$(pwd)",target=/work --user "$(id -u):$(id -g)" --workdir "/work" --env "HOME=/work" earlgrey "$@"
# IMPORTANT - once the docker container has started, run these commands to activate the conda environment for EarlGrey
# we strongly recommend upgrading the RepeatMasker contained within Docker with the Dfam libraries and RepBase, information on this is found: https://github.com/Dfam-consortium/TETools
# note there are only ~6000 seqs in the basic RepeatMasker library in this container, real Dfam is much larger!
eval "$(/anaconda3/bin/conda shell.bash hook)"
conda activate earlGrey
================================================
FILE: Docker/getFiles.sh
================================================
#!/bin/sh
set -eu
# download src name
download() {
src="$1"
shift
if [ $# -ge 1 ]; then
name="$1"
else
name="${src##*/}"
fi
dest="src/$name"
if [ -n "${ALWAYS-}" ] || ! [ -f "$dest" ]; then
echo "Downloading $src to $dest"
curl -sSL "$src" > "$dest"
fi
}
mkdir -p src
download https://www.repeatmasker.org/rmblast/rmblast-2.13.0+-x64-linux.tar.gz
download http://eddylab.org/software/hmmer/hmmer-3.3.2.tar.gz
download https://github.com/Benson-Genomics-Lab/TRF/archive/v4.09.1.tar.gz trf-4.09.1.tar.gz
download https://www.repeatmasker.org/RepeatScout-1.0.6.tar.gz
download https://www.repeatmasker.org/RepeatModeler/RECON-1.08.tar.gz
download https://github.com/weizhongli/cdhit/releases/download/V4.8.1/cd-hit-v4.8.1-2019-0228.tar.gz
download https://github.com/genometools/genometools/archive/v1.6.2.tar.gz gt-1.6.2.tar.gz
download https://github.com/oushujun/LTR_retriever/archive/v2.9.0.tar.gz LTR_retriever-2.9.0.tar.gz
download https://mafft.cbrc.jp/alignment/software/mafft-7.505-without-extensions-src.tgz
download https://github.com/TravisWheelerLab/NINJA/archive/0.97-cluster_only.tar.gz NINJA-cluster.tar.gz
download https://www.repeatmasker.org/coseg-0.2.2.tar.gz
download https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.4.tar.gz
download https://github.com/Dfam-consortium/RepeatModeler/archive/2.0.4.tar.gz RepeatModeler-2.0.4.tar.gz
download https://github.com/Dfam-consortium/FamDB/raw/master/famdb.py famdb.py
download https://www.dfam.org/releases/Dfam_3.7/families/Dfam_curatedonly.h5.gz Dfam.h5.gz
# TODO: /exe/ only includes binaries of the "latest" version at the time of download.
# The version listed in README.md is obtained by running 'strings src/faToTwoBit | grep kent'
# On whatever was downloaded.
# Consider building these tools from source instead.
for tool in faToTwoBit twoBitInfo twoBitToFa; do
download https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/"$tool"
done
================================================
FILE: LICENSE
================================================
Open Software License v. 2.1
============================
This Open Software License (the "License") applies to any original work
of authorship (the "Original Work") whose owner (the "Licensor") has
placed the following notice immediately following the copyright notice
for the Original Work: Licensed under the Open Software License version 2.1
1) Grant of Copyright License. Licensor hereby grants You a world-wide,
royalty-free, non-exclusive, perpetual, sublicenseable license to do
the following:
to reproduce the Original Work in copies;
to prepare derivative works ("Derivative Works") based upon the Original Work;
to distribute copies of the Original Work and Derivative Works to the
public, with the proviso that copies of Original Work or Derivative Works
that You distribute shall be licensed under the Open Software License;
to perform the Original Work publicly; and
to display the Original Work publicly.
2) Grant of Patent License. Licensor hereby grants You a world-wide,
royalty-free, non-exclusive, perpetual, sublicenseable license, under
patent claims owned or controlled by the Licensor that are embodied in
the Original Work as furnished by the Licensor, to make, use, sell and
offer for sale the Original Work and Derivative Works.
3) Grant of Source Code License. The term "Source Code" means the
preferred form of the Original Work for making modifications to it
and all available documentation describing how to modify the Original
Work. Licensor hereby agrees to provide a machine-readable copy of the
Source Code of the Original Work along with each copy of the Original
Work that Licensor distributes. Licensor reserves the right to satisfy
this obligation by placing a machine-readable copy of the Source Code in
an information repository reasonably calculated to permit inexpensive and
convenient access by You for as long as Licensor continues to distribute
the Original Work, and by publishing the address of that information
repository in a notice immediately following the copyright notice that
applies to the Original Work.
4) Exclusions From License Grant. Neither the names of Licensor, nor
the names of any contributors to the Original Work, nor any of their
trademarks or service marks, may be used to endorse or promote products
derived from this Original Work without express prior written permission
of the Licensor. Nothing in this License shall be deemed to grant any
rights to trademarks, copyrights, patents, trade secrets or any other
intellectual property of Licensor except as expressly stated herein. No
patent license is granted to make, use, sell or offer to sell embodiments
of any patent claims other than the licensed claims defined in Section
2. No right is granted to the trademarks of Licensor even if such marks
are included in the Original Work. Nothing in this License shall be
interpreted to prohibit Licensor from licensing under different terms
from this License any Original Work that Licensor otherwise would have
a right to license.
5) External Deployment. The term "External Deployment" means the use or
distribution of the Original Work or Derivative Works in any way such that
the Original Work or Derivative Works may be used by anyone other than
You, whether the Original Work or Derivative Works are distributed to
those persons or made available as an application intended for use over
a computer network. As an express condition for the grants of license
hereunder, You agree that any External Deployment by You of a Derivative
Work shall be deemed a distribution and shall be licensed to all under
the terms of this License, as prescribed in section 1(c) herein.
6) Attribution Rights. You must retain, in the Source Code of any
Derivative Works that You create, all copyright, patent or trademark
notices from the Source Code of the Original Work, as well as any
notices of licensing and any descriptive text identified therein as an
"Attribution Notice." You must cause the Source Code for any Derivative
Works that You create to carry a prominent Attribution Notice reasonably
calculated to inform recipients that You have modified the Original Work.
7) Warranty of Provenance and Disclaimer of Warranty. Licensor
warrants that the copyright in and to the Original Work and the patent
rights granted herein by Licensor are owned by the Licensor or are
sublicensed to You under the terms of this License with the permission
of the contributor(s) of those copyrights and patent rights. Except as
expressly stated in the immediately proceeding sentence, the Original
Work is provided under this License on an "AS IS" BASIS and WITHOUT
WARRANTY, either express or implied, including, without limitation,
the warranties of NON-INFRINGEMENT, MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY OF THE ORIGINAL
WORK IS WITH YOU. This DISCLAIMER OF WARRANTY constitutes an essential
part of this License. No license to Original Work is granted hereunder
except under this disclaimer.
8) Limitation of Liability. Under no circumstances and under no legal
theory, whether in tort (including negligence), contract, or otherwise,
shall the Licensor be liable to any person for any direct, indirect,
special, incidental, or consequential damages of any character arising
as a result of this License or the use of the Original Work including,
without limitation, damages for loss of goodwill, work stoppage, computer
failure or malfunction, or any and all other commercial damages or
losses. This limitation of liability shall not apply to liability for
death or personal injury resulting from Licensor's negligence to the
extent applicable law prohibits such limitation. Some jurisdictions do
not allow the exclusion or limitation of incidental or consequential
damages, so this exclusion and limitation may not apply to You.
9) Acceptance and Termination. If You distribute copies of the Original
Work or a Derivative Work, You must make a reasonable effort under the
circumstances to obtain the express assent of recipients to the terms of
this License. Nothing else but this License (or another written agreement
between Licensor and You) grants You permission to create Derivative Works
based upon the Original Work or to exercise any of the rights granted
in Section 1 herein, and any attempt to do so except under the terms of
this License (or another written agreement between Licensor and You)
is expressly prohibited by U.S. copyright law, the equivalent laws of
other countries, and by international treaty. Therefore, by exercising
any of the rights granted to You in Section 1 herein, You indicate Your
acceptance of this License and all of its terms and conditions. This
License shall terminate immediately and you may no longer exercise any
of the rights granted to You by this License upon Your failure to honor
the proviso in Section 1(c) herein.
10) Termination for Patent Action. This License shall terminate
automatically and You may no longer exercise any of the rights granted
to You by this License as of the date You commence an action, including
a cross-claim or counterclaim, against Licensor or any licensee alleging
that the Original Work infringes a patent. This termination provision
shall not apply for an action alleging patent infringement by combinations
of the Original Work with other software or hardware.
11) Jurisdiction, Venue and Governing Law. Any action or suit relating to this License may be brought only in the courts of a jurisdiction wherein the Licensor resides or in which Licensor conducts its primary business, and under the laws of that jurisdiction excluding its conflict-of-law provisions. The application of the United Nations Convention on Contracts for the International Sale of Goods is expressly excluded. Any use of the Original Work outside the scope of this License or after its termination shall be subject to the requirements and penalties of the U.S. Copyright Act, 17 U.S.C. § 101 et seq., the equivalent laws of other countries, and international treaty. This section shall survive the termination of this License.
12) Attorneys Fees. In any action to enforce the terms of this License or
seeking damages relating thereto, the prevailing party shall be entitled
to recover its costs and expenses, including, without limitation,
reasonable attorneys' fees and costs incurred in connection with such
action, including any appeal of such action. This section shall survive
the termination of this License.
13) Miscellaneous. This License represents the complete agreement
concerning the subject matter hereof. If any provision of this License
is held to be unenforceable, such provision shall be reformed only to
the extent necessary to make it enforceable.
14) Definition of "You" in This License. "You" throughout this License,
whether in upper or lower case, means an individual or a legal entity
exercising rights under, and complying with all of the terms of, this
License. For legal entities, "You" includes any entity that controls,
is controlled by, or is under common control with you. For purposes of
this definition, "control" means (i) the power, direct or indirect, to
cause the direction or management of such entity, whether by contract
or otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
15) Right to Use. You may use the Original Work in all ways not otherwise
restricted or conditioned by this License or by law, and Licensor promises
not to interfere with or be responsible for such uses by You.
This license is Copyright (C) 2003-2004 Lawrence E. Rosen. All rights
reserved. Permission is hereby granted to copy and distribute this
license without modification. This license may not be modified without
the express written permission of its copyright owner.
================================================
FILE: README.md
================================================

[](https://zenodo.org/badge/latestdoi/412126708) [](https://anaconda.org/bioconda/earlgrey) [](
https://anaconda.org/bioconda/earlgrey) [](https://anaconda.org/bioconda/earlgrey) [](https://anaconda.org/bioconda/earlgrey)
# Earl Grey
Earl Grey is a full-automated transposable element (TE) annotation pipeline, leveraging the most widely-used tools and combining these with a consensus elongation process to better define _de novo_ consensus sequences when annotating new genome assemblies.
# Contents
[Changes in Latest Release](#changes-in-latest-release)
[Example Run](#example)
[References and Acknowledgements](#references-and-acknowledgements)
[Usage Without Installation](#usage-without-installation)
[Recommended Installation](#recommended-installation-with-conda-or-mamba)
[Docker Container](#docker-container)
<!-- toc -->
# Important Considerations
Earl Grey version 6 uses Dfam 3.9. After installation, you MUST configure Dfam partitions as needed. Earl Grey will generate the script to do this and provide guidance when you run it for the first time. You need to specify which partitions of Dfam and/or RepBase to configure Earl Grey with. Choose partitions carefully as the combination will highly influence your results, especially if you want to pre-mask your input genome. Please make use of issues and discussions tabs if you have questions about this, we are always happy to help!
# Notes / Updates
We often get questions related to runtime. TE curation and annotation remains resource and time intensive. Fast is not necessarily better, and runtime is highly dependent on genome size, complexity, and repeat content. Runs will likely take longer than you might expect, and be very RAM-hungry. As some generic benchmarks, a 40Mb genome can take anywhere from a few hours to a day, 400Mb up to around 4-5 days, a 3Gb genome ~a week, and a 25Gb genome several weeks! Things will be running even if it doesn't look like they are. Each step checkpoints, so if you have server limits, you can resubmit the same script with the same parameters, and Earl Grey will skip completed steps. `TEstrainer` and the final `divergence calculator` use a lot of memory. Check carefully for OOM errors in the logs! As a rule of thumb, you need _at least_ 3GB of RAM _per thread_, with more being better. Therefore, 16 threads requires at least 48GB of RAM depending on repeat complexity of the input genome.
We have been made aware of some instability in repeat annotation percentages when high numbers of CPUs are employed in certain server environments. Please be sure to check logs carefully for instances of interruption. Known cases so far will show the following message:
```
OpenBLAS blas_thread_init: pthread_create failed for thread X of X: Resource temporarily unavailable
OpenBLAS blas_thread_init: ensure that your address space and process count limits are big enough (ulimit -a)
OpenBLAS blas_thread_init: or set a smaller OPENBLAS_NUM_THREADS to fit into what you have available
```
If you see this message, re-run analysis with less threads. Alternatively, you can modify your instance of the TEstrainer script `initial_mafft_setup.py` to add the following after `import os`:
```
os.environ['OPENBLAS_NUM_THREADS'] = '1'
```
# Changes in Latest Release
Earl Grey v7.2.4 improves pipeline robustness and RepeatModeler handling for small or fragmented genomes:
- **Proactive RepeatModeler sampling cap** (`earlGrey`, `earlGreyLibConstruct`): the `deNovo1` function previously used a nested retry loop that ran RepeatModeler up to three times, falling back to progressively smaller `-genomeSampleSizeMax` values after each failure. It is now replaced by a single proactive run. The samplable genome size (sum of contigs ≥ 40 kb, the threshold below which RepeatModeler discards contigs) is computed upfront using `awk`, and the appropriate `-genomeSampleSizeMax` cap is derived from the cumulative RECON round thresholds (none for ≥ 363 Mb; 81 Mb for ≥ 120 Mb; 27 Mb for ≥ 39 Mb; 9 Mb for ≥ 12 Mb; 3 Mb for < 12 Mb). RepeatModeler runs exactly once with the correct flag, eliminating wasted compute on small or fragmented genomes.
- **`set -eo pipefail`** added to `earlGrey`, `earlGreyLibConstruct`, and `earlGreyAnnotationOnly`: any command that fails mid-pipeline now immediately aborts execution, preventing downstream stages from running with missing or corrupt intermediate files.
- **`expr` replaced with `$(( ))`** throughout all three scripts: `expr N / 4` returns exit code 1 when the result is 0 (i.e. fewer than 4 threads requested), which would abort the pipeline under `set -e` before any tool was invoked. All thread-division expressions now use bash arithmetic `$(( ProcNum / 4 ))`.
- **`ls | head -n 1` pipelines guarded with `|| true`**: under `pipefail`, `ls` receives SIGPIPE (exit 141) when `head` exits after reading one line. All affected pipelines now append `|| true` to suppress this expected signal.
### Previous Changes
Earl Grey v7.2.3 fixes five bugs reported by users:
- **Off-by-one chunking in divergence_calc.py** ([#290](https://github.com/TobyBaril/EarlGrey/issues/290)): The previous floor-division chunking could produce one extra chunk when the row count was not evenly divisible by the thread count, causing the final chunk to run serially after all parallel workers had finished (effectively doubling wall-clock time in the worst case). Chunking now uses `np.array_split`, which always produces exactly `num_processes` chunks and distributes any remainder one row at a time.
- **OSError: AF_UNIX path too long in divergence_calc.py** ([#294](https://github.com/TobyBaril/EarlGrey/issues/294)): `pybedtools.set_tempdir()` sets Python's `tempfile.tempdir` globally. The `forkserver` start method creates its Unix domain socket under `tempfile.tempdir`, so deeply-nested output paths could exceed the 108-character AF_UNIX socket limit. The divergence calculator is now passed a short per-species path in `/tmp` via `-tmp`, keeping the socket path well within the limit regardless of output directory depth.
- **Crash when no nested TEs are found** ([#292](https://github.com/TobyBaril/EarlGrey/issues/292)): `filteringOverlappingRepeats.R` crashed with `object 'nested' not found` on genomes with low TE content where no fully-nested elements were detected. When the nesting-detection loop exits without finding any nesting, `bind_rows()` produces a zero-row data frame that lacks the `nested` and `nesting_round` columns, causing the downstream `mutate()` to fail. The fix guards the mutate block with an `nrow()` check — if no nested TEs are found the unnested output is written directly.
- **Temp file path errors in divergence_calc.py under Singularity and other environments** ([#289](https://github.com/TobyBaril/EarlGrey/issues/289)): All path constructions in `divergence_calc.py` used string concatenation (e.g. `temp_dir + "/qseqs"`), which produces double slashes when `temp_dir` already has a trailing slash. This is particularly common in Singularity where bound paths are often passed with trailing slashes. All 13 affected path constructions have been replaced with `os.path.join()`, which correctly handles trailing slashes and redundant separators in all environments.
- **TEstrainer parallel progress bar cluttering batch job logs** ([#293](https://github.com/TobyBaril/EarlGrey/issues/293)): A new `-q yes` flag on `earlGrey` (and `-q` on `TEstrainer_for_earlGrey.sh` directly) suppresses the GNU `parallel --bar` progress bar. This is particularly useful for `sbatch`/HPC log files where the animated bar produces thousands of redundant lines.
### Previous Changes
Earl Grey v7.2.1 patches a bug where the curated library directory was not created when an existing library was supplied via `-l` (without `-r`). In this case, `earlGreyAnnotationOnly` could fail when attempting to change into the directory during the final masking step. The directory is now created with `mkdir -p` before use, matching the fix already applied to the main `earlGrey` script.
### Older Changes
Earl Grey v7.2.0 significantly reduces peak RAM usage in two RAM-intensive components: TEstrainer and divergence_calc.py. These changes prevent OOM kills when running with large thread counts on memory-constrained compute nodes, with no change to output.
**TEstrainer / TEstrainer_for_earlGrey.sh:**
- The default `MEM_FREE` threshold raised from `200M` to `1G`. The previous value was lower than the startup cost of a single Python interpreter with heavy scientific libraries, making the guard ineffective.
- All GNU `parallel` calls in the BEAT curation loop (trf, initial_mafft_setup, mafft, TEtrim) now carry `--memfree ${MEM_FREE}`, throttling job dispatch when free RAM drops below the threshold.
- A RAM-cap guard is applied at startup: the requested thread count is capped based on available RAM (`free -m`) at an estimate of 800 MB per concurrent job, with a warning printed if a cap is applied.
**divergence_calc.py:**
- Switched from the default `fork` multiprocessing start method to `forkserver`. On Linux, `fork` duplicates the full parent address space (including the GFF DataFrame) into every worker; `forkserver` workers start clean and receive only a file path, eliminating N-fold GFF copies in RAM.
- GFF chunks are now serialised to temp TSV files on disk before the pool is created. The parent DataFrame is freed before any workers are launched, reducing parent RSS during the pool run.
- `pool.imap_unordered` replaces `pool.map`, allowing workers to be retired as they finish rather than all buffering results simultaneously.
- `maxtasksperchild=1` forces worker process restart after each chunk, releasing accumulated pybedtools handles and BioPython caches between chunks.
- Periodic `pybedtools.cleanup()` every 500 rows prevents temp-file accumulation in long-running workers.
- A pre-existing bug was fixed: `os.remove(query_path)` was called unconditionally in both cleanup branches even when the file was never created (e.g. when `pybedtools.sequence()` raised a samtools exception). Both removal calls are now guarded with `if exists(query_path)`.
Benchmarked results confirm peak RSS for `divergence_calc.py` is flat at ~76 MB across 1–16 threads (previously scaling linearly), and TEstrainer peak RSS is ~877 MB at both 4 and 32 threads (previously unbounded with thread count).
Earl Grey v7.1.1 patches a small bug in TEstrainer where consensus sequences comprised of tandem repeats were triggering a warning due to the change to `pandas >2.0`. The output results were not affected. The codebase has now been updated to handle these cases without throwing warnings, with the same expected behaviour and handling of tandem repeats as before.
Earl Grey v7.1.0 removes the dependency on Python 3.9, which is no longer supported. Earl Grey can now be built and run with Python 3.10 and above. This change was made to ensure that Earl Grey remains compatible with newer versions of Python and to remove the reliance on an unsupported version.
Earl Grey v7.0.3 fixes an issue with final calculation tables which did not count _other_ repeats towards total repeat content.
Earl Grey v7.0.2 adds RepeatLandscapes for _Penelope_-like elements and SINEs. Importantly, the `-norna` option in RepeatMasker is no longer invoked as default behaviour, which will improver the detection and masking of small tRNA-derived SINEs.
Earl Grey v7.0.1 patches the summary table generation, where LINEs and Penelopes were being counted in both categories for nested repeats only.
☕ Earl Grey v7.0.0 is here!
Some long-awaited changes in this release—thank you for your patience while I found the time to properly test and implement them.
🐞 RepeatCraft fixes
First, I’ve fixed an issue in RepeatCraft where distal TEs could be grouped erroneously. This occurred in rare edge cases where the internal counter failed to iterate when new distal copies of an existing TE were detected on the same contig. This should now be fully resolved.
🧬 Fully nested TE handling (major update)
I’ve also completely revamped the Earl Grey post-processing and summary steps to properly deal with Fully Nested TEs.
Partially overlapping TEs are handled as before, following the approach described in the original Earl Grey paper.
Fully nested TEs are now identified using a new iterative process:
The GFF is scanned for nested TEs
These are labelled and stored in a separate file
The nested TE is removed and the GFF is re-scanned to detect deeper (multi-level) nesting
This continues until no new nested TEs are found
📊 Changes to coverage and summaries (important)
Nested TEs are no longer included in the TE coverage calculation used to generate the pie chart in the `summaryFiles` directory.
Instead:
Summary tables now include new categories showing how many base pairs are comprised of nested TEs
These base pairs are not counted toward Total Interspersed Repeats, as doing so would result in double-counting genomic space.
⚠️ This represents a substantial change from previous versions, so please be aware of this difference when upgrading to v7.
The output table summary now has the following format:
```
|TE Classification | Coverage (bp)|Copy Number | % Genome Coverage| Genome Size| TE Family Count|
|:-------------------------------------------------|-------------:|:-------------|-----------------:|-----------:|---------------:|
|DNA | 80886|326 | 0.7607795| 10631990| 326|
|DNA-nested | 1449|29 | 0.0136287| 10631990| 29|
|Rolling Circle | 8022|31 | 0.0754515| 10631990| 31|
|Rolling Circle-nested | 0|0 | 0.0000000| 10631990| NA|
|Penelope | 112822|812 | 1.0611560| 10631990| 812|
|Penelope-nested | 1268|13 | 0.0119263| 10631990| 13|
|LINE | 132133|426 | 1.2427871| 10631990| 426|
|LINE-nested | 1268|13 | 0.0119263| 10631990| 4|
|SINE | 0|0 | 0.0000000| 10631990| NA|
|SINE-nested | 0|0 | 0.0000000| 10631990| NA|
|LTR | 0|0 | 0.0000000| 10631990| NA|
|LTR-nested | 0|0 | 0.0000000| 10631990| NA|
|Other (Simple Repeat, Microsatellite, RNA) | 189999|4241 | 1.7870502| 10631990| 4241|
|Other (Simple Repeat, Microsatellite, RNA)-nested | 1225|34 | 0.0115218| 10631990| 34|
|Unclassified | 274026|1291 | 2.5773726| 10631990| 1291|
|Unclassified-nested | 0|0 | 0.0000000| 10631990| NA|
|Total Interspersed Repeat | 607889|2886 | 5.7175468| 10631990| NA|
|Non-Repeat | 10024101|notApplicable | 94.2824532| 10631990| NA|
```
In the final GFF output, nested TEs are clearly labelled with a `nested=FULLY_NESTED` attribute in column 9, enabling quick identification and downstream filtering.
```
NC_045808.1 Earl_Grey Simple_repeat 51840 51871 14 + . ID=(CGCA)N_33;Name=(CGCA)N;tstart=1;tend=31;shortte=F;nested=FULLY_NESTED-ROUND1
```
As always, thank you to the TE community for your enthusiasm in using Earl Grey, and for your invaluable feedback and bug reports. I’ll continue to incorporate improvements and fixes as quickly—and carefully—as possible.
Happy New Year! 🎉
Earl Grey v6.3.6 patches a RepeatCraft bug that arises extremely rarely in specific genomes, linked to dictionary initialisation.
Earl Grey v6.3.5 patches the annotation only pipeline to use the correct divergence calculator when a custom library is used without a RepBase term.
Earl Grey v6.3.4 adds small patches to correct TE family count table, which was previously using ID rather than Name. Also, a new awk one-liner has been added to clean the final GFF to make all attributes compliant with genometools, geneious etc.
Earl Grey v6.3.3 adds small updates to improve user-friendliness when installing and running for the first time.
Earl Grey v6.3.0 makes some important changes. Firstly, RepeatMasker now runs without checking for IS element contamination `-no_is`. In final GFF files, each insertion is now given a unique ID with `ID=`. The TE family is designated with `Name=`. This should enable parsing with Geneious or other visualisers. Other script changes are related to this change in field designation.
Earl Grey v6.2.0 now includes better checkpoints for TEstrainer. In the event of an interrupted run, resubmit your `earlGrey` command with _exactly_ the same parameters and Earl Grey will skip successfully completed steps. With this new update, TEstrainer intermediate files will no longer be deleted and a new run started from scratch. Now, TEstrainer can be recovered mid-run to save time and resources.
Earl Grey v6.1.1 patches a bug where threads set to <4 caused TEstrainer to crash (only present in 6.1.0 and not earlier versions)
Earl Grey v6.1.0 reintroduces the `--curated` flag when known elements are used to pre-mask your input genome. As usual, unless a good quality TE library already exists for your species of interest, or a _very_ closely related one, we do not recommend pre-masking your input genome with known repeats. This reduces the amount of information available for _de novo_ annotation, and can lead to overestimations of TE divergence and lower-quality consensus sequences. If in doubt, leave this option out!
Earl Grey v6.0.3 reduces CPU usage for TEstrainer to reduce memory pressure.
Earl Grey v6.0.2 patches an issue where the use of existing libraries did not work with the new `famdb` formats.
Earl Grey v6.0.1 contains small bug fixes to verify installed RepeatMasker Libraries correctly. There is now a Docker container for Earl Grey v6.0.1 that contains all partitions of Dfam (so it is BIG!).
*Earl Grey v6.0.0 is here!*
There are some relatively large changes in this release, resulting in the jump to v6.0.0.
Importantly, Earl Grey has been updated to use *Dfam version 3.9*, with RepeatMasker 4.1.8 and famdb 2.0.1. This means that there is some extra configuration required to get the pipeline running. Upon first installation and running of Earl Grey, the pipeline will check whether RepeatMasker has been configured with the correct Dfam database partitions. If not, it will warn you, generate a script that you can modify and run to configure RepeatMasker, and provide instructions to `stdout` if you want to do this yourself.
Please take care to configure Earl Grey v6 with ALL required partitions. More information on the partitioning can be found at [Dfam.org](https://dfam.org/releases/current/families/FamDB/README.txt).
Earl Grey v5.1.1 will continue to work for those who are happy with Dfam v3.7, but we recommend upgrading to v6.0.0 to keep up to date with the latest improvements to the database.
Earl Grey v5.1.1 contains very small patches to improve compatibility with publicly available genome sequencing data. In rare instances, strange characters in fasta headers were causing issues preventing the pipeline from running. These have been resolved in the preparation step.
In addition, new _pretty_ tables are now generated in the `summaryFiles` directory and at the end of a successful run. These contain the same information as the `txt` tables, but in the familiar pipe format for readability. These can be added to markdown files if required. One is produced for the high level count as well as for the family level count. Below is an example of the table that is printed at the end of Earl Grey runs as of `v5.1.1`:
```
|TE Classification | Coverage (bp)| Copy Number| % Genome Coverage| Genome Size| TE Family Count|
|:------------------------------------------|-------------:|-----------:|-----------------:|-----------:|---------------:|
|DNA | 69284| 190| 0.6516560| 10631990| 9|
|Rolling Circle | 71788| 316| 0.6752076| 10631990| 12|
|Penelope | 139334| 869| 1.3105167| 10631990| 6|
|LINE | 141841| 491| 1.3340964| 10631990| 22|
|SINE | 44137| 191| 0.4151339| 10631990| 1|
|Other (Simple Repeat, Microsatellite, RNA) | 191645| 4235| 1.8025318| 10631990| 940|
|Unclassified | 155836| 905| 1.4657275| 10631990| 35|
|Non-Repeat | 9818125| NA| 92.3451301| 10631990| NA|
```
Earl Grey v5.1.0 contains small changes that drastically improve memory usage in the divergence calculator. We have replaced the use of EMBOSS `water` with EMBOSS `matcher`, which reduces memory consumption on large alignments whilst remaining rigorous. For more information, please see the notes section in the [EMBOSS Manual](https://www.bioinformatics.nl/cgi-bin/emboss/help/matcher). This should prevent jobs running out of memory, particularly when using queuing systems and shared resources.
Big changes in the latest release!
*Earl Grey v5.0.0 is here!*
This release incorporates the incremental improvements made throughout the life of version 4.
It is now possible to run some subroutines in Earl Grey (run either of these new commands with `-h` to see a list of options):
- `earlGreyLibConstruct` can be used to run Earl Grey for _de novo_ TE detection, consensus generation, and improvement through the BEAT process. The output will be the strained TE consensus sequences, which can then be used for subsequent annotation. This is useful when you want to make a combined library from the libraries of several different genomes, where it is no longer required to waste time running annotations. Once the libraries are generated and you have curated them, you can then run the next step in isolation (next point!).
- `earlGreyAnnotationOnly` can be used to run the final annotation and defragmentation steps in Earl Grey. This is useful if you have already run the BEAT process and have a library of _de novo_ TE consensus sequences that you would like to use to annotate a given genome. This script is also compatible with the `-r` flag to take known repeats from the databases used to configure RepeatMasker in addition to the custom repeat library.
- *EXPERIMENTAL FEATURE:* I have also added an option to run [HELIANO](https://github.com/Zhenlisme/heliano) for improved detection of Helitrons, which are notoriously difficult to detect and classify using homology methods. This can be implemented by adding `-e yes` to the command line options after upgrading to v5.0.0. Currently, HELIANO annotations replace those which they overlap following the RepeatMasker run, which is performed during defragmentation (in a similar way to full-length LTRs being dealt with in `RepeatCraft`). Feedback is welcomed on this implementation, and I am continuing to test and improve the implementation of HELIANO within Earl Grey.
- The settings used for HELIANO are: `--nearest -dn 6000 -flank_sim 0.5 -w 10000`. These can be modified in the earlGrey script of your specific installation.
Thank you for your continued support and enthusiasm for Earl Grey!
# Example
Given an input genome, Earl Grey will run through numerous steps to identify, curate, and annotate transposable elements (TEs). We recommend running earlGrey within a tmux or screen session, so that you can log off and leave Earl Grey running.
There are several required and optional parameters for your Earl Grey run:
```
Required Parameters:
-g == genome.fasta
-s == species name
-o == output directory
Optional Parameters:
-t == Number of Threads (DO NOT specify more than are available)
-r == RepeatMasker search term (e.g arthropoda/eukarya)
-l == Starting consensus library for an initial mask (in fasta format)
-i == Number of Iterations to BLAST, Extract, Extend (Default: 10)
-f == Number flanking basepairs to extract (Default: 1000)
-c == Cluster TE library to reduce redundancy? (yes/no, Default: no)
-m == Remove putative spurious TE annotations <100bp? (yes/no, Default: no)
-d == Create soft-masked genome at the end? (yes/no, Default: no)
-n == Max number of sequences used to generate consensus sequences (Default: 20)
-a == minimum number of sequences required to build a consensus sequence (Default: 3)
-e == Optional: Run HELIANO for detection of Helitrons (yes/no, Default: no)
-h == Show help
```
```
# remember to activate the conda environment before running (NOTE: depending on your install, the environment name might vary)
conda activate earlgrey
# run earl grey with minimum command options
earlGrey -g [genome.fasta] -s [speciesName] -o [outputDirectory]
# e.g
earlGrey -g myzusPersicae.fasta -s myzusPersicae -o ./earlGreyOutputs
```
Following this, Earl Grey will run through several processes depending on the options selected. The general pipeline is illustrated below:

For a more in-depth description of Earl Grey's steps, please refer to the implementation section in the [manuscript](https://academic.oup.com/mbe/article/41/4/msae068/7635926).
The runtime of Earl Grey will depend on the repeat content of your input genome. Once finished, you will notice that a number of directories have been created by Earl Grey. The most important results are found within the "summaryFiles" folder, however intermediate results are kept in case you wish to use alignments for further manual curation or investigation, for example. NOTE: RepeatModeler2 remains the rate-limiting step, with runtimes leading into several days, or even weeks, with large repeat-rich genomes. This is normal.
Directories created by earl grey:
```
[speciesName]EarlGrey/
|
|---[speciesName]_RepeatMasker/ [OPTIONAL: only created with -r or -l flags]
| + Results of the initial RepeatMasker run used to pre-mask previously characterised TEs
|---[speciesName]_Database/
| + BLAST database built from the (optionally pre-masked) genome, used by RepeatModeler
| + RepeatModeler2 consensus output ([speciesName]-families.fa, [speciesName]-families.stk, [speciesName]-rmod.log)
|---[speciesName]_RepeatModeler/
| + RepeatModeler2 working directory (RM_* subdirectory with per-round outputs and intermediate consensi)
|---[speciesName]_strainer/
| + TEstrainer working directory (TS_* subdirectory) containing BEAT curation intermediate files
| + Strained consensus library ([speciesName]-families.fa.strained)
|---[speciesName]_Curated_Library/ [OPTIONAL: only created with -r or -l flags]
| + Known repeat library extracted from the configured database ([speciesName].RepeatMasker.lib)
| + Combined _de novo_ and known repeat library ([speciesName]_combined_library.fasta)
|---[speciesName]_heliano/ [OPTIONAL: only created with -e yes flag]
| + HELIANO Helitron detection results (HEL_* subdirectory)
|---[speciesName]_RepeatMasker_Against_Custom_Library/
| + Results of RepeatMasker run against the final curated repeat library (.out, .tbl, .cat.gz, .masked)
|---[speciesName]_RepeatLandscape/
| + Divergence-annotated GFF ([speciesName].filteredRepeats.withDivergence.gff)
| + Repeat landscape PDFs (classification landscape, split-class landscape, superfamily divergence plot)
| + Divergence summary table ([speciesName]_summary_table.tsv)
|---[speciesName]_mergedRepeats/
| + looseMerge/ subdirectory containing:
| - LTR_FINDER full-length LTR results (ltrfinder GFF and working subdirectory)
| - RepeatCraft TE defragmentation output (RepeatCraft working subdirectory, merged GFFs and BEDs)
| - Final filtered and merged repeat annotations (filteredRepeats.bed, filteredRepeats.gff, filteredRepeats.summary)
|---[speciesName]_summaryFiles/
+ Final TE annotations in GFF3 format ([speciesName].filteredRepeats.gff)
(fully nested TEs labelled with 'nested=FULLY_NESTED' attribute in column 9)
+ Final TE annotations in BED format ([speciesName].filteredRepeats.bed)
+ High-level TE quantification table (tab-delimited .txt and markdown-formatted .kable)
+ Family-level TE quantification table (tab-delimited .txt and markdown-formatted .kable)
+ Divergence summary table ([speciesName]_divergence_summary_table.tsv)
+ Pie chart of genome repeat content ([speciesName].summaryPie.pdf)
+ Repeat landscape PDFs (classification landscape, split-class landscape, superfamily divergence plot)
+ _de novo_ strained TE library ([speciesName]-families.fa.strained)
+ Combined repeat library ([speciesName]_combined_library.fasta) [OPTIONAL: only with -r or -l flags]
+ Soft-masked genome FASTA ([speciesName].softmasked.fasta) [OPTIONAL: only with --softMask yes]
```
As of `v4.4.5`, there is an option to generate _de novo_ TE libraries without running subsequent annotation. To run this option, use `earlGreyLibConstruct` with the same command-line options as `earlGrey`. This will run everything up to the end of TEstrainer, and leave you a `families-fa.strained` library file in the `summaryFiles` directory, which you can then use for manual curation, or for pangenome studies.
### Example Outputs (NOTE: example data has been used here):
- Pie chart summarising TE content in input genome

- RepeatLandscapes summarising relative TE activity using Kimura 2-Parameter Divergence (recent activity towards the RHS)
<img width="849" alt="Screenshot 2024-08-12 at 13 55 06" src="https://github.com/user-attachments/assets/5e1b18b6-a84a-47e9-bc6d-6f341a2297ec">
- TE annotations - These are in standard genomic information formats to be compatible with downstream analyses. NOTE: TE divergence calculated using Kimura 2-Parameter distance is now supplied for each insertion in column 9 of the GFF3 file:
```
# BED format
NC45808.1 4964941 4965925 LINE/Penelope 5073 +
NC45808.1 7291353 7291525 LINE/L2 1279 +
NC_045808.1 8922477 8923791 DNA/TcMar-Tc1 11957 +
# GFF3 format
scaffold_1 Earl_Grey DNA/Mariner 71618 71814 892 + . TSTART=13;TEND=233;ID=RND-1_FAMILY-1;SHORTTE=F;KIMURA80=0.2141
scaffold_1 Earl_Grey Unknown 81757 81927 785 - . TSTART=16;TEND=194;ID=RND-1_FAMILY-0;SHORTTE=F;KIMURA80=0.2037
```
# References and Acknowledgements
This pipeline has been designed to be used and shared openly by the community.
### When using Earl Grey, please cite:
Baril, T., Galbraith, J.G., and Hayward, A., Earl Grey: A Fully Automated User-Friendly Transposable Element Annotation and Analysis Pipeline, Molecular Biology and Evolution, Volume 41, Issue 4, April 2024, msae068 [doi:10.1093/molbev/msae068](https://doi.org/10.1093/molbev/msae068)
Baril, Tobias., Galbraith, James., and Hayward, Alexander. (2023) Earl Grey. Zenodo [doi:10.5281/zenodo.5654615](https://doi.org/10.5281/zenodo.5654615)
### This pipeline makes use of scripts from:
[RepeatCraft](https://github.com/niccw/repeatcraftp) - Wong WY, Simakov O. RepeatCraft: a meta-pipeline for repetitive element de-fragmentation and annotation. Bioinformatics 2018;35:1051–2. https://doi.org/10.1093/bioinformatics/bty745.
### The following open source software are utilised by this pipeline:
Smit AFA, Hubley RR, Green PR. RepeatMasker Open-4.0. Http://RepeatmaskerOrg 2013.
Flynn, J.M., Hubley, R., Goubert, C., Rosen, J., Clark, A.G., Feschotte, C. and Smit, A.F. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 2020;17;9451-9457. https://doi.org/10.1073/pnas.1921046117
Bao Z, Eddy SR. Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes. Genome Res 2002;12:1269–76. https://doi.org/10.1101/gr.88502.
Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics 2005;21:i351–8.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: Architecture and applications. BMC Bioinformatics 2009;10: 1–9. https://doi.org/10.1186/1471-2105-10-421.
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol 2013;30:772–80. https://doi.org/10.1093/molbev/mst010.
Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 2009;25:1972–3. https://doi.org/10.1093/bioinformatics/btp348.
Rice P, Longden L, Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet 2000;16:276–7. https://doi.org/10.1016/S0168-9525(00)02024-2.
Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 2007;35:W265–8. https://doi.org/10.1093/nar/gkm286.
Ou S, Jiang N. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. BioRxiv 2019:2–6.
# Usage without installation
If you would like to try Earl Grey, or prefer to use it in a browser, you can do this through [gitpod](https://gitpod.io). You can get 50 hours free per month and use Earl Grey within a preconfigured environment. Simply select this repository by pasting the repository URL, and the system will be automatically configured for you. You can then upload your genome of interest, and run Earl Grey as you would on the command line.
<img width="1919" alt="Screenshot 2023-09-29 at 13 38 43" src="https://github.com/TobyBaril/EarlGrey/assets/46785187/7dd2f2de-3c13-4553-b13a-007fdd8d94d6">
# Recommended Installation with Conda or Mamba
Earl Grey version 6 uses Dfam 3.9. After installation, you MUST configure Dfam partitions as needed. Earl Grey will generate the script to do this and provide guidance when you run it for the first time. You need to specify which partitions of Dfam and/or RepBase to configure Earl Grey with. Choose partitions carefully as the combination will highly influence your results, especially if you want to pre-mask your input genome.
Earl Grey version 7.2.4 (latest stable release) with all required and configured dependencies is found in the `biooconda` conda channel. To install, simply run the following depending on your installation:
```
# With conda
conda create -n earlgrey -c conda-forge -c bioconda earlgrey=7.2.4
# With mamba
mamba create -n earlgrey -c conda-forge -c bioconda earlgrey=7.2.4
# Then run
earlGrey
# a script will be output to stdout and generated in the current directory to aid in setup
```
# Recommended Installation on ARM-based Mac Systems (M chips) Using Docker
It is possible to install x86-based Docker environments on M-chip Mac systems using Rosetta. The below form a guide that should work, but please reach out if you have any trouble!
Please first follow the Docker on Mac installation [instructions here](https://docs.docker.com/desktop/setup/install/mac-install/). Ensure you have installed Rosetta2, as this is required to get Earl Grey to behave as expected.
Next, we will create aliases to switch between arm and rosetta:
We want a simple way to activate the x86 architecture. We can do this by adding the following to `~/.zshrc`:
```
alias arm="env /usr/bin/arch -arm64 /bin/zsh --login"
alias intel="env /usr/bin/arch -x86_64 /bin/zsh --login"
```
Then close the terminal and open a new one.
To activate the intel/rosetta environment, run the following in a new terminal:
```
intel
```
Next, get the Docker installation and run in an interactive terminal following the instructions in the next section below. You will only need to pull the container once, then can use it for all your Earl Grey needs.
After this, you are ready to go! Just remember to activate the _intel_ terminal before starting the interactive container and running Earl Grey.
# Docker Container
A Docker container has been generated with none of Dfam 3.9, but with script generation to source required partitions
I try to keep an up-to-date container in docker hub, but this might not always be the case depending on if I have had time to build and upload a new image. Currently, the recommended image ready for use is `-nodfam` version. Upon running the container interactively and running the command `earlGrey`, instructions will print to `stdout` and a script that you can use will be placed in your current working directory. After an initial setup and configuration in an interative version of the container, you can commit the changes (i.e. the Dfam configuration) using `docker commit [container_ID] yourdockerusername/earlgrey:version7.2.4-configured`. Then, you can run this container interactively, or non-interatively, to annotate focal genomes.
```
# Interactive mode
# Version 7.2.4 with no preconfigured partitions (RECOMMENDED!) - bind a directory, in my case the current directory using pwd
docker run -it -v 'pwd':/data/ tobybaril/earlgrey:latest-nodfam
# change to library directory
cd /data/
# run earlGrey to make the configuration script
earlGrey
# then alter script with required partitions and run the configuration script
# change 0-16 to whichever you require, but at least 0. This relates to the partitions of Dfam 3.9 (https://www.dfam.org/releases/Dfam_3.9/families/FamDB/)
##### e.g. for 0-5:
sed -i '/^curl/ s/0-16/0-5/g' configure_dfam39.sh
##### e.g for 1,3,5:
sed -i '/^curl/ s/0-16/1,3,5/g' configure_dfam39.sh
# run the configuration script
bash configure_dfam39.sh
# return to your data directory
cd /data/
# to quit the container and leave it running so you can commit the configured changes
# ctrl + p; ctrl + q
# get container ID to commit (not necessarily required if you leave the container running. If you kill the container and start again, you would need to reconfigure the DFAM libraries, so this is recommended. However, the container will be BIG depending on which parts of DFAM you pull.
docker ps -a
# commit the modified container so you can use at will (replace yourdockerusername with your docker username)
docker commit [container_ID] yourdockerusername/earlgrey:version7.2.4-configured
# you can then run non-interatively if required:
docker run -v 'pwd':/data/ yourdockerusername/earlgrey:version7.2.4-configured earlGrey -g /data/GENOME.fasta -s nonInteractiveTest -o /data/ -t 8
# alternatively you can still run interactive sessions
docker run -it -v 'pwd':/data/ yourdockerusername/earlgrey:version7.2.4-configured
```
================================================
FILE: Singularity/README.md
================================================
# Singularity image for Earl Grey
A Singularity image can be built from the Docker image hosted on Docker Hub. The image contains no pre-configured Dfam partitions, but will generate a configuration script on first run of `earlGrey`. You will need to configure the Dfam libraries once, then you can use the configured sandbox for all subsequent analyses.
Wherever you start this is where your `/data` directory will point to inside the container. Make sure your genome assembly is found here, or change `$(pwd)` in the below commands to the directory where you want to perform your analysis.
**Write access is required inside the container to download and configure the Dfam partitions.** The recommended approach is to build a writable sandbox, configure Dfam, and then optionally convert the sandbox to a portable SIF file.
## First-time setup: building and configuring the sandbox
```bash
# Build a writable sandbox from the Docker image
singularity build --sandbox earlgrey_sandbox/ docker://tobybaril/earlgrey:latest-nodfam
# Run the sandbox interactively with write access, binding your working directory to /data
singularity shell -C --bind $(pwd):/data --writable earlgrey_sandbox/
# Inside the container: change to the data directory and run earlGrey to generate the configuration script
cd /data/
earlGrey
# Modify the configuration script to select the required Dfam 3.9 partitions
# (https://www.dfam.org/releases/Dfam_3.9/families/FamDB/)
# At minimum, partition 0 is required. Change 0-16 to whichever partitions you need.
##### e.g. for partitions 0-5:
sed -i '/^curl/ s/0-16/0-5/g' configure_dfam39.sh
##### e.g. for specific partitions 1, 3, and 5:
sed -i '/^curl/ s/0-16/1,3,5/g' configure_dfam39.sh
# Run the configuration script to download and install the selected Dfam partitions
bash configure_dfam39.sh
# Return to your data directory
cd /data/
# Exit the sandbox
exit
```
## Running Earl Grey after configuration
```bash
# Run the configured sandbox interactively
singularity shell -C --bind $(pwd):/data --writable earlgrey_sandbox/
# Or run non-interactively
singularity exec -C --bind $(pwd):/data --writable earlgrey_sandbox/ earlGrey -g /data/GENOME.fasta -s mySpecies -o /data/ -t 8
```
## Optional: convert the configured sandbox to a portable SIF
Once Dfam is configured, you can convert the sandbox to a read-only SIF file for easier distribution and deployment. Note that the resulting SIF will be large depending on which Dfam partitions were downloaded.
```bash
singularity build earlgrey_configured.sif earlgrey_sandbox/
# Run non-interactively with the configured SIF
singularity exec -C --bind $(pwd):/data earlgrey_configured.sif earlGrey -g /data/GENOME.fasta -s mySpecies -o /data/ -t 8
```
================================================
FILE: conda/build.sh
================================================
#!/bin/bash
#Based on https://github.com/TobyBaril/EarlGrey/blob/main/configure
set -x
# Define paths
PACKAGE_HOME=${PREFIX}/share/${PKG_NAME}-${PKG_VERSION}-${PKG_BUILDNUM}
SCRIPT_DIR="${PACKAGE_HOME}/scripts/"
# Create directories
mkdir -p ${PREFIX}/bin
mkdir -p ${PACKAGE_HOME}
# Put package in share directory
cp -rf * ${PACKAGE_HOME}/
# Install SA-SSR (has to be done here because SA-SSR is an ancient repository without releases)
git clone --depth 1 https://github.com/TobyBaril/SA-SSR
cd SA-SSR
make -j"${CPU_COUNT}"
install -v -m 0755 bin/sa-ssr "${PREFIX}/bin"
cd ../ && rm -rf SA-SSR/
# Fixes to earlGrey executable
sed -i.bak "/CONDA_DEFAULT_ENV/,+4d" ${PACKAGE_HOME}/earlGrey #remove check that conda environment has a specific name
sed -i.bak "/CONDA_DEFAULT_ENV/,+4d" ${PACKAGE_HOME}/earlGreyLibConstruct #remove check that conda environment has a specific name
sed -i.bak "/CONDA_DEFAULT_ENV/,+4d" ${PACKAGE_HOME}/earlGreyAnnotationOnly #remove check that conda environment has a specific name
# Fixes sed command for executables so that it works on both linux and macos
sed -i.bak "s|sed -i |sed -i.bak |g" ${PACKAGE_HOME}/earlGrey ${PACKAGE_HOME}/earlGreyLibConstruct ${PACKAGE_HOME}/earlGreyAnnotationOnly ${SCRIPT_DIR}/rcMergeRepeat* ${SCRIPT_DIR}/TEstrainer/TEstrainer_for_earlGrey.sh
# Remove -pa from RepeatClassifier
sed -i.bak 's/RepeatClassifier -pa ${THREADS} /RepeatClassifier /' ${SCRIPT_DIR}/TEstrainer/TEstrainer
# Remove -t parameter from sa-ssr (since multithreading doesn't work on OSX)
sed -i.bak 's/-t ${THREADS} / /' ${SCRIPT_DIR}/TEstrainer/TEstrainer_for_earlGrey.sh
sed -i.bak 's/-t ${THREADS} / /' ${SCRIPT_DIR}/TEstrainer/TEstrainer
# Add SCRIPT_DIR to correct path
sed -i.bak "s|SCRIPT_DIR=.*|SCRIPT_DIR=${SCRIPT_DIR}|g" ${PACKAGE_HOME}/earlGrey
sed -i.bak "s|SCRIPT_DIR=.*|SCRIPT_DIR=${SCRIPT_DIR}|g" ${PACKAGE_HOME}/earlGreyLibConstruct
sed -i.bak "s|SCRIPT_DIR=.*|SCRIPT_DIR=${SCRIPT_DIR}|g" ${PACKAGE_HOME}/earlGreyAnnotationOnly
sed -i.bak "s|SCRIPT_DIR=.*|SCRIPT_DIR=${SCRIPT_DIR}|g" ${SCRIPT_DIR}/rcMergeRepeat*
sed -i.bak "s|SCRIPT_DIR=.*|SCRIPT_DIR=${SCRIPT_DIR}|g" ${SCRIPT_DIR}/headSwap.sh
sed -i.bak "s|SCRIPT_DIR=.*|SCRIPT_DIR=${SCRIPT_DIR}|g" ${SCRIPT_DIR}/autoPie.sh
sed -i.bak "s|STRAIN_SCRIPTS=.*|STRAIN_SCRIPTS=${SCRIPT_DIR}/TEstrainer/scripts/|g" ${SCRIPT_DIR}/TEstrainer/TEstrainer_for_earlGrey.sh
# Set permissions to files
chmod +x ${PACKAGE_HOME}/earlGrey
chmod +x ${PACKAGE_HOME}/earlGreyLibConstruct
chmod +x ${PACKAGE_HOME}/earlGreyAnnotationOnly
chmod +x ${SCRIPT_DIR}/TEstrainer/TEstrainer_for_earlGrey.sh
chmod +x ${SCRIPT_DIR}/* > /dev/null 2>&1
chmod +x ${SCRIPT_DIR}/bin/LTR_FINDER.x86_64-1.0.7/ltr_finder
chmod a+w ${SCRIPT_DIR}/repeatCraft/example
# Extract tRNAdb
tar -zxf ${SCRIPT_DIR}/bin/LTR_FINDER.x86_64-1.0.7/tRNAdb.tar.gz --directory ${SCRIPT_DIR}/bin/LTR_FINDER.x86_64-1.0.7 && rm -r ${SCRIPT_DIR}/bin/LTR_FINDER.x86_64-1.0.7/tRNAdb.tar.gz
# test for conda
df -h
# Set PERL5LIB upon activate/deactivate
for CHANGE in "activate" "deactivate";
do
mkdir -p "${PREFIX}/etc/conda/${CHANGE}.d"
done
echo "#!/bin/sh" > "${PREFIX}/etc/conda/activate.d/${PKG_NAME}_activate.sh"
echo "export PERL5LIB=${PREFIX}/share/RepeatMasker/:${PREFIX}/share/RepeatModeler/" >> "${PREFIX}/etc/conda/activate.d/${PKG_NAME}_activate.sh"
echo "#!/bin/sh" > "${PREFIX}/etc/conda/deactivate.d/${PKG_NAME}_deactivate.sh"
echo "unset PERL5LIB" >> "${PREFIX}/etc/conda/deactivate.d/${PKG_NAME}_deactivate.sh"
# Put earlGrey executable in bin
cd ${PREFIX}/bin
ln -sf ${PACKAGE_HOME}/earlGrey .
ln -sf ${PACKAGE_HOME}/earlGreyLibConstruct .
ln -sf ${PACKAGE_HOME}/earlGreyAnnotationOnly .
================================================
FILE: conda/developmentNotes.md
================================================
# Updating Earl Grey to remove Python 3.9 dependency
Bjorn wants me to update Earl Grey to remove the Python 3.9 dependency. This is because Python 3.9 is no longer supported and we want to ensure that Earl Grey can run on newer versions of Python.
I will build Earl Grey with the meta dependency removed and test it to ensure that it still works correctly. I will also update the documentation to reflect the change in dependencies.
Build the conda package:
```bash
conda build .
```
Make a new conda environment and install the package:
```bash
conda create -n earlgrey_python_test -c conda-forge -c bioconda earlgrey --use-local
conda activate earlgrey_python_test
```
This still builds with 3.9. I will force it to build with 3.10 by adding `python >=3.10` to the `run` dependencies in `meta.yaml`. After making this change, I will rebuild the package and test it again. I also removed the dependency on `ncls` version 0.0.64 and replaced it with `ncls` without a specific version to allow for newer versions of `ncls` to be used.
```bash
conda build .
conda create -n earlgrey_python_test -c conda-forge -c bioconda earlgrey --use-local
conda activate earlgrey_python_test
```
Link the dfam libraries and configure RepeatMasker:
```bash
ln -sf ln -sf /data/toby/tools/earlgrey_databases/Libraries/famdb/* \
/data/toby/miniforge3/envs/earlgrey_python_test/share/RepeatMasker/Libraries/famdb/
cd /data/toby/miniforge3/envs/earlgrey_python_test/share/RepeatMasker
perl configure
```
Run a test command to ensure that Earl Grey is working correctly:
```bash
cd /data/toby/testDIR/
earlGrey -g /data/toby/tools/test.fasta -s test_python312 -o . -t 16
```
This passed with a small fix to the `backSwap.py` script where I changed the regular expression for splitting the input file from `'\s+'` to `r'\s+'` to ensure that it correctly handles whitespace. After making this change, I will test the script again to ensure that it works correctly with the new dependencies.
```bash
cd /data/toby/testDIR/
earlGrey -g /data/toby/testDIR/saturationTests/1_4genomesNoRepMasker/1A5.fa -s test_python312_2 -o . -t 32
```
---
# Reducing peak RAM usage in TEstrainer and divergence_calc.py
## Background
Two components were identified as disproportionately RAM-intensive when run with large thread counts, leading to OOM (out-of-memory) kills on compute nodes with limited RAM:
- **TEstrainer** / **TEstrainer_for_earlGrey.sh** — the GNU parallel jobs dispatched during the BEET curation loop (trf, initial_mafft_setup.py, mafft, TEtrim.py) had no memory back-pressure. With 32–64 threads all jobs fired simultaneously, each spawning a Python interpreter that imports `pandas`, `numpy`, `pyranges` and `BioPython` (~300–600 MB each at startup), along with concurrent `--localpair` mafft alignments.
- **divergence_calc.py** — used `multiprocessing.Pool` with the default `fork` start method. On Linux this fork-copies the entire parent address space (including the full GFF DataFrame, which can be multiple GB for large genomes) into every worker process. At 32 cores this means 32 nearly-identical copies of the GFF in RAM simultaneously, even though each worker only needs its small chunk.
## Changes made
### `scripts/TEstrainer/TEstrainer` and `scripts/TEstrainer/TEstrainer_for_earlGrey.sh`
**1. Default `MEM_FREE` raised from `200M` to `1G`**
The previous default of 200 MB was smaller than the memory footprint of a single Python interpreter after importing the heavy scientific libraries used by `initial_mafft_setup.py` and `TEtrim.py`. This meant `--memfree 200M` gave essentially no throttling protection. 1 GB is a conservative lower bound that reflects real startup costs.
**2. RAM-cap guard added after argument parsing**
A block querying `/proc/meminfo` via `free -m` is inserted immediately after `MAFFT_THREADS` is calculated. If the requested thread count exceeds what is safely supportable (at an estimate of 800 MB per concurrent job), `THREADS` (and `MAFFT_THREADS`) are silently capped with a warning message to stderr. The 800 MB estimate is conservative; real usage per job is closer to 300–500 MB for typical inputs but spikes during mafft's `--localpair` DP phase.
```bash
AVAIL_MEM_MB=$(free -m | awk '/^Mem:/{print $7}')
MEM_PER_JOB_MB=800
MAX_SAFE_THREADS=$(( AVAIL_MEM_MB / MEM_PER_JOB_MB ))
if [[ $MAX_SAFE_THREADS -gt 0 && $THREADS -gt $MAX_SAFE_THREADS ]]; then
echo "Warning: capping threads from ${THREADS} to ${MAX_SAFE_THREADS} based on available RAM (${AVAIL_MEM_MB} MB free)"
THREADS=$MAX_SAFE_THREADS
if [[ $THREADS -gt 4 ]]; then MAFFT_THREADS=$(($(($THREADS / 4)))); else MAFFT_THREADS=1; fi
fi
```
**3. `--memfree ${MEM_FREE}` added to all GNU parallel calls that were missing it**
In the standalone `TEstrainer`, the four `parallel` calls inside the BEET curation loop (trf, initial_mafft_setup, mafft, TEtrim) were missing `--memfree`. In `TEstrainer_for_earlGrey.sh`, the mafft and TEtrim calls were missing it (the others already had it). GNU parallel's `--memfree` halts job dispatch when free system RAM drops below the threshold, queuing further jobs until a running job finishes and memory is reclaimed. This is the most direct mechanism for preventing simultaneous over-subscription.
### `scripts/divergenceCalc/divergence_calc.py`
**1. `forkserver` multiprocessing start method**
```python
multiprocessing.set_start_method('forkserver')
```
On Linux the default is `fork`, which duplicates the entire parent address space into every worker via `clone()`. With a multi-GB GFF DataFrame in memory this creates N near-identical copies. `forkserver` instead spawns a minimal server process at startup; each worker is forked from that clean server and receives only the data it actually needs (the chunk file path string) via IPC. This is the primary RAM reduction mechanism.
**2. GFF DataFrame serialised to per-chunk TSV files before pool creation**
Previously the chunk DataFrames were passed to `pool.map()` directly — they were pickled in the parent and unpickled in each worker. With `forkserver` this already avoids the fork-copy problem, but the parent still holds the full `in_gff` DataFrame in memory throughout the entire pool run. The new approach serialises each chunk to a temp TSV on disk, then deletes `in_gff` and `chunks` from the parent before the pool is even created:
```python
chunk_files = []
for i, chunk in enumerate(chunks):
chunk_path = os.path.join(args.temp_dir, f"chunk_{i}.tsv")
chunk.to_csv(chunk_path, sep="\t", index=True)
chunk_files.append(chunk_path)
del chunks
del in_gff
```
Workers then receive a file path string (not a DataFrame), read the chunk at the start of `outer_func`, and immediately delete the file. Parent peak RAM during the pool run is reduced to `simple_gff + other_gff + Python overhead` rather than `full_gff + simple_gff + other_gff`.
**3. `outer_func` signature updated and `pybedtools.set_tempdir` moved inside worker**
The first argument to `outer_func` is now the chunk file path. `pybedtools.set_tempdir()` is now called at the top of `outer_func` rather than only in the parent. This is required for `forkserver` workers, which do not inherit the parent's in-memory pybedtools state.
**4. Periodic `pybedtools.cleanup()` inside `outer_func`**
```python
if row_counter % 500 == 0:
pybedtools.cleanup(remove_all=False)
```
Each row creates a pybedtools temp file. In a long-running worker processing thousands of rows, these accumulate in the temp directory and inflate virtual memory. Calling `cleanup(remove_all=False)` every 500 rows removes files that are no longer referenced by any active Python object, releasing both disk and virtual-memory pressure without affecting open handles.
**5. `imap_unordered` instead of `pool.map`**
```python
results = list(pool.imap_unordered(func, chunk_files))
```
`pool.map` buffers all results in memory until every worker finishes. `imap_unordered` yields results as each worker completes. Since `outer_func` returns only a file path string this has negligible direct memory impact, but it allows the Python runtime to close completed worker processes and release their memory sooner when combined with `maxtasksperchild`.
**6. `maxtasksperchild=1`**
```python
pool = multiprocessing.Pool(processes=num_processes, maxtasksperchild=1)
```
Forces each worker process to exit and be replaced after processing one chunk. This ensures that any memory accumulated during a chunk run (pybedtools internal state, BioPython caches, subprocess residuals) is released to the OS at chunk boundaries rather than accumulating across the lifetime of the pool.
---
## Tests
The following tests should be run to verify correctness and confirm the memory ceiling has been reduced. Run them from the root of the EarlGrey repo unless stated otherwise.
### Dependencies for memory monitoring
```bash
# Install memory_profiler if not already present
pip install memory_profiler psutil
# /usr/bin/time -v is available on Linux via the 'time' package
which /usr/bin/time || sudo apt-get install -y time
```
---
### Test 1 — Static audit: verify `--memfree` presence in all parallel calls
Confirms no `parallel` call in either TEstrainer script is dispatching jobs without a memory guard.
```bash
echo ""
echo "=== TEstrainer_for_earlGrey.sh ==="
grep -n 'parallel' scripts/TEstrainer/TEstrainer_for_earlGrey.sh | grep -v '#'
```
**Expected:** Every line containing `parallel --bar` should also contain `--memfree`.
---
### Test 2 — Static audit: verify MEM_FREE defaults
```bash
grep 'MEM_FREE=' scripts/TEstrainer/TEstrainer_for_earlGrey.sh
```
**Expected:** Both lines should read `MEM_FREE="1G"`.
---
### Test 3 — Static audit: verify forkserver and chunk-file pattern in divergence_calc.py
```bash
grep -n 'set_start_method\|chunk_path\|chunk_files\|imap_unordered\|maxtasksperchild\|pybedtools.cleanup' \
scripts/divergenceCalc/divergence_calc.py
```
**Expected output should contain all of:**
- `set_start_method('forkserver')`
- `chunk_path` (argument to `outer_func`)
- `chunk_files` (list built in `__main__`)
- `imap_unordered`
- `maxtasksperchild=1`
- `pybedtools.cleanup`
---
### Test 4 — RAM-cap guard: unit test on the bash logic
```bash
# Simulate a machine with 4 GB available and a 64-thread request.
# The guard should cap THREADS to (4096 / 800) = 5.
bash -c '
THREADS=64
AVAIL_MEM_MB=4096
MEM_PER_JOB_MB=800
MAX_SAFE_THREADS=$(( AVAIL_MEM_MB / MEM_PER_JOB_MB ))
if [[ $MAX_SAFE_THREADS -gt 0 && $THREADS -gt $MAX_SAFE_THREADS ]]; then
echo "CAPPED: THREADS set to ${MAX_SAFE_THREADS}"
THREADS=$MAX_SAFE_THREADS
else
echo "NOT CAPPED: THREADS remains ${THREADS}"
fi
echo "Final THREADS: ${THREADS}"
'
```
**Expected:**
```
CAPPED: THREADS set to 5
Final THREADS: 5
```
---
### Test 5 — Correctness: divergence_calc.py output matches reference
Generate a reference output using the old code path on a small test GFF, then confirm the new code produces identical output. A pre-existing completed EarlGrey run in `testDIR` can supply the required files.
```bash
# Adjust paths to a completed EarlGrey run with a .gff and matching library
TESTDIR=/data/toby/testDIR/condaPull
GENOME=/data/toby/testDIR/test.fasta
GFF=/data/toby/testDIR/condaPull/genome1_EarlGrey/genome1_mergedRepeats/looseMerge/genome1.filteredRepeats.gff
LIB=/data/toby/testDIR/condaPull/genome1_EarlGrey/genome1_strainer/genome1-families.fa.strained
# Run with the updated script
python scripts/divergenceCalc/divergence_calc.py \
-l "${LIB}" \
-i "${GFF}" \
-g "${GENOME}" \
-o /tmp/divergence_new.gff \
-tmp /tmp/divtest_new/ \
-t 4
echo "Exit code: $?"
echo "Output lines: $(wc -l < /tmp/divergence_new.gff)"
```
**Expected:** Zero exit code, output line count matches the input GFF line count, all lines end with `;KIMURA80=`.
```bash
# Verify all non-simple-repeat lines carry a KIMURA80 tag
grep -v 'Simple_repeat\|Satellite\|Low_complexity' /tmp/divergence_new.gff | \
grep -v 'KIMURA80' | wc -l
```
**Expected:** 0 for a pure RepeatMasker/Earl Grey GFF. Annotations from other tools (e.g. HELIANO, which uses a different `tool` column value) are passed through unmodified by `parse_gff()` and will not carry a `KIMURA80` tag — one such line per non-RM/EG annotation is acceptable and expected.
---
### Test 6 — Memory ceiling: peak RSS comparison for divergence_calc.py
Measure peak RSS with different thread counts to confirm sub-linear scaling (the old `fork` behaviour would show near-linear scaling of the peak).
```bash
TESTDIR=/data/toby/testDIR/condaPull
GENOME=/data/toby/testDIR/test.fasta
GFF=/data/toby/testDIR/condaPull/genome1_EarlGrey/genome1_mergedRepeats/looseMerge/genome1.filteredRepeats.gff
LIB=/data/toby/testDIR/condaPull/genome1_EarlGrey/genome1_strainer/genome1-families.fa.strained
for T in 1 4 8 16; do
rm -rf /tmp/divmem_t${T}
echo -n "Threads=${T}: "
/usr/bin/time -v python scripts/divergenceCalc/divergence_calc.py \
-l "${LIB}" -i "${GFF}" -g "${GENOME}" \
-o /tmp/divmem_t${T}.gff -tmp /tmp/divmem_t${T}/ -t ${T} \
2>&1 | grep 'Maximum resident'
done
```
**Expected:** Peak RSS should not scale linearly with thread count. With the old `fork` code, doubling the thread count roughly doubled peak RSS. With the new code, peak RSS should plateau once per-worker overheads are comparable to the chunk size — a significant reduction at 8+ threads on large GFFs.
As a rough guide, if the single-thread peak RSS is R MB, you should aim to see the 16-thread run use no more than ~2–3× R (library imports + 16 small chunks), compared to potentially 16× R with the old fork-based approach.
**Observed results (test genome, condaPull/genome1):**
| Threads | Peak RSS (kB) |
|---------|--------------|
| 1 | 75,992 |
| 4 | 75,940 |
| 8 | 75,920 |
| 16 | 76,092 |
Peak RSS is effectively flat across all thread counts (~76 MB). This confirms that `forkserver` workers start from a clean process and do not inherit the parent GFF DataFrame — the old `fork` implementation would have shown peak RSS scaling approximately linearly with thread count (potentially ~16× higher at 16 threads for a large GFF).
---
### Test 7 — Memory ceiling: peak RSS comparison for TEstrainer
```bash
TESTLIB=/data/toby/testDIR/condaPull/genome1_EarlGrey/genome1_strainer/genome1-families.fa.strained # use a small library for speed
GENOME=/data/toby/testDIR/test.fasta
# Baseline: 4 threads
/usr/bin/time -v bash scripts/TEstrainer/TEstrainer_for_earlGrey.sh \
-l ${TESTLIB} -g ${GENOME} -t 4 -r 1 -d /tmp/ts_mem_t4 \
2>&1 | grep 'Maximum resident'
# Higher thread count: confirm MEM_FREE throttles correctly
/usr/bin/time -v bash scripts/TEstrainer/TEstrainer_for_earlGrey.sh \
-l ${TESTLIB} -g ${GENOME} -t 32 -r 1 -d /tmp/ts_mem_t32 \
2>&1 | grep 'Maximum resident'
```
**Expected:** The 32-thread run should emit a `Warning: capping threads` message if available RAM is insufficient, and peak RSS should not grow proportionally with the requested thread count.
**Observed results (test genome, condaPull/genome1 library):**
| Threads | Peak RSS (kB) |
|---------|--------------|
| 4 | 897,860 |
| 32 | 891,420 |
Peak RSS is effectively identical between 4 and 32 threads (~877 MB). Without `--memfree` throttling, the 32-thread run would dispatch all jobs simultaneously and peak RSS would scale with the number of concurrent Python interpreter startups. The flat result confirms that GNU parallel is correctly queuing jobs when free RAM drops below the `MEM_FREE` threshold, keeping concurrent memory usage stable regardless of the requested thread count.
---
### Test 8 — `--memfree` throttling behaviour
Verify that GNU parallel actually pauses job dispatch when free RAM is low, rather than launching all jobs at once.
```bash
# Install stress-ng to consume a controllable amount of RAM
which stress-ng || sudo apt-get install -y stress-ng
# Consume ~80% of available RAM, then run TEstrainer with many threads.
# parallel should pause on subsequent jobs when memfree drops below MEM_FREE.
AVAIL=$(free -m | awk '/^Mem:/{print $7}')
CONSUME_MB=$(( AVAIL * 80 / 100 ))
stress-ng --vm 1 --vm-bytes ${CONSUME_MB}M --vm-keep &
STRESS_PID=$!
sleep 3
# Now attempt to run parallel with --memfree 1G and many small jobs.
# Observe that parallel waits rather than dispatching all at once.
seq 1 20 | parallel --jobs 8 --memfree 1G 'sleep 0.5 && echo job {} done'
kill $STRESS_PID
```
**Expected:** The 20 sleep jobs complete without error. If free RAM is below 1 GB after stress-ng starts, parallel will hold back some jobs rather than launching all 20 simultaneously, and the output will appear staggered as memory is reclaimed.
---
### Test 9 — Worker isolation: confirm forkserver workers do not inherit parent GFF
This Python script directly tests that a worker process does not see the parent's `in_gff` variable, confirming the forkserver isolation is working.
```python
# save as /tmp/test_forkserver_isolation.py and run with: python /tmp/test_forkserver_isolation.py
import multiprocessing
import os
import pandas as pd
def worker_mem_check(_):
"""Return RSS of this worker process in MB."""
import psutil
proc = psutil.Process(os.getpid())
return proc.memory_info().rss / 1024 / 1024
if __name__ == "__main__":
# Allocate a large DataFrame in the parent to simulate a big GFF.
big_df = pd.DataFrame({'a': range(5_000_000), 'b': range(5_000_000)})
parent_rss_mb = __import__('psutil').Process(os.getpid()).memory_info().rss / 1024 / 1024
print(f"Parent RSS after allocating large DataFrame: {parent_rss_mb:.0f} MB")
multiprocessing.set_start_method('forkserver')
with multiprocessing.Pool(processes=4) as pool:
worker_rss_values = pool.map(worker_mem_check, range(4))
for i, rss in enumerate(worker_rss_values):
print(f"Worker {i} RSS: {rss:.0f} MB")
print(f"\nParent RSS: {parent_rss_mb:.0f} MB")
print(f"Mean worker RSS: {sum(worker_rss_values)/len(worker_rss_values):.0f} MB")
print(f"Ratio (parent/worker): {parent_rss_mb / (sum(worker_rss_values)/len(worker_rss_values)):.1f}x")
```
**Expected:** Worker RSS values should be substantially smaller than the parent RSS (typically 5–20× smaller), confirming that workers do not inherit the large `big_df` from the parent. With the old `fork` start method, worker RSS would be approximately equal to parent RSS.
To contrast, change `set_start_method('forkserver')` to `set_start_method('fork')` and re-run — worker RSS values should roughly match the parent.
**Observed results:**
```
Parent RSS after allocating large DataFrame: 135 MB
Worker 0 RSS: 54 MB
Worker 1 RSS: 54 MB
Worker 2 RSS: 54 MB
Worker 3 RSS: 54 MB
Parent RSS: 135 MB
Mean worker RSS: 54 MB
Ratio (parent/worker): 2.5x
```
Workers consume 54 MB (Python + library imports only) versus 135 MB in the parent (which includes the 5M-row DataFrame). The workers carry no copy of the parent's data, confirming `forkserver` isolation is working correctly. With the old `fork` start method all four workers would each have shown ~135 MB.
---
### Test 10 — End-to-end regression: full EarlGrey run
A complete run to confirm that all components still produce correct outputs after the changes.
```bash
cd /data/toby/testDIR/
earlGrey -g test.fasta -s mem_regression_test -o . -t 16
```
**Expected:** Run completes without error. Output directory `mem_regression_test_EarlGrey/` is populated with the expected files including `*-families.fa`, `*.gff`, and `*.divergence.pdf`. Compare output against a previous known-good run to confirm no regressions in repeat library composition or divergence values.
**Observed:** Run completed without error (exit code 0). All memory-reduction changes are confirmed compatible with a full end-to-end EarlGrey pipeline run.
This also passed successfully, confirming that Earl Grey is now compatible with Python 3.10 and does not have a dependency on Python 3.9. I will now update the documentation to reflect these changes and ensure that users are aware of the new dependencies when installing Earl Grey.
---
# Version 7.2.3 — bug fixes, robustness improvements, and `-q` quiet-bar flag
## Background
Several bugs were found and fixed after the v7.2.0 memory-refinement release. The changes also add a new `-q` flag to `earlGrey` to suppress the GNU parallel progress bar, which is useful when running jobs through batch schedulers (SLURM `sbatch`, PBS, etc.) where the ANSI escape codes written by `--bar` corrupt log files.
## Changes made
### `conda/meta.yaml`
- Bumped version to `7.2.3`.
- Added `numpy` to the `run` dependencies. `divergence_calc.py` now imports `numpy` directly (for `np.array_split`), so it must be declared explicitly.
- Changed `source` from a remote tarball + SHA256 to `path: ..` to allow local conda builds during development without needing a published release tarball.
### `earlGrey` (main pipeline script)
**1. New `-q` flag to suppress the TEstrainer parallel progress bar**
```bash
earlGrey -g genome.fa -s myspecies -o outdir -t 16 -q yes
```
When `-q yes` is passed, `earlGrey` appends `-q` to the `TEstrainer_for_earlGrey.sh` invocation. This propagates to every GNU `parallel --bar` call inside TEstrainer, replacing `--bar` with nothing so the progress bar is omitted. Useful for `sbatch` jobs where `--bar` writes ANSI codes into log files.
**2. Fixed `AF_UNIX` socket path-too-long crash in `divergence_calc.py`**
`pybedtools.set_tempdir()` sets `tempfile.tempdir` globally, which is also used by `forkserver` as the base directory for its `AF_UNIX` socket. If the EarlGrey output path is deeply nested, the socket path can exceed the 108-character kernel limit, causing an `OSError: AF_UNIX path too long`. The fix creates a short per-run temp directory under `/tmp`:
```bash
divcalc_tmp="/tmp/egdiv_${species}"
mkdir -p "${divcalc_tmp}"
python .../divergence_calc.py ... -tmp "${divcalc_tmp}"
# ...
rm -rf "${divcalc_tmp}"
```
### `scripts/TEstrainer/TEstrainer_for_earlGrey.sh`
Added a `-q` flag that sets `BAR_FLAG=""` (default `BAR_FLAG="--bar"`). The variable is substituted into every `parallel` call so a single flag controls all progress bars in the script.
### `scripts/divergenceCalc/divergence_calc.py`
**1. Added `import numpy as np`** — used for `np.array_split` (see below).
**2. Replaced manual chunk splitting with `np.array_split`**
The old code computed `chunk_size = int(rows / num_processes)` and stepped through the DataFrame with a fixed stride. For inputs where `rows` is not evenly divisible by `num_processes` this could silently drop up to `num_processes - 1` rows. `np.array_split` distributes any remainder evenly:
```python
chunks = [in_gff.iloc[idx] for idx in np.array_split(range(len(in_gff)), num_processes)]
```
**3. Replaced all path string concatenation with `os.path.join`**
All occurrences of `temp_dir + "/subdir/" + name` have been replaced with `os.path.join(temp_dir, "subdir", name)`. This fixes `OSError`s that occurred when `args.temp_dir` was passed with a trailing slash (e.g. `/tmp/egdiv_myspecies/`), which caused double-slash paths like `//pybedtools` that could fail on some systems.
### `scripts/filteringOverlappingRepeats.R`
Fixed a crash that occurred when no nested TEs were found in the genome. The previous code unconditionally called `mutate()` and `bind_rows()` on `nested_only`, which threw an error when `nested_only` was an empty data frame. The fix wraps the nested-TE processing in an `if (nrow(nested_only) > 0)` guard and falls through to writing `unnested_only` directly when there are no nested elements:
```r
if (nrow(nested_only) > 0) {
nested_only <- nested_only %>%
mutate(attributes = paste0(attributes, ";nested=", nested, "-round", nesting_round)) %>%
select(-c(nested, nesting_round))
output.gff <- bind_rows(nested_only, unnested_only) %>% arrange(seqid, start)
} else {
output.gff <- unnested_only %>% arrange(seqid, start)
}
```
### `scripts/divergenceCalc/divergence_plot.R`
Fixed crashes in the per-superfamily Kimura divergence plots when a TE class (DNA, LINE, LTR, SINE, PLE, RC) is absent from the genome. The previous code built each plot unconditionally and then caught errors with `try(ggplot_build(...))`. The new approach checks `!is.null(...) && nrow(...) > 0` before constructing the ggplot object:
```r
if (!is.null(divergence_eg_tes_rounded_for_superfamily_plot[["DNA"]]) &&
nrow(divergence_eg_tes_rounded_for_superfamily_plot[["DNA"]]) > 0) {
kimura_superfamily_plot_1 <- ggplot(...) + ...
} else {
kimura_superfamily_plot_1 <- NULL
}
```
The axis-flipping logic was also updated to skip `NULL` plots rather than attempting to add `scale_x_continuous()` to them (which would previously crash).
## Build and test environment
Build the conda package from the local source tree:
```bash
cd /data/toby/EarlGrey/conda
conda build .
```
Create a fresh environment and install:
```bash
conda create -n earlgrey_723_test -c conda-forge -c bioconda earlgrey --use-local
conda activate earlgrey_723_test
```
Link the Dfam libraries and configure RepeatMasker:
```bash
ln -sf /data/toby/tools/earlgrey_databases/Libraries/famdb/* \
/data/toby/miniforge3/envs/earlgrey_723_test/share/RepeatMasker/Libraries/famdb/
cd /data/toby/miniforge3/envs/earlgrey_723_test/share/RepeatMasker
perl configure
```
### Static checks
```bash
cd /data/toby/EarlGrey
# 1. Confirm numpy is in meta.yaml
grep 'numpy' conda/meta.yaml
# 2. Confirm -q flag is wired up in earlGrey
grep -n 'quietBar\|BAR_FLAG\|\-q' earlGrey | head -20
# 3. Confirm BAR_FLAG is used in TEstrainer_for_earlGrey.sh
grep -n 'BAR_FLAG\|bar' scripts/TEstrainer/TEstrainer_for_earlGrey.sh | grep -v '#'
# 4. Confirm os.path.join used throughout divergence_calc.py (no string concat with temp_dir)
grep 'temp_dir *+' scripts/divergenceCalc/divergence_calc.py # should return nothing
# 5. Confirm np.array_split is used for chunking
grep 'array_split' scripts/divergenceCalc/divergence_calc.py
# 6. Confirm nrow guard in filteringOverlappingRepeats.R
grep -n 'nrow(nested_only)' scripts/filteringOverlappingRepeats.R
```
### Functional tests
```bash
cd /data/toby/testDIR/
# Basic run — confirm pipeline completes with -q yes
earlGrey -g test.fasta -s test_723_quiet -o . -t 16 -q yes
# Confirm no ANSI bar codes in the log (stdout should have no ESC sequences)
earlGrey -g test.fasta -s test_723_quiet2 -o . -t 16 -q yes 2>&1 | cat | grep -c $'\033' || echo "No ANSI codes — good"
# Confirm pipeline still works without -q (bar shown as before)
earlGrey -g test.fasta -s test_723_noq -o . -t 16
```
---
# Version 7.2.4 — pipeline error handling, samplable genome size for RepeatModeler, and bash safety
## Background
Two complementary robustness improvements were identified and implemented across the three main Earl Grey pipeline scripts (`earlGrey`, `earlGreyLibConstruct`, `earlGreyAnnotationOnly`):
1. The `deNovo1()` function in `earlGrey` and `earlGreyLibConstruct` used a deeply nested retry loop to work around RepeatModeler failures on small or fragmented genomes. This approach was reactive: RepeatModeler would run, fail, be retried with a smaller `-genomeSampleSizeMax`, fail again, and be retried once more — wasting potentially hours of compute and producing confusing log output. The ParTEA (pangenome) version of Earl Grey had already solved this more elegantly by pre-computing the samplable genome size and choosing the correct flag upfront.
2. All three scripts lacked `set -eo pipefail`, meaning that any command that exited non-zero in the middle of a function body (outside an explicit `if` check) would silently be ignored and execution would continue into the next stage with corrupt or missing intermediate files. Additionally, two common shell idioms in the scripts would produce false failures under strict error handling: `expr N / 4` (returns exit code 1 when the result is 0) and `ls ... | head -n 1` pipelines (SIGPIPE exit 141 from `ls` when `head` exits early under `pipefail`).
## Changes made
### `earlGrey` and `earlGreyLibConstruct` — `deNovo1()` redesign
The previous implementation:
```bash
deNovo1() {
RepeatModeler -threads ${ProcNum} -database ...
if [ ! -e ...-families.fa ]; then
echo "ERROR: Retrying with limit set as Round 5"
RepeatModeler ... -genomeSampleSizeMax 81000000
if [ ! -e ...-families.fa ]; then
echo "ERROR: Retrying with limit set as Round 4"
RepeatModeler ... -genomeSampleSizeMax 27000000
if [ ! -e ...-families.fa ]; then
echo "ERROR: RepeatModeler Failed" && exit 2
fi
fi
fi
}
```
This ran RepeatModeler up to three times, each time with a smaller sampling cap, without any principled basis for which cap to choose. It also only covered two fallback tiers (rounds 5 and 4), leaving genomes smaller than 12 Mb (rounds 1–2) unhandled.
The new implementation mirrors the approach used in ParTEA. Before invoking RepeatModeler, the samplable genome size is computed as the sum of all contigs ≥ 40 kb (the threshold below which RepeatModeler discards contigs during sampling). The appropriate `-genomeSampleSizeMax` cap is then derived from the cumulative RECON round thresholds and set in `SAMPLE_FLAG`. RepeatModeler is invoked exactly once with the correct flag, and a single post-run existence check is retained as a safety net.
**Cumulative RECON round thresholds:**
| Rounds supported | Cumulative threshold | Cap applied |
|-----------------|---------------------|-------------|
| All 6 rounds | ≥ 363 Mb (3+9+27+81+243) | none (default) |
| Rounds 1–5 | ≥ 120 Mb (3+9+27+81) | `-genomeSampleSizeMax 81000000` |
| Rounds 1–4 | ≥ 39 Mb (3+9+27) | `-genomeSampleSizeMax 27000000` |
| Rounds 1–3 | ≥ 12 Mb (3+9) | `-genomeSampleSizeMax 9000000` |
| Rounds 1–2 | < 12 Mb | `-genomeSampleSizeMax 3000000` |
The samplable genome size is computed using `awk` directly on `$genome` (the `.prep` file, which is already available at the `deNovo1` call site):
```bash
GENOME_SIZE=$(awk '/^>/{if(len>=40000)sum+=len; len=0; next}{len+=length($0)} END{if(len>=40000)sum+=len; print sum+0}' "${genome}")
```
### `earlGrey`, `earlGreyLibConstruct`, `earlGreyAnnotationOnly` — `set -eo pipefail`
`set -eo pipefail` was added immediately after `#!/bin/bash` in all three scripts.
- `-e`: exits the script immediately if any simple command returns a non-zero exit code and is not part of a condition test.
- `-o pipefail`: extends `-e` to pipelines — a pipeline fails if any stage within it returns non-zero, not just the last stage.
Together these ensure that a failure in any tool invocation within a function body (RepeatMasker, BuildDatabase, TEstrainer, rcMergeRepeats, R scripts, etc.) will abort the pipeline at the point of failure rather than silently continuing to the next stage.
### `earlGrey`, `earlGreyLibConstruct`, `earlGreyAnnotationOnly` — `expr` replaced with `$(( ))`
All occurrences of `rmthreads=$(expr $ProcNum / 4)` and `strainthreads=$(expr $ProcNum / 4)` were replaced with the equivalent `$(( ProcNum / 4 ))`. The `expr` utility returns exit code 1 when the arithmetic result is 0, which occurs when `ProcNum < 4`. Under `set -e` this would abort the script before any tool was invoked if the user specified fewer than 4 threads.
### `earlGrey`, `earlGreyLibConstruct`, `earlGreyAnnotationOnly` — `ls | head -n 1` pipeline fix
All `ls -td ... | head -n 1` pipeline expressions used to retrieve the most recently created directory (TEstrainer output directories, HELIANO output directories) were appended with `|| true`:
```bash
latestFile="$(realpath $(ls -td -- ${OUTDIR}/${species}_strainer/*/ | head -n 1) || true)/..."
```
Under `pipefail`, when `head` exits after reading one line, `ls` receives SIGPIPE (exit code 141), causing the pipeline to return non-zero. This is expected behaviour — `ls` has more output but `head` is done — and `|| true` suppresses this signal without masking genuine `ls` errors (which would produce a different non-zero exit and an error message to stderr).
## Static verification
```bash
cd /data/toby/EarlGrey
# 1. Confirm set -eo pipefail is present in all three scripts
grep -n 'set -eo pipefail' earlGrey earlGreyLibConstruct earlGreyAnnotationOnly
# 2. Confirm no remaining expr calls
grep -n 'expr' earlGrey earlGreyLibConstruct earlGreyAnnotationOnly
# 3. Confirm ls|head pipelines are guarded
grep -n 'head -n 1' earlGrey earlGreyLibConstruct earlGreyAnnotationOnly
# 4. Confirm GENOME_SIZE and SAMPLE_FLAG logic is present in deNovo1
grep -n 'GENOME_SIZE\|SAMPLE_FLAG' earlGrey earlGreyLibConstruct
```
**Expected for check 1:** Three lines, one per script, each showing `set -eo pipefail` at line 2.
**Expected for check 2:** No output (all `expr` calls replaced).
**Expected for check 3:** All matching lines should contain `|| true`.
**Expected for check 4:** Both scripts should show `GENOME_SIZE` and `SAMPLE_FLAG` variable assignments and the five-tier `if/elif/else` block inside `deNovo1`.
================================================
FILE: conda/meta.yaml
================================================
{% set name = "EarlGrey" %}
{% set version = "7.2.4" %}
package:
name: {{ name|lower }}
version: {{ version }}
source:
path: ..
build:
number: 0
skip: True # [osx]
run_exports:
- {{ pin_subpackage('earlgrey', max_pin="x") }}
requirements:
build:
- make
- {{ compiler('cxx') }}
run:
- python >=3.10
- hmmer
- trf
- cd-hit
- genometools-genometools
- numpy
- pandas
- ncls
- pyfaidx
- pyranges
- parallel
- repeatmasker >=4.2.3
- ltr_retriever
- mafft
- mreps
- ninja-nj
- repeatscout
- recon
- repeatmodeler >=2.0.4
- bioconductor-genomeinfodb
- bioconductor-genomeinfodbdata
- bioconductor-bsgenome
- bioconductor-plyranges
- r-ape
- r-optparse
- r-tidyverse
- r-plyr
- r-viridis
- r-cowplot
- r-ggtext
- r-data.table
- r-magrittr
- bedtools
- emboss
- pybedtools
- samtools
- heliano
- r-kableextra
test:
commands:
- df -h
- earlGrey -h
about:
home: "https://github.com/TobyBaril/EarlGrey"
dev_url: "https://github.com/TobyBaril/EarlGrey"
doc_url: "https://github.com/TobyBaril/EarlGrey/blob/v{{ version }}/README.md"
license: "OSL-2.1"
license_file: LICENSE
summary: "Earl Grey: A fully automated TE curation and annotation pipeline."
description: |
Earl Grey is a full-automated transposable element (TE) annotation pipeline,
leveraging the most widely-used tools and combining these with a consensus
elongation process (BEAT) to better define de novo consensus sequences when
annotating new genome assemblies.
extra:
identifiers:
- doi:10.1093/molbev/msae068
================================================
FILE: earlGrey
================================================
#!/bin/bash
set -eo pipefail
usage()
{
echo " #############################
earlGrey version 7.2.4
Required Parameters:
-g == genome.fasta
-s == species name
-o == output directory
Optional Parameters:
-t == Number of Threads (DO NOT specify more than are available)
-r == RepeatMasker search term (e.g arthropoda/eukarya)
-l == Starting consensus library for an initial mask (in fasta format)
-i == Number of Iterations to BLAST, Extract, Extend (Default: 10)
-f == Number flanking basepairs to extract (Default: 1000)
-c == Cluster TE library to reduce redundancy? (yes/no, Default: no)
-m == Remove putative spurious TE annotations <100bp? (yes/no, Default: no)
-d == Create soft-masked genome at the end? (yes/no, Default: no)
-n == Max number of sequences used to generate consensus sequences (Default: 20)
-a == minimum number of sequences required to build a consensus sequence (Default: 3)
-e == Run HELIANO as an optional step to detect Helitrons (yes/no, Default: no)
-q == Suppress TEstrainer parallel progress bar (yes/no, Default: no, useful for batch/sbatch jobs)
-h == Show help
Example Usage:
earlGrey -g bombyxMori.fasta -s bombyxMori -o /home/toby/bombyxMori/repeatAnnotation/ -t 16
Queries can be sent to:
tobias.baril[at]unine.ch
Please make use of the GitHub Issues and Discussion Tabs at: https://github.com/TobyBaril/EarlGrey
#############################"
}
# Subprocess Make Directories #
makeDirectory()
{
directory=$(realpath ${directory})
mkdir -p ${directory}/${species}_EarlGrey/ && OUTDIR=${directory}/${species}_EarlGrey
if [ ! -z "$RepSpec" ] || [ ! -z "$startCust" ] ; then
mkdir -p $OUTDIR/${species}_RepeatMasker/
fi
mkdir -p $OUTDIR/${species}_Database/
mkdir -p $OUTDIR/${species}_RepeatModeler/
mkdir -p $OUTDIR/${species}_strainer/
mkdir -p $OUTDIR/${species}_RepeatMasker_Against_Custom_Library/
mkdir -p $OUTDIR/${species}_RepeatLandscape/
mkdir -p $OUTDIR/${species}_mergedRepeats/
mkdir -p ${OUTDIR}/${species}_summaryFiles/
if [ ! -z "$heli" ]; then
mkdir -p ${OUTDIR}/${species}_heliano/
fi
}
# Subprocess PrepGenome #
prepGenome()
{
if [ ! -L ${genome} ]; then
genome=$(realpath ${genome})
fi
if [ -L $genome ]; then
genome=$(realpath -s ${genome})
fi
if [ ! -f ${genome}.prep ] || [ ! -f ${genome}.dict ]; then
cp ${genome} ${genome}.bak && gzip -f ${genome}.bak
sed '/>/ s/[[:space:]].*//g; /^$/d' ${genome} > ${genome}.tmp
${SCRIPT_DIR}/headSwap.sh -i ${genome}.tmp -o ${genome}.prep && rm ${genome}.tmp
mv ${genome}.tmp.dict ${genome}.dict
sed -i '/^>/! s/[DVHBPE]/N/g' ${genome}.prep
dict=${genome}.dict
genOrig=${genome}
genome=${genome}.prep
else
dict=${genome}.dict
genOrig=${genome}
genome=${genome}.prep
fi
}
# Subprocess getRepeatMaskerFasta
# Generate a copy of the RepeatMasker library subset used for the initial TE mask in FASTA format (only used if a RepeatMasker run is specified)
getRepeatMaskerFasta()
{
if [[ $RepSpec = *" "* ]]; then
echo "ERROR: You have entered a species name that contains a space, please use the NCBI TaxID rather than name. E.G In place of \"Homo sapiens\" use \"9606\""
exit 2
fi
if [[ $(which RepeatMasker) == *"bin"* ]]; then
libpath="$(which RepeatMasker | sed 's|bin/RepeatMasker|share/RepeatMasker/Libraries/famdb/|')"
else
libpath="$(which RepeatMasker | sed 's|/[^/]*$||g')/Libraries/famdb/"
fi
if [[ $(which RepeatMasker) == *"bin"* ]]; then
PATH=$PATH:"$(which RepeatMasker | sed 's|bin/RepeatMasker|share/RepeatMasker/|')"
fi
mkdir -p $OUTDIR/${species}_Curated_Library/
famdb.py -i $libpath families -f fasta_name --include-class-in-name -a -d --curated $RepSpec > ${OUTDIR}/${species}_Curated_Library/${RepSpec}.RepeatMasker.lib
RepSub=${OUTDIR}/${species}_Curated_Library/${RepSpec}.RepeatMasker.lib
}
# Subprocess firstMask
# Run the initial RepeatMasker run with the specified species subset (only used if a RepeatMasker run is specified)
firstMask()
{
cd ${OUTDIR}/${species}_RepeatMasker
rmthreads=$(( ProcNum / 4 ))
RepeatMasker -species $RepSpec -no_is -lcambig -s -a -pa $rmthreads -dir $OUTDIR/${species}_RepeatMasker $genome
if [ ! -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
echo "ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid species search term, if issue persists please use NCBI Taxids (E.G Drosophila is replaced with 7125)"; exit 2
fi
}
# Subprocess firstMaskCustomLib
# Run the initial RepeatMasker run with the specific fasta consensus library
firstMaskCustomLib()
{
cd ${OUTDIR}/${species}_RepeatMasker
rmthreads=$(( ProcNum / 4 ))
RepeatMasker -lib ${startCust} -no_is -lcambig -s -a -pa $rmthreads -dir $OUTDIR/${species}_RepeatMasker $genome
RepSub=${startCust}
if [ ! -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
echo "ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid custom consensus file, check the fasta file looks as expected"; exit 2
fi
}
# Subprocess buildDB
# if masked genome exists, build database from this. If not, build database from original input genome
buildDB()
{
if [ -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
cd ${OUTDIR}/${species}_Database
BuildDatabase -name ${species} ${OUTDIR}/${species}_RepeatMasker/*.masked
else
cd ${OUTDIR}/${species}_Database
BuildDatabase -name ${species} $genome
fi
}
# Subprocess deNovo1
# Run the initial de novo annotation run with RepeatModeler2.
# The genomeSampleSizeMax flag is set based on the samplable genome size (contigs >= 40 kb)
# to avoid RepeatModeler failures on small or fragmented genomes.
deNovo1()
{
cd ${OUTDIR}/${species}_RepeatModeler
# Compute samplable genome size: sum of contigs >= 40 kb only.
# RepeatModeler discards contigs shorter than 40 kb during sampling, so using
# the total genome size would overestimate the sequence available per round.
GENOME_SIZE=$(awk '/^>/{if(len>=40000)sum+=len; len=0; next}{len+=length($0)} END{if(len>=40000)sum+=len; print sum+0}' "${genome}")
echo "Samplable genome size (contigs >= 40 kb): ${GENOME_SIZE} bp"
# Set -genomeSampleSizeMax to the highest RECON round threshold the genome
# can support. Thresholds are cumulative across rounds (r2=3M, r3=9M, r4=27M,
# r5=81M, r6=243M), so the samplable size must cover all rounds up to the cap:
# >= 363M (3+9+27+81+243): no cap -> all 6 rounds
# >= 120M (3+9+27+81): cap 81M -> rounds 1-5
# >= 39M (3+9+27): cap 27M -> rounds 1-4
# >= 12M (3+9): cap 9M -> rounds 1-3
# < 12M: cap 3M -> rounds 1-2
SAMPLE_FLAG=""
if [ "${GENOME_SIZE}" -ge 363000000 ] 2>/dev/null; then
: # genome large enough for all rounds; use RepeatModeler default
elif [ "${GENOME_SIZE}" -ge 120000000 ] 2>/dev/null; then
echo "Capping at round 5 (-genomeSampleSizeMax 81000000)"
SAMPLE_FLAG="-genomeSampleSizeMax 81000000"
elif [ "${GENOME_SIZE}" -ge 39000000 ] 2>/dev/null; then
echo "Capping at round 4 (-genomeSampleSizeMax 27000000)"
SAMPLE_FLAG="-genomeSampleSizeMax 27000000"
elif [ "${GENOME_SIZE}" -ge 12000000 ] 2>/dev/null; then
echo "Capping at round 3 (-genomeSampleSizeMax 9000000)"
SAMPLE_FLAG="-genomeSampleSizeMax 9000000"
else
echo "Capping at round 2 (-genomeSampleSizeMax 3000000)"
SAMPLE_FLAG="-genomeSampleSizeMax 3000000"
fi
RepeatModeler -threads ${ProcNum} -database ${OUTDIR}/${species}_Database/${species} ${SAMPLE_FLAG}
if [ ! -e ${OUTDIR}/${species}_Database/${species}-families.fa ]; then
echo "ERROR: RepeatModeler Failed"
exit 2
fi
}
# Subprocess strainer
# contains the BLAST, Extract, Extend, Trim pipeline from James Galbraith
strainer()
{
cd ${OUTDIR}/${species}_strainer/
if [ "$ProcNum" -ge 4 ]; then
strainthreads=$(( ProcNum / 4 ))
else
strainthreads=$ProcNum
fi
${SCRIPT_DIR}/TEstrainer/TEstrainer_for_earlGrey.sh -g $genome -l ${OUTDIR}/${species}_Database/${species}-families.fa -t ${strainthreads} -f $Flank -r $num -n $no_seq -m $min_seq$([ "$quietBar" == "yes" ] && echo " -q")
latestFile="$(realpath $(ls -td -- ${OUTDIR}/${species}_strainer/*/ | head -n 1) || true)/${species}-families.fa.strained"
cp $latestFile ${OUTDIR}/${species}_strainer/
}
# Subprocess strainerResume
# resumes interrupted TEstrainer runs
strainerResume() {
cd ${OUTDIR}/${species}_strainer/
strainDataDir=$(find . -maxdepth 1 -type d -name "TS_${species}-families.fa_*")
if [ ! $(echo "$strainDataDir" | wc -l) -eq 1 ]; then
echo "ERROR: There are more TEstrainer directories than expected present"
exit 2
fi
if [ "$ProcNum" -ge 4 ]; then
strainthreads=$(( ProcNum / 4 ))
else
strainthreads=$ProcNum
fi
${SCRIPT_DIR}/TEstrainer/TEstrainer_for_earlGrey.sh -g $genome -l ${OUTDIR}/${species}_Database/${species}-families.fa -t ${strainthreads} -f $Flank -r $num -n $no_seq -m $min_seq$([ "$quietBar" == "yes" ] && echo " -q") -d ${strainDataDir} -R
latestFile="$(realpath $(ls -td -- ${OUTDIR}/${species}_strainer/*/ | head -n 1) || true)/${species}-families.fa.strained"
cp $latestFile ${OUTDIR}/${species}_strainer/
}
# Subprocess clust
# optional wicker clustering of TE library consensus sequences to reduce redundancy
clust()
{
cd ${OUTDIR}/${species}_strainer/
cd-hit-est -d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500 -r 1 -i $latestFile -o ${latestFile}.clstrd.fa
latestFile=$latestFile.clstrd.fa
}
# Subprocess novoMask
# Run RepeatMasker with the final processed library
novoMask()
{
if [ -z "$RepSpec" ] && [ -z "$startCust" ]; then
cd ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/
rmthreads=$(( ProcNum / 4 ))
RepeatMasker -lib $latestFile -no_is -lcambig -s -a -pa $rmthreads -dir ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/ $genome
else
mkdir -p $OUTDIR/${species}_Curated_Library/
cd ${OUTDIR}/${species}_Curated_Library/
cat $latestFile $RepSub > ${species}_combined_library.fasta
latestFile=$(realpath ${species}_combined_library.fasta)
cd ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/
rmthreads=$(( ProcNum / 4 ))
RepeatMasker -lib $latestFile -no_is -lcambig -s -a -pa $rmthreads -dir ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/ $genome
fi
}
# Subprocess heliano
# Run HELIANO as an optional step, then replace overlapping repeats in the merged output with HELIANO outputs
### TODO:
heliano_optional()
{
cd ${OUTDIR}/${species}_heliano/
heliano -g $genome --nearest -dn 6000 -flank_sim 0.5 -o ${OUTDIR}/${species}_heliano/HEL_${timestamp} -w 10000 -n $ProcNum
awk '{OFS="\t"}{print $1, "HELIANO", "RC/Helitron", $2+1, $3, $5, $6, ".", "ID="$9"_"$11";shortTE=F"}' ${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.bed > ${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.gff
helitron_gff=${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.gff
}
# Subprocess rcMergeRepeats
# Defragment repeat sequences to adjust for insertion times
mergeRep()
{
mkdir ${OUTDIR}/${species}_mergedRepeats/looseMerge
if [ -s "$helitron_gff" ]; then
echo "Running loose merge with HELIANO output"
${SCRIPT_DIR}/rcMergeRepeatsLoose -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/looseMerge -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin -e $helitron_gff
else
${SCRIPT_DIR}/rcMergeRepeatsLoose -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/looseMerge -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin
fi
if [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ]; then
awk '{OFS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, toupper($9)}' ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff > ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff.1 && mv ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff{.1,}
fi
if [ ! -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ]; then
echo "ERROR: loose merge defragmentation failed, trying strict merge..."
cd ${OUTDIR}/${species}_mergedRepeats/
if [ -s "$helitron_gff" ]; then
${SCRIPT_DIR}/rcMergeRepeats -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/ -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin -e $helitron_gff
else
${SCRIPT_DIR}/rcMergeRepeats -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/ -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin
fi
if [ ! -f "${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed" ]; then
echo "ERROR: strict merge also failed, check ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out looks as expected"
exit 2
fi
fi
}
# Subprocess pieChart
# Generate a pie chart summary of TE content
charts()
{
cd ${OUTDIR}/${species}_summaryFiles/
if [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ]; then
${SCRIPT_DIR}/autoPie.sh -i ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.summary -t ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -p ${OUTDIR}/${species}_summaryFiles/${species}.summaryPie.pdf -o ${OUTDIR}/${species}_summaryFiles/${species}.highLevelCount.txt
else
${SCRIPT_DIR}/autoPie.sh -i ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.summary -t ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -p ${OUTDIR}/${species}_summaryFiles/${species}.summaryPie.pdf -o ${OUTDIR}/${species}_summaryFiles/${species}.highLevelCount.txt
fi
}
# Subprocess calcDivRL
# Calculate divergence estimates
calcDivRL()
{
cd ${OUTDIR}/${species}_RepeatLandscape
# divergence_calc.py calls pybedtools.set_tempdir() which sets tempfile.tempdir
# globally. forkserver creates its AF_UNIX socket under tempfile.tempdir, so a
# long output path can exceed the 108-char socket limit (OSError: AF_UNIX path
# too long). Pass a short per-species path in /tmp to avoid this.
divcalc_tmp="/tmp/egdiv_${species}"
mkdir -p "${divcalc_tmp}"
if [ -z "$RepSpec" ] && [ -z "$startCust" ]; then
python ${SCRIPT_DIR}/divergenceCalc/divergence_calc.py -l $latestFile -g $genOrig -i ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff -o ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff -t $ProcNum -tmp "${divcalc_tmp}"
Rscript ${SCRIPT_DIR}/divergenceCalc/divergence_plot.R -s $species -g ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff -o ${OUTDIR}/${species}_RepeatLandscape/ && \
awk -F'\t' 'BEGIN{OFS="\t"} {gsub("NAME=","Name=",$9); gsub("TSTART=","tstart=",$9); gsub("TEND=","tend=",$9); gsub("SHORTTE=","shortte=",$9); gsub("TEGROUP=","tegroup=",$9); gsub("KIMURA80=","kimura80=",$9); gsub("NESTED=", "nested=", $9); print}' ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff > ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff
rm -rf "${divcalc_tmp}"
cp ${OUTDIR}/${species}_RepeatLandscape/*.pdf ${OUTDIR}/${species}_summaryFiles/
cp ${OUTDIR}/${species}_RepeatLandscape/*_summary_table.tsv ${OUTDIR}/${species}_summaryFiles/${species}_divergence_summary_table.tsv
else
python ${SCRIPT_DIR}/divergenceCalc/divergence_calc.py -l ${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta -g $genOrig -i ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff -o ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff -t $ProcNum -tmp "${divcalc_tmp}"
Rscript ${SCRIPT_DIR}/divergenceCalc/divergence_plot.R -s $species -g ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff -o ${OUTDIR}/${species}_RepeatLandscape/ && \
awk -F'\t' 'BEGIN{OFS="\t"} {gsub("NAME=","Name=",$9); gsub("TSTART=","tstart=",$9); gsub("TEND=","tend=",$9); gsub("SHORTTE=","shortte=",$9); gsub("TEGROUP=","tegroup=",$9); gsub("KIMURA80=","kimura80=",$9); gsub("NESTED=", "nested=", $9); print}' ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff > ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff
rm -rf "${divcalc_tmp}"
cp ${OUTDIR}/${species}_RepeatLandscape/*.pdf ${OUTDIR}/${species}_summaryFiles/
cp ${OUTDIR}/${species}_RepeatLandscape/*_summary_table.tsv ${OUTDIR}/${species}_summaryFiles/${species}_divergence_summary_table.tsv
fi
}
# Subprocess sweepUp
# Puts required files into a summary folder
sweepUp()
{
cd ${OUTDIR}/${species}_summaryFiles/
if [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ] && [ ! -f "${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta" ]; then
cp ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff $latestFile ${OUTDIR}/${species}_summaryFiles/
elif [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ] && [ -f "${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta" ]; then
cp ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff ${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta $latestFile ${OUTDIR}/${species}_summaryFiles/
elif [ ! -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ] && [ ! -f "${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta" ]; then
cp ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.gff $latestFile ${OUTDIR}/${species}_summaryFiles/
elif [ ! -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ] && [ -f "${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta" ]; then
cp ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.gff ${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta $latestFile ${OUTDIR}/${species}_summaryFiles/
else
echo "Error: Cannot find required summary files, please check one of the following is present: ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed and/or ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed"
fi
}
# Subprocess runningTea
runningTea()
{
echo "
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< $stage >>>"
}
# Subprocess GetTime
convertsecs()
{
h=$(bc <<< "${1}/3600")
m=$(bc <<< "(${1}%3600)/60")
s=$(bc <<< "${1}%60")
printf "%02d:%02d:%05.2f\n" $h $m $s
}
# Subprocess Checks
Checks()
{
if [ -z "$genome" ] || [ -z "$species" ] || [ -z "$directory" ] ; then
usage; exit 1
fi
if [ -z "$ProcNum" ] ; then
ProcNum=1; echo "$ProcNum Cores Will Be Used"
else
echo "$ProcNum Cores Will Be Used"
fi
if [ -z "$RepSpec" ] && [ -z "$startCust" ] ; then
echo "RepeatMasker species not specified, running Earl Grey without an initial mask with known repeats"
else
echo "Running Earl Grey with an intial mask with known repeats"
fi
if [ -z "$num" ] ; then
num=10; echo "De Novo Sequences Will Be Extended Through a Maximum of $num Iterations"
else
echo "De Novo Sequences Will Be Extended Through a Maximum of $num Iterations"
fi
if [ -z "$no_seq" ] ; then
no_seq=20; echo "$no_seq sequences will be used in BEAT consensus generation"
else
echo "$no_seq sequences will be used in BEAT consensus generation"
fi
if [ -z "$cluster" ] || [ "$cluster" == "no" ] ; then
cluster=no; echo "TE library consensus sequences will not be clustered"
elif [ "$cluster" == "yes" ]; then
cluster=yes; echo "TE library consensus sequences will be clustered, be aware of the impact this can have on subfamilies and chimeric TEs!"
else
cluster=no; echo "Clustering not specified using (yes/no). Using default parameter (no)."
fi
if [ -z "$softMask" ] || [ "$softMask" == "no" ] ; then
softMask=no; echo "Softmasked genome will not be generated"
elif [ "$softMask" == "yes" ]; then
softMask=yes; echo "Softmasked genome will be generated"
else
softMask=no; echo "Softmask not specified using (yes/no). Using default parameter (no)."
fi
if [ "$margin" == "yes" ] ; then
margin=yes; echo "Short TE sequences <100bp will be removed from annotation"
elif [ -z "$margin" ] || [ "$margin" == "no" ]; then
margin=no; echo "Short TE sequences <100bp will not be removed from annotation"
else
margin=no; echo "Margin not specified using (yes/no). Using default parameter (no)."
fi
if [ -z "$Flank" ]; then
Flank=1000; echo "Blast, Extend, Align, Trim Process Will Add 1000bp to Each End in Each Iteration"
else
echo "Blast, Extract, Extend, Trim Process Will Add $Flank to Each End in Each Iteration"
fi
if [ -z "$min_seq" ]; then
min_seq=3; echo "Blast, Extend, Align, Trim Process Will Require 3 Sequences to Generate a New Consensus Sequence"
else
echo "Blast, Extend, Align, Trim Process Will Require $min_seq Sequences to Generate a New Consensus Sequence"
fi
if [ ! -d "$SCRIPT_DIR" ]; then
echo "ERROR: Script directory variable not set, please run the configure script in the Earl Grey directory before attempting to run Earl Grey"; exit 1
fi
if [ ! -d "$SCRIPT_DIR/TEstrainer/" ]; then
echo "ERROR: teStrainer module not found, please check all modules are present and run the configure script in the Earl Grey directory before attempting to run Earl Grey"; exit 1
fi
if [ "$heli" == "yes" ] ; then
heli=yes; echo "HELITRON detection will be run using HELIANO"
elif [ -z "$heli" ] || [ "$heli" == "no" ]; then
heli=no; echo "HELITRON detection will not be run"
else
heli=no; echo "HELITRON detection not specified using (yes/no). Using default parameter (no)."
fi
if [ "$quietBar" == "yes" ] ; then
quietBar=yes; echo "TEstrainer parallel progress bar will be suppressed"
elif [ -z "$quietBar" ] || [ "$quietBar" == "no" ]; then
quietBar=no; echo "TEstrainer parallel progress bar will be shown"
else
quietBar=no; echo "quietBar not specified using (yes/no). Using default parameter (no)."
fi
echo "Please cite the following paper when using this software:"
echo "Baril, T., Galbraith, J. and Hayward, A., 2024. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Molecular Biology and Evolution, 41(4), p.msae068."
}
# Subprocess Dfam Checks
dfamCheck()
{
# Dfam 3.9 checks
library_path=$(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/Libraries/famdb|g')
expected_file="${library_path}/dfam39_full.0.h5"
file_count=$(find "$library_path" -maxdepth 1 -type f | wc -l)
if [ $file_count -eq 2 ] && [ -f $expected_file ] && [ -f ${library_path}/rmlib.config ] ; then
echo "WARNING: Earl Grey v6.1.0 has updated to Dfam v3.9."
echo "Before using Earl Grey, you MUST download the required partitions from Dfam (https://dfam.org/releases/current/families/FamDB/)"
echo "These must be located at ${library_path}/. Then, you must reconfigure RepeatMasker before using Earl Grey."
echo "I have written the required commands for you to faciliate the process."
echo "Please follow these carefully to ensure successful configuration of RepeatMasker with Dfam 3.9."
echo "As a note, these DB paritions are BIG, so only download the ones you need for your studies (or, if you do need all of them, make sure you have the space for them)"
echo "########################################################"
echo "A script for modification and automation of these steps has been generated in $(pwd)/configure_dfam39.sh"
echo "Please open the script in your preferred text editor (e.g. nano or vim) and change the partition numbers in [] to those that you would like to download"
echo "then, make the script executable (chmod +x $(pwd)/configure_dfam39.sh) and run it: ./configure_dfam39.sh"
echo "Alternatively, you can run the steps below one-by-one to achieve the same outcome:"
echo "########################################################"
echo ""
if [ -f "$(pwd)/configure_dfam39.sh" ]; then
rm $(pwd)/configure_dfam39.sh
fi
echo "# first, change directory to the famdb library location" | tee -a $(pwd)/configure_dfam39.sh
echo "cd ${library_path}/" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# download the partitions you require from Dfam 3.9. In the below, change the numbers or range inside the square brackets to choose your subsets." | tee -a $(pwd)/configure_dfam39.sh
echo "# e.g. to download partitions 0 to 10: [0-10]; or to download partitions 3,5, and 7: [3,5,7]; [0-16] is ALL PARTITIONS" | tee -a $(pwd)/configure_dfam39.sh
echo "curl -o 'dfam39_full.#1.h5.gz' 'https://dfam.org/releases/current/families/FamDB/dfam39_full.[0-16].h5.gz'" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# decompress Dfam 3.9 paritions" | tee -a $(pwd)/configure_dfam39.sh
echo "gunzip *.gz" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# move up to RepeatMasker main directory" | tee -a $(pwd)/configure_dfam39.sh
echo "cd $(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/|g')" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
#echo "# Replace famdb with version 2.0.1 (required for the Dfam 3.9)" | tee -a $(pwd)/configure_dfam39.sh
#echo "wget https://github.com/Dfam-consortium/FamDB/archive/refs/tags/2.0.1.tar.gz && tar --wildcards -zxvf 2.0.1.tar.gz FamDB-2.0.1/famdb.py FamDB-2.0.1/famdb_*.py" | tee -a $(pwd)/configure_dfam39.sh
#echo "mv FamDB-2.0.1/*.py . && rm -rf 2.0.1.tar.gz FamDB-2.0.1/" | tee -a $(pwd)/configure_dfam39.sh
#echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# save the min_init partition as a backup, just in case!" | tee -a $(pwd)/configure_dfam39.sh
echo "mv ${library_path}/min_init.0.h5 ${library_path}/min_init.0.h5.bak" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# Rerun RepeatMasker configuration" | tee -a $(pwd)/configure_dfam39.sh
echo "perl ./configure -libdir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/Libraries/|g') \
-trf_prgm $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin/trf|g') \
-rmblast_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-hmmer_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-abblast_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-crossmatch_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-default_search_engine rmblast" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
sed -i '/^perl/ s/\s\s*/ /g' $(pwd)/configure_dfam39.sh
echo "########################################################"
echo ""
echo "Following these steps should enable you to use Dfam 3.9 with RepeatMasker and Earl Grey"
echo "You will not see this message again if configuration is successful"
echo "If you only want partition 0, you do not have to do anything except rerun your Earl Grey command"
echo "If you are still having trouble, please open an issue on the Earl Grey github and we will try to help!"
touch ${library_path}/.earlgrey.config.complete
exit 2
fi
}
# Main #
while getopts g:s:o:t:f:i:r:c:l:m:d:n:a:e:q:h option
do
case "${option}"
in
g) genome=${OPTARG};;
s) species=${OPTARG};;
o) directory=${OPTARG};;
t) ProcNum=${OPTARG};;
f) Flank=${OPTARG};;
i) num=${OPTARG};;
r) RepSpec=${OPTARG};;
l) startCust=${OPTARG};;
c) cluster=${OPTARG};;
m) margin=${OPTARG};;
d) softMask=${OPTARG};;
n) no_seq=${OPTARG};;
a) min_seq=${OPTARG};;
e) heli=${OPTARG};;
q) quietBar=${OPTARG};;
h) usage; exit 0;;
esac
done
SECONDS=0
# Step 1 - set up the directories and make sure all modules are present
stage="Checking Parameters" && runningTea
SCRIPT_DIR=/data/toby/EarlGrey/scripts/
dfamCheck
Checks
stage="Making Directories" && runningTea
makeDirectory
# Start Logs - only bother starting if everything is present and correct
exec > >(tee "${OUTDIR}/${species}EarlGrey.log") 2>&1
stage="Cleaning Genome" && runningTea
prepGenome
sleep 1
# Step 2 - initial annotation, either de novo, or if required an initial RepeatMasker followed by a de novo run
if [ -z "$RepSpec" ]; then
sleep 1
else
if [ ! -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
stage="Getting RepeatMasker Sequences for $RepSpec and Saving as Fasta" && runningTea
getRepeatMaskerFasta
stage="Running Initial Mask with Known Repeats" && runningTea
firstMask
else
stage="Genome has already been masked, skipping..." && runningTea
getRepeatMaskerFasta
fi
fi
if [ -z "$startCust" ]; then
sleep 1
else
if [ ! -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
stage="Running Initial Mask with Known Repeats in Custom Library" && runningTea
firstMaskCustomLib
else
stage="Genome has already been masked, skipping..." && runningTea
fi
fi
# Step 3 - make a database from the genome, and then run a de novo TE annotation using RepeatModeler2
if [ ! -f ${OUTDIR}/${species}_Database/${species}-families.fa ]; then
stage="Detecting Novel Repeats" && runningTea
buildDB
deNovo1
sleep 1
else
stage="De novo repeats have already been detected, skipping..." && runningTea
sleep 1
fi
# Step 4 - run TEstrainer, which runs a BLAST, extract, extent, trim, on the the de novo repeat library to refine consensus sequences
if [ ! -s ${OUTDIR}/${species}_strainer/${species}-families.fa.strained ]; then
stage="Straining TEs and Refining de novo Consensus Sequences" && runningTea
if [ -f ${OUTDIR}/${species}_strainer/${species}-families.fa.strained ]; then
rm -r ${OUTDIR}/${species}_strainer/${species}-families.fa.strained ${OUTDIR}/${species}_strainer/TS*
strainer
elif [ -d ${OUTDIR}/${species}_strainer/TS_${species}-families.fa_* ]; then
stage="Resuming TEstrainer" && runningTea
strainerResume
else
strainer
fi
if [ ! -s ${OUTDIR}/${species}_strainer/${species}-families.fa.strained ]; then
echo "WARNING: TEstrainer failed to produce a strain file, please check the log file for more information. If you have run an intial mask with known repeats, this could be due RepeatModeler2 failing to identify any new repeats. Please check if this is expected."
fi
sleep 1
else
stage="TEs already strained, skipping..." && runningTea
latestFile=${OUTDIR}/${species}_strainer/${species}-families.fa.strained
sleep 1
fi
# Step 4.5 - cluster TE library IF REQUIRED
if [ "$cluster" == "yes" ]; then
stage="Clustering TE sequences to Wicker TE Family Level (80-80-80 rule)" && runningTea
clust
sleep 1
fi
# Stage 5 - run the final RepeatMasker annotation using the refined TE consensus library
if [ ! -f ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/*.tbl ]; then
stage="Identifying Repeats Using Species-Specific Library" && runningTea
novoMask
if [ ! -f ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/*.tbl ]; then
echo "ERROR: RepeatMasker failed, please check logs" && exit 2
fi
sleep 1
else
stage="Final masking already complete, skipping..." && runningTea
sleep 1
fi
# Stage 5.5 - Run HELIANO as an optional step
if [ "$heli" == "yes" ]; then
if [ ! -s ${OUTDIR}/${species}_heliano/*/RC.representative.gff ]; then
stage="Running HELIANO to Detect Helitrons" && runningTea
timestamp=$(date +"%Y%m%d_%H%M")
heliano_optional
else
stage="HELITRON detection already complete, skipping..." && runningTea
helitron_gff="$(realpath $(ls -td -- ${OUTDIR}/${species}_heliano/*/ | head -n 1) || true)/RC.representative.gff"
echo "HELITRON GFF: $helitron_gff"
fi
sleep 1
fi
# Stage 6
if [ ! -f ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed ] ; then
stage="Defragmenting Repeats" && runningTea
mergeRep
sleep 1
else
stage="Repeats already defragmented, skipping..." && runningTea
sleep 1
fi
# Stage 7
stage="Generating Summary Plots" && runningTea
charts
calcDivRL
sleep 1
# Stage 8
stage="Tidying Directories and Organising Important Files" && runningTea
sweepUp
sleep 1
# Stage 8.5 Optional Softmask
if [ "$softMask" == "yes" ]; then
stage="Generating Softmasked Genome" && runningTea
gunzip ${genome%.prep}.bak.gz && bedtools maskfasta -fi ${genome%.prep}.bak -bed ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed -fo ${OUTDIR}/${species}_summaryFiles/${species}.softmasked.fasta -soft && gzip ${genome%.prep}.bak
sleep 1
fi
# Stage 9
time=$(convertsecs $SECONDS)
stage="Done in $time" && runningTea
sleep 5
# Stage 10
stage="TE library, Summary Figures, and TE Quantifications in Standard Formats Can Be Found in ${OUTDIR}/${species}_summaryFiles/" && runningTea
cat ${OUTDIR}/${species}_summaryFiles/${species}.highLevelCount.kable
sleep 5
================================================
FILE: earlGreyAnnotationOnly
================================================
#!/bin/bash
set -eo pipefail
usage()
{
echo " #############################
earlGrey version 7.2.4 (AnnotationOnly)
Required Parameters:
-g == genome.fasta
-s == species name
-o == output directory
-l == Starting consensus library for annotation (in fasta format)
Optional Parameters:
-t == Number of Threads (DO NOT specify more than are available)
-r == RepeatMasker species for addition to custom library (Default: None)
-m == Remove putative spurious TE annotations <100bp? (yes/no, Default: no)
-d == Create soft-masked genome at the end? (yes/no, Default: no)
-e == Run HELIANO as an optional step to detect Helitrons (yes/no, Default: no)
-h == Show help
Example Usage:
earlGreyAnnotationOnly -g bombyxMori.fasta -s bombyxMori -o /home/toby/bombyxMori/repeatAnnotation/ -l bombyx-families.fa.strained -t 16
Queries can be sent to:
tobias.baril[at]unine.ch
Please make use of the GitHub Issues and Discussion Tabs at: https://github.com/TobyBaril/EarlGrey
#############################"
}
# Subprocess Make Directories #
makeDirectory()
{
directory=$(realpath ${directory})
mkdir -p ${directory}/${species}_EarlGrey/ && OUTDIR=${directory}/${species}_EarlGrey
mkdir -p ${OUTDIR}/${species}_Curated_Library/
mkdir -p $OUTDIR/${species}_RepeatMasker_Against_Custom_Library/
mkdir -p $OUTDIR/${species}_RepeatLandscape/
mkdir -p $OUTDIR/${species}_mergedRepeats/
mkdir -p ${OUTDIR}/${species}_summaryFiles/
if [ ! -z "$heli" ]; then
mkdir -p ${OUTDIR}/${species}_heliano/
fi
}
# Subprocess PrepGenome #
prepGenome()
{
if [ ! -L ${genome} ]; then
genome=$(realpath ${genome})
fi
if [ -L $genome ]; then
genome=$(realpath -s ${genome})
fi
if [ ! -f ${genome}.prep ] || [ ! -f ${genome}.dict ]; then
cp ${genome} ${genome}.bak && gzip -f ${genome}.bak
sed '/>/ s/[[:space:]].*//g; /^$/d' ${genome} > ${genome}.tmp
${SCRIPT_DIR}/headSwap.sh -i ${genome}.tmp -o ${genome}.prep && rm ${genome}.tmp
mv ${genome}.tmp.dict ${genome}.dict
sed -i '/^>/! s/[DVHBPE]/N/g' ${genome}.prep
dict=${genome}.dict
genOrig=${genome}
genome=${genome}.prep
else
dict=${genome}.dict
genOrig=${genome}
genome=${genome}.prep
fi
}
# Subprocess novoMask
# Run RepeatMasker with the final processed library
novoMask()
{
if [ -z "$RepSpec" ]; then
cd ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/
rmthreads=$(( ProcNum / 4 ))
RepeatMasker -lib $latestFile -no_is -lcambig -s -a -pa $rmthreads -dir ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/ $genome
else
mkdir -p $OUTDIR/${species}_Curated_Library/
cd ${OUTDIR}/${species}_Curated_Library/
cat $latestFile $RepSub > ${species}_combined_library.fasta
latestFile=$(realpath ${species}_combined_library.fasta)
cd ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/
rmthreads=$(( ProcNum / 4 ))
RepeatMasker -lib $latestFile -no_is -lcambig -s -a -pa $rmthreads -dir ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/ $genome
fi
}
# Subprocess heliano
# Run HELIANO as an optional step, then replace overlapping repeats in the merged output with HELIANO outputs
### TODO:
heliano_optional()
{
cd ${OUTDIR}/${species}_heliano/
heliano -g $genome --nearest -dn 6000 -flank_sim 0.5 -o ${OUTDIR}/${species}_heliano/HEL_${timestamp} -w 10000 -n $ProcNum
awk '{OFS="\t"}{print $1, "HELIANO", "RC/Helitron", $2+1, $3, $5, $6, ".", "ID="$9"_"$11";shortTE=F"}' ${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.bed > ${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.gff
helitron_gff=${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.gff
}
# Subprocess rcMergeRepeats
# Defragment repeat sequences to adjust for insertion times
mergeRep()
{
mkdir ${OUTDIR}/${species}_mergedRepeats/looseMerge
if [ -s "$helitron_gff" ]; then
echo "Running loose merge with HELIANO output"
${SCRIPT_DIR}/rcMergeRepeatsLoose -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/looseMerge -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin -e $helitron_gff
else
${SCRIPT_DIR}/rcMergeRepeatsLoose -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/looseMerge -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin
fi
if [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ]; then
awk '{OFS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, toupper($9)}' ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff > ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff.1 && mv ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff{.1,}
fi
if [ ! -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ]; then
echo "ERROR: loose merge defragmentation failed, trying strict merge..."
cd ${OUTDIR}/${species}_mergedRepeats/
if [ -s "$helitron_gff" ]; then
${SCRIPT_DIR}/rcMergeRepeats -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/ -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin -e $helitron_gff
else
${SCRIPT_DIR}/rcMergeRepeats -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/ -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin
fi
if [ ! -f "${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed" ]; then
echo "ERROR: strict merge also failed, check ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out looks as expected"
exit 2
fi
fi
}
# Subprocess pieChart
# Generate a pie chart summary of TE content
charts()
{
cd ${OUTDIR}/${species}_summaryFiles/
if [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ]; then
${SCRIPT_DIR}/autoPie.sh -i ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.summary -t ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -p ${OUTDIR}/${species}_summaryFiles/${species}.summaryPie.pdf -o ${OUTDIR}/${species}_summaryFiles/${species}.highLevelCount.txt
else
${SCRIPT_DIR}/autoPie.sh -i ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.summary -t ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -p ${OUTDIR}/${species}_summaryFiles/${species}.summaryPie.pdf -o ${OUTDIR}/${species}_summaryFiles/${species}.highLevelCount.txt
fi
}
# Subprocess calcDivRL
# Calculate divergence estimates
calcDivRL()
{
cd ${OUTDIR}/${species}_RepeatLandscape
if [ -z "$RepSpec" ]; then
python ${SCRIPT_DIR}/divergenceCalc/divergence_calc.py -l $latestFile -g $genOrig -i ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff -o ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff -t $ProcNum
Rscript ${SCRIPT_DIR}/divergenceCalc/divergence_plot.R -s $species -g ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff -o ${OUTDIR}/${species}_RepeatLandscape/ && \
awk -F'\t' 'BEGIN{OFS="\t"} {gsub("NAME=","Name=",$9); gsub("TSTART=","tstart=",$9); gsub("TEND=","tend=",$9); gsub("SHORTTE=","shortte=",$9); gsub("TEGROUP=","tegroup=",$9); gsub("KIMURA80=","kimura80=",$9); gsub("NESTED=", "nested=", $9); print}' ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff > ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff
rm -rf ${OUTDIR}/${species}_RepeatLandscape/tmp/
cp ${OUTDIR}/${species}_RepeatLandscape/*.pdf ${OUTDIR}/${species}_summaryFiles/
cp ${OUTDIR}/${species}_RepeatLandscape/*_summary_table.tsv ${OUTDIR}/${species}_summaryFiles/${species}_divergence_summary_table.tsv
else
python ${SCRIPT_DIR}/divergenceCalc/divergence_calc.py -l ${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta -g $genOrig -i ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff -o ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff -t $ProcNum
Rscript ${SCRIPT_DIR}/divergenceCalc/divergence_plot.R -s $species -g ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff -o ${OUTDIR}/${species}_RepeatLandscape/ && \
awk -F'\t' 'BEGIN{OFS="\t"} {gsub("NAME=","Name=",$9); gsub("TSTART=","tstart=",$9); gsub("TEND=","tend=",$9); gsub("SHORTTE=","shortte=",$9); gsub("TEGROUP=","tegroup=",$9); gsub("KIMURA80=","kimura80=",$9); gsub("NESTED=", "nested=", $9); print}' ${OUTDIR}/${species}_RepeatLandscape/${species}.filteredRepeats.withDivergence.gff > ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff
rm -rf ${OUTDIR}/${species}_RepeatLandscape/tmp/
cp ${OUTDIR}/${species}_RepeatLandscape/*.pdf ${OUTDIR}/${species}_summaryFiles/
cp ${OUTDIR}/${species}_RepeatLandscape/*_summary_table.tsv ${OUTDIR}/${species}_summaryFiles/${species}_divergence_summary_table.tsv
fi
}
# Subprocess sweepUp
# Puts required files into a summary folder
sweepUp()
{
cd ${OUTDIR}/${species}_summaryFiles/
if [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ] && [ ! -f "${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta" ]; then
cp ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff $latestFile ${OUTDIR}/${species}_summaryFiles/
elif [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ] && [ -f "${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta" ]; then
cp ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff ${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta $latestFile ${OUTDIR}/${species}_summaryFiles/
elif [ ! -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ] && [ ! -f "${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta" ]; then
cp ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.gff $latestFile ${OUTDIR}/${species}_summaryFiles/
elif [ ! -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ] && [ -f "${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta" ]; then
cp ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.gff ${OUTDIR}/${species}_Curated_Library/${species}_combined_library.fasta $latestFile ${OUTDIR}/${species}_summaryFiles/
else
echo "Error: Cannot find required summary files, please check one of the following is present: ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed and/or ${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed"
fi
}
# Subprocess runningTea
runningTea()
{
echo "
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< $stage >>>"
}
# Subprocess GetTime
convertsecs()
{
h=$(bc <<< "${1}/3600")
m=$(bc <<< "(${1}%3600)/60")
s=$(bc <<< "${1}%60")
printf "%02d:%02d:%05.2f\n" $h $m $s
}
# Subprocess Checks
Checks()
{
if [ -z "$genome" ] || [ -z "$species" ] || [ -z "$directory" ] || [ -z "$startCust" ] ; then
usage; exit 1
fi
if [ -z "$ProcNum" ] ; then
ProcNum=1; echo "$ProcNum Cores Will Be Used"
else
echo "$ProcNum Cores Will Be Used"
fi
if [ -z "$RepSpec" ]; then
echo "RepeatMasker species not specified, running Earl Grey without known repeats"
else
echo "Running Earl Grey with sequences from configured databases in addition to custom library"
fi
if [ -z "$softMask" ] || [ "$softMask" == "no" ] ; then
softMask=no; echo "Softmasked genome will not be generated"
elif [ "$softMask" == "yes" ]; then
softMask=yes; echo "Softmasked genome will be generated"
else
softMask=no; echo "Softmask not specified using (yes/no). Using default parameter (no)."
fi
if [ "$margin" == "yes" ] ; then
margin=yes; echo "Short TE sequences <100bp will be removed from annotation"
elif [ -z "$margin" ] || [ "$margin" == "no" ]; then
margin=no; echo "Short TE sequences <100bp will not be removed from annotation"
else
margin=no; echo "Margin not specified using (yes/no). Using default parameter (no)."
fi
if [ ! -d "$SCRIPT_DIR" ]; then
echo "ERROR: Script directory variable not set, please run the configure script in the Earl Grey directory before attempting to run Earl Grey"; exit 1
fi
if [ ! -d "$SCRIPT_DIR/TEstrainer/" ]; then
echo "ERROR: teStrainer module not found, please check all modules are present and run the configure script in the Earl Grey directory before attempting to run Earl Grey"; exit 1
fi
if [ "$heli" == "yes" ] ; then
heli=yes; echo "HELITRON detection will be run using HELIANO"
elif [ -z "$heli" ] || [ "$heli" == "no" ]; then
heli=no; echo "HELITRON detection will not be run"
else
heli=no; echo "HELITRON detection not specified using (yes/no). Using default parameter (no)."
fi
echo "Please cite the following paper when using this software:"
echo "Baril, T., Galbraith, J. and Hayward, A., 2024. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Molecular Biology and Evolution, 41(4), p.msae068."
# Dfam 3.9 checks
library_path=$(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/Libraries/famdb|g')
expected_file="${library_path}/dfam39_full.0.h5"
file_count=$(find "$library_path" -maxdepth 1 -type f | wc -l)
if [ $file_count -eq 2 ] && [ -f $expected_file ] && [ -f ${library_path}/rmlib.config ] ; then
echo "WARNING: Earl Grey v6.1.0 has updated to Dfam v3.9."
echo "Before using Earl Grey, you MUST download the required partitions from Dfam (https://dfam.org/releases/current/families/FamDB/)"
echo "These must be located at ${library_path}/. Then, you must reconfigure RepeatMasker before using Earl Grey."
echo "I have written the required commands for you to faciliate the process."
echo "Please follow these carefully to ensure successful configuration of RepeatMasker with Dfam 3.9."
echo "As a note, these DB paritions are BIG, so only download the ones you need for your studies (or, if you do need all of them, make sure you have the space for them)"
echo "########################################################"
echo "A script for modification and automation of these steps has been generated in $(pwd)/configure_dfam39.sh"
echo "Please open the script in your preferred text editor (e.g. nano or vim) and change the partition numbers in [] to those that you would like to download"
echo "then, make the script executable (chmod +x $(pwd)/configure_dfam39.sh) and run it: ./configure_dfam39.sh"
echo "Alternatively, you can run the steps below one-by-one to achieve the same outcome:"
echo "########################################################"
echo ""
if [ -f "$(pwd)/configure_dfam39.sh" ]; then
rm $(pwd)/configure_dfam39.sh
fi
echo "# first, change directory to the famdb library location" | tee -a $(pwd)/configure_dfam39.sh
echo "cd ${library_path}/" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# download the partitions you require from Dfam 3.9. In the below, change the numbers or range inside the square brackets to choose your subsets." | tee -a $(pwd)/configure_dfam39.sh
echo "# e.g. to download partitions 0 to 10: [0-10]; or to download partitions 3,5, and 7: [3,5,7]; [0-16] is ALL PARTITIONS" | tee -a $(pwd)/configure_dfam39.sh
echo "curl -o 'dfam39_full.#1.h5.gz' 'https://dfam.org/releases/current/families/FamDB/dfam39_full.[0-16].h5.gz'" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# decompress Dfam 3.9 paritions" | tee -a $(pwd)/configure_dfam39.sh
echo "gunzip *.gz" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# move up to RepeatMasker main directory" | tee -a $(pwd)/configure_dfam39.sh
echo "cd $(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/|g')" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
#echo "# Replace famdb with version 2.0.1 (required for the Dfam 3.9)" | tee -a $(pwd)/configure_dfam39.sh
#echo "wget https://github.com/Dfam-consortium/FamDB/archive/refs/tags/2.0.1.tar.gz && tar --wildcards -zxvf 2.0.1.tar.gz FamDB-2.0.1/famdb.py FamDB-2.0.1/famdb_*.py" | tee -a $(pwd)/configure_dfam39.sh
#echo "mv FamDB-2.0.1/*.py . && rm -rf 2.0.1.tar.gz FamDB-2.0.1/" | tee -a $(pwd)/configure_dfam39.sh
#echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# save the min_init partition as a backup, just in case!" | tee -a $(pwd)/configure_dfam39.sh
echo "mv ${library_path}/min_init.0.h5 ${library_path}/min_init.0.h5.bak" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# Rerun RepeatMasker configuration" | tee -a $(pwd)/configure_dfam39.sh
echo "perl ./configure -libdir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/Libraries/|g') \
-trf_prgm $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin/trf|g') \
-rmblast_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-hmmer_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-abblast_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-crossmatch_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-default_search_engine rmblast" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
sed -i '/^perl/ s/\s\s*/ /g' $(pwd)/configure_dfam39.sh
echo "########################################################"
echo ""
echo "Following these steps should enable you to use Dfam 3.9 with RepeatMasker and Earl Grey"
echo "You will not see this message again if configuration is successful"
echo "If you only want partition 0, you do not have to do anything except rerun your Earl Grey command"
echo "If you are still having trouble, please open an issue on the Earl Grey github and we will try to help!"
touch ${library_path}/.earlgrey.config.complete
exit 2
fi
}
# Main #
while getopts g:s:o:t:r:l:m:d:e:h option
do
case "${option}"
in
g) genome=${OPTARG};;
s) species=${OPTARG};;
o) directory=${OPTARG};;
t) ProcNum=${OPTARG};;
f) Flank=${OPTARG};;
i) num=${OPTARG};;
r) RepSpec=${OPTARG};;
l) startCust=${OPTARG};;
c) cluster=${OPTARG};;
m) margin=${OPTARG};;
d) softMask=${OPTARG};;
n) no_seq=${OPTARG};;
a) min_seq=${OPTARG};;
e) heli=${OPTARG};;
h) usage; exit 0;;
esac
done
SECONDS=0
# Step 1 - set up the directories and make sure all modules are present
stage="Checking Parameters" && runningTea
SCRIPT_DIR=/data/toby/EarlGrey/scripts/
Checks
stage="Making Directories" && runningTea
makeDirectory
# Start Logs - only bother starting if everything is present and correct
exec > >(tee "${OUTDIR}/${species}EarlGrey.log") 2>&1
stage="Cleaning Genome" && runningTea
prepGenome
sleep 1
# Stage 2 - Assign libraries to variables
latestFile=$(realpath $startCust)
if [ ! -z "$RepSpec" ]; then
if [[ $(which RepeatMasker) == *"bin"* ]]; then
libpath="$(which RepeatMasker | sed 's|bin/RepeatMasker|share/RepeatMasker/Libraries/famdb/|')"
else
libpath="$(which RepeatMasker | sed 's|/[^/]*$||g')/Libraries/famdb/"
fi
if [[ $(which RepeatMasker) == *"bin"* ]]; then
PATH=$PATH:"$(which RepeatMasker | sed 's|bin/RepeatMasker|share/RepeatMasker/|g')"
fi
famdb.py -i $libpath families -f fasta_name --include-class-in-name -a -d --curated $RepSpec > ${OUTDIR}/${species}_Curated_Library/${RepSpec}.RepeatMasker.lib
RepSub=${OUTDIR}/${species}_Curated_Library/${RepSpec}.RepeatMasker.lib
fi
# Stage 5 - run the final RepeatMasker annotation using the refined TE consensus library
if [ ! -f ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/*.tbl ]; then
stage="Identifying Repeats Using Species-Specific Library" && runningTea
novoMask
if [ ! -f ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/*.tbl ]; then
echo "ERROR: RepeatMasker failed, please check logs" && exit 2
fi
sleep 1
else
stage="Final masking already complete, skipping..." && runningTea
sleep 1
fi
# Stage 5.5 - Run HELIANO as an optional step
if [ "$heli" == "yes" ]; then
if [ ! -s ${OUTDIR}/${species}_heliano/*/RC.representative.gff ]; then
stage="Running HELIANO to Detect Helitrons" && runningTea
timestamp=$(date +"%Y%m%d_%H%M")
heliano_optional
else
stage="HELITRON detection already complete, skipping..." && runningTea
helitron_gff="$(realpath $(ls -td -- ${OUTDIR}/${species}_heliano/*/ | head -n 1) || true)/RC.representative.gff"
echo "HELITRON GFF: $helitron_gff"
fi
sleep 1
fi
# Stage 6
if [ ! -f ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed ] ; then
stage="Defragmenting Repeats" && runningTea
mergeRep
sleep 1
else
stage="Repeats already defragmented, skipping..." && runningTea
sleep 1
fi
# Stage 7
stage="Generating Summary Plots" && runningTea
charts
calcDivRL
sleep 1
# Stage 8
stage="Tidying Directories and Organising Important Files" && runningTea
sweepUp
sleep 1
# Stage 8.5 Optional Softmask
if [ "$softMask" == "yes" ]; then
stage="Generating Softmasked Genome" && runningTea
gunzip ${genome%.prep}.bak.gz && bedtools maskfasta -fi ${genome%.prep}.bak -bed ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed -fo ${OUTDIR}/${species}_summaryFiles/${species}.softmasked.fasta -soft && gzip ${genome%.prep}.bak
sleep 1
fi
# Stage 9
time=$(convertsecs $SECONDS)
stage="Done in $time" && runningTea
sleep 5
# Stage 10
stage="TE library, Summary Figures, and TE Quantifications in Standard Formats Can Be Found in ${OUTDIR}/${species}_summaryFiles/" && runningTea
cat ${OUTDIR}/${species}_summaryFiles/${species}.highLevelCount.kable
sleep 5
================================================
FILE: earlGreyLibConstruct
================================================
#!/bin/bash
set -eo pipefail
usage()
{
echo " #############################
earlGrey version 7.2.4 (Library Construction Only)
Required Parameters:
-g == genome.fasta
-s == species name
-o == output directory
Optional Parameters:
-t == Number of Threads (DO NOT specify more than are available)
-r == RepeatMasker search term (e.g arthropoda/eukarya)
-l == Starting consensus library for an initial mask (in fasta format)
-i == Number of Iterations to BLAST, Extract, Extend (Default: 10)
-f == Number flanking basepairs to extract (Default: 1000)
-n == Max number of sequences used to generate consensus sequences (Default: 20)
-a == minimum number of sequences required to build a consensus sequence (Default: 3)
-h == Show help
Example Usage:
earlGrey -g bombyxMori.fasta -s bombyxMori -o /home/toby/bombyxMori/repeatAnnotation/ -t 16
Queries can be sent to:
tobias.baril[at]unine.ch
Please make use of the GitHub Issues and Discussion Tabs at: https://github.com/TobyBaril/EarlGrey
#############################"
}
# Subprocess Make Directories #
makeDirectory()
{
directory=$(realpath ${directory})
mkdir -p ${directory}/${species}_EarlGrey/ && OUTDIR=${directory}/${species}_EarlGrey
if [ ! -z "$RepSpec" ] || [ ! -z "$startCust" ] ; then
mkdir -p $OUTDIR/${species}_RepeatMasker/
fi
mkdir -p $OUTDIR/${species}_Database/
mkdir -p $OUTDIR/${species}_RepeatModeler/
mkdir -p $OUTDIR/${species}_strainer/
mkdir -p ${OUTDIR}/${species}_summaryFiles/
}
# Subprocess PrepGenome #
prepGenome()
{
if [ ! -L ${genome} ]; then
genome=$(realpath ${genome})
fi
if [ -L $genome ]; then
genome=$(realpath -s ${genome})
fi
if [ ! -f ${genome}.prep ] || [ ! -f ${genome}.dict ]; then
cp ${genome} ${genome}.bak && gzip -f ${genome}.bak
sed '/>/ s/[[:space:]].*//g; /^$/d' ${genome} > ${genome}.tmp
${SCRIPT_DIR}/headSwap.sh -i ${genome}.tmp -o ${genome}.prep && rm ${genome}.tmp
mv ${genome}.tmp.dict ${genome}.dict
sed -i.bak '/^>/! s/[DVHBPE]/N/g' ${genome}.prep
dict=${genome}.dict
genOrig=${genome}
genome=${genome}.prep
else
dict=${genome}.dict
genOrig=${genome}
genome=${genome}.prep
fi
}
# Subprocess getRepeatMaskerFasta
# Generate a copy of the RepeatMasker library subset used for the initial TE mask in FASTA format (only used if a RepeatMasker run is specified)
getRepeatMaskerFasta()
{
if [[ $RepSpec = *" "* ]]; then
echo "ERROR: You have entered a species name that contains a space, please use the NCBI TaxID rather than name. E.G In place of \"Homo sapiens\" use \"9606\""
exit 2
fi
if [[ $(which RepeatMasker) == *"bin"* ]]; then
libpath="$(which RepeatMasker | sed 's|bin/RepeatMasker|share/RepeatMasker/Libraries/famdb/|')"
else
libpath="$(which RepeatMasker | sed 's|/[^/]*$||g')/Libraries/famdb/"
fi
if [[ $(which RepeatMasker) == *"bin"* ]]; then
PATH=$PATH:"$(which RepeatMasker | sed 's|bin/RepeatMasker|share/RepeatMasker/|')"
fi
mkdir -p $OUTDIR/${species}_Curated_Library/
famdb.py -i $libpath families -f fasta_name --include-class-in-name -a -d --curated $RepSpec > ${OUTDIR}/${species}_Curated_Library/${RepSpec}.RepeatMasker.lib
RepSub=${OUTDIR}/${species}_Curated_Library/${RepSpec}.RepeatMasker.lib
}
# Subprocess firstMask
# Run the initial RepeatMasker run with the specified species subset (only used if a RepeatMasker run is specified)
firstMask()
{
cd ${OUTDIR}/${species}_RepeatMasker
rmthreads=$(( ProcNum / 4 ))
RepeatMasker -species $RepSpec -no_is -lcambig -s -a -pa $rmthreads -dir $OUTDIR/${species}_RepeatMasker $genome
if [ ! -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
echo "ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid species search term, if issue persists please use NCBI Taxids (E.G Drosophila is replaced with 7125)"; exit 2
fi
}
# Subprocess firstMaskCustomLib
# Run the initial RepeatMasker run with the specific fasta consensus library
firstMaskCustomLib()
{
cd ${OUTDIR}/${species}_RepeatMasker
rmthreads=$(( ProcNum / 4 ))
RepeatMasker -lib ${startCust} -no_is -lcambig -s -a -pa $rmthreads -dir $OUTDIR/${species}_RepeatMasker $genome
RepSub=${startCust}
if [ ! -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
echo "ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid custom consensus file, check the fasta file looks as expected"; exit 2
fi
}
# Subprocess buildDB
# if masked genome exists, build database from this. If not, build database from original input genome
buildDB()
{
if [ -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
cd ${OUTDIR}/${species}_Database
BuildDatabase -name ${species} ${OUTDIR}/${species}_RepeatMasker/*.masked
else
cd ${OUTDIR}/${species}_Database
BuildDatabase -name ${species} $genome
fi
}
# Subprocess deNovo1
# Run the initial de novo annotation run with RepeatModeler2.
# The genomeSampleSizeMax flag is set based on the samplable genome size (contigs >= 40 kb)
# to avoid RepeatModeler failures on small or fragmented genomes.
deNovo1()
{
cd ${OUTDIR}/${species}_RepeatModeler
# Compute samplable genome size: sum of contigs >= 40 kb only.
# RepeatModeler discards contigs shorter than 40 kb during sampling, so using
# the total genome size would overestimate the sequence available per round.
GENOME_SIZE=$(awk '/^>/{if(len>=40000)sum+=len; len=0; next}{len+=length($0)} END{if(len>=40000)sum+=len; print sum+0}' "${genome}")
echo "Samplable genome size (contigs >= 40 kb): ${GENOME_SIZE} bp"
# Set -genomeSampleSizeMax to the highest RECON round threshold the genome
# can support. Thresholds are cumulative across rounds (r2=3M, r3=9M, r4=27M,
# r5=81M, r6=243M), so the samplable size must cover all rounds up to the cap:
# >= 363M (3+9+27+81+243): no cap -> all 6 rounds
# >= 120M (3+9+27+81): cap 81M -> rounds 1-5
# >= 39M (3+9+27): cap 27M -> rounds 1-4
# >= 12M (3+9): cap 9M -> rounds 1-3
# < 12M: cap 3M -> rounds 1-2
SAMPLE_FLAG=""
if [ "${GENOME_SIZE}" -ge 363000000 ] 2>/dev/null; then
: # genome large enough for all rounds; use RepeatModeler default
elif [ "${GENOME_SIZE}" -ge 120000000 ] 2>/dev/null; then
echo "Capping at round 5 (-genomeSampleSizeMax 81000000)"
SAMPLE_FLAG="-genomeSampleSizeMax 81000000"
elif [ "${GENOME_SIZE}" -ge 39000000 ] 2>/dev/null; then
echo "Capping at round 4 (-genomeSampleSizeMax 27000000)"
SAMPLE_FLAG="-genomeSampleSizeMax 27000000"
elif [ "${GENOME_SIZE}" -ge 12000000 ] 2>/dev/null; then
echo "Capping at round 3 (-genomeSampleSizeMax 9000000)"
SAMPLE_FLAG="-genomeSampleSizeMax 9000000"
else
echo "Capping at round 2 (-genomeSampleSizeMax 3000000)"
SAMPLE_FLAG="-genomeSampleSizeMax 3000000"
fi
RepeatModeler -threads ${ProcNum} -database ${OUTDIR}/${species}_Database/${species} ${SAMPLE_FLAG}
if [ ! -e ${OUTDIR}/${species}_Database/${species}-families.fa ]; then
echo "ERROR: RepeatModeler Failed"
exit 2
fi
}
# Subprocess strainer # CHECK FILE STRUCTURE FOR EARL GREY RUN
# contains the BLAST, Extract, Extend, Trim pipeline from James Galbraith
strainer()
{
cd ${OUTDIR}/${species}_strainer/
if [ "$ProcNum" -ge 4 ]; then
strainthreads=$(( ProcNum / 4 ))
else
strainthreads=$ProcNum
fi
${SCRIPT_DIR}/TEstrainer/TEstrainer_for_earlGrey.sh -g $genome -l ${OUTDIR}/${species}_Database/${species}-families.fa -t ${strainthreads} -f $Flank -r $num -n $no_seq -m $min_seq
latestFile="$(realpath $(ls -td -- ${OUTDIR}/${species}_strainer/*/ | head -n 1) || true)/${species}-families.fa.strained"
cp $latestFile ${OUTDIR}/${species}_strainer/
}
# Subprocess strainerResume
# resumes interrupted TEstrainer runs
strainerResume() {
cd ${OUTDIR}/${species}_strainer/
strainDataDir=$(find . -maxdepth 1 -type d -name "TS_${species}-families.fa_*")
if [ ! $(echo "$strainDataDir" | wc -l) -eq 1 ]; then
echo "ERROR: There are more TEstrainer directories than expected present"
exit 2
fi
if [ "$ProcNum" -ge 4 ]; then
strainthreads=$(( ProcNum / 4 ))
else
strainthreads=$ProcNum
fi
${SCRIPT_DIR}/TEstrainer/TEstrainer_for_earlGrey.sh -g $genome -l ${OUTDIR}/${species}_Database/${species}-families.fa -t ${strainthreads} -f $Flank -r $num -n $no_seq -m $min_seq -d ${strainDataDir} -R
latestFile="$(realpath $(ls -td -- ${OUTDIR}/${species}_strainer/*/ | head -n 1) || true)/${species}-families.fa.strained"
cp $latestFile ${OUTDIR}/${species}_strainer/
}
# Subprocess sweepUp
# Puts required files into a summary folder
sweepUp()
{
cd ${OUTDIR}/${species}_summaryFiles/
if [ -s ${latestFile} ]; then
cp $latestFile ${OUTDIR}/${species}_summaryFiles/
else
echo "ERROR: TEstrainer failed to generate a TE library, please check logs" && exit 1
fi
}
# Subprocess runningTea
runningTea()
{
echo "
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< $stage >>>"
}
# Subprocess GetTime
convertsecs()
{
h=$(bc <<< "${1}/3600")
m=$(bc <<< "(${1}%3600)/60")
s=$(bc <<< "${1}%60")
printf "%02d:%02d:%05.2f\n" $h $m $s
}
# Subprocess Checks
Checks()
{
if [ -z "$genome" ] || [ -z "$species" ] || [ -z "$directory" ] ; then
usage; exit 1
fi
if [ -z "$ProcNum" ] ; then
ProcNum=1; echo "$ProcNum Cores Will Be Used"
else
echo "$ProcNum Cores Will Be Used"
fi
if [ -z "$RepSpec" ] && [ -z "$startCust" ] ; then
echo "RepeatMasker species not specified, running Earl Grey without an initial mask with known repeats"
else
echo "Running Earl Grey with an intial mask with known repeats"
fi
if [ -z "$num" ] ; then
num=10; echo "De Novo Sequences Will Be Extended Through a Maximum of $num Iterations"
else
echo "De Novo Sequences Will Be Extended Through a Maximum of $num Iterations"
fi
if [ -z "$no_seq" ] ; then
no_seq=20; echo "$no_seq sequences will be used in BEAT consensus generation"
else
echo "$no_seq sequences will be used in BEAT consensus generation"
fi
if [ -z "$Flank" ]; then
Flank=1000; echo "Blast, Extend, Align, Trim Process Will Add 1000bp to Each End in Each Iteration"
else
echo "Blast, Extract, Extend, Trim Process Will Add $Flank to Each End in Each Iteration"
fi
if [ -z "$min_seq" ]; then
min_seq=3; echo "Blast, Extend, Align, Trim Process Will Require 3 Sequences to Generate a New Consensus Sequence"
else
echo "Blast, Extend, Align, Trim Process Will Require $min_seq Sequences to Generate a New Consensus Sequence"
fi
if [ ! -d "$SCRIPT_DIR" ]; then
echo "ERROR: Script directory variable not set, please run the configure script in the Earl Grey directory before attempting to run Earl Grey"; exit 1
fi
if [ ! -d "$SCRIPT_DIR/TEstrainer/" ]; then
echo "ERROR: teStrainer module not found, please check all modules are present and run the configure script in the Earl Grey directory before attempting to run Earl Grey"; exit 1
fi
echo "Please cite the following paper when using this software:"
echo "Baril, T., Galbraith, J. and Hayward, A., 2024. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Molecular Biology and Evolution, 41(4), p.msae068."
}
# Subprocess Dfam Checks
dfamCheck()
{
# Dfam 3.9 checks
library_path=$(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/Libraries/famdb|g')
expected_file="${library_path}/dfam39_full.0.h5"
file_count=$(find "$library_path" -maxdepth 1 -type f | wc -l)
if [ $file_count -eq 2 ] && [ -f $expected_file ] && [ -f ${library_path}/rmlib.config ] ; then
echo "WARNING: Earl Grey v6.1.0 has updated to Dfam v3.9."
echo "Before using Earl Grey, you MUST download the required partitions from Dfam (https://dfam.org/releases/current/families/FamDB/)"
echo "These must be located at ${library_path}/. Then, you must reconfigure RepeatMasker before using Earl Grey."
echo "I have written the required commands for you to faciliate the process."
echo "Please follow these carefully to ensure successful configuration of RepeatMasker with Dfam 3.9."
echo "As a note, these DB paritions are BIG, so only download the ones you need for your studies (or, if you do need all of them, make sure you have the space for them)"
echo "########################################################"
echo "A script for modification and automation of these steps has been generated in $(pwd)/configure_dfam39.sh"
echo "Please open the script in your preferred text editor (e.g. nano or vim) and change the partition numbers in [] to those that you would like to download"
echo "then, make the script executable (chmod +x $(pwd)/configure_dfam39.sh) and run it: ./configure_dfam39.sh"
echo "Alternatively, you can run the steps below one-by-one to achieve the same outcome:"
echo "########################################################"
echo ""
if [ -f "$(pwd)/configure_dfam39.sh" ]; then
rm $(pwd)/configure_dfam39.sh
fi
echo "# first, change directory to the famdb library location" | tee -a $(pwd)/configure_dfam39.sh
echo "cd ${library_path}/" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# download the partitions you require from Dfam 3.9. In the below, change the numbers or range inside the square brackets to choose your subsets." | tee -a $(pwd)/configure_dfam39.sh
echo "# e.g. to download partitions 0 to 10: [0-10]; or to download partitions 3,5, and 7: [3,5,7]; [0-16] is ALL PARTITIONS" | tee -a $(pwd)/configure_dfam39.sh
echo "curl -o 'dfam39_full.#1.h5.gz' 'https://dfam.org/releases/current/families/FamDB/dfam39_full.[0-16].h5.gz'" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# decompress Dfam 3.9 paritions" | tee -a $(pwd)/configure_dfam39.sh
echo "gunzip *.gz" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# move up to RepeatMasker main directory" | tee -a $(pwd)/configure_dfam39.sh
echo "cd $(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/|g')" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
#echo "# Replace famdb with version 2.0.1 (required for the Dfam 3.9)" | tee -a $(pwd)/configure_dfam39.sh
#echo "wget https://github.com/Dfam-consortium/FamDB/archive/refs/tags/2.0.1.tar.gz && tar --wildcards -zxvf 2.0.1.tar.gz FamDB-2.0.1/famdb.py FamDB-2.0.1/famdb_*.py" | tee -a $(pwd)/configure_dfam39.sh
#echo "mv FamDB-2.0.1/*.py . && rm -rf 2.0.1.tar.gz FamDB-2.0.1/" | tee -a $(pwd)/configure_dfam39.sh
#echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# save the min_init partition as a backup, just in case!" | tee -a $(pwd)/configure_dfam39.sh
echo "mv ${library_path}/min_init.0.h5 ${library_path}/min_init.0.h5.bak" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
echo "# Rerun RepeatMasker configuration" | tee -a $(pwd)/configure_dfam39.sh
echo "perl ./configure -libdir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/share/RepeatMasker/Libraries/|g') \
-trf_prgm $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin/trf|g') \
-rmblast_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-hmmer_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-abblast_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-crossmatch_dir $(which RepeatMasker | sed 's|/bin/RepeatMasker|/bin|g') \
-default_search_engine rmblast" | tee -a $(pwd)/configure_dfam39.sh
echo "" | tee -a $(pwd)/configure_dfam39.sh
sed -i '/^perl/ s/\s\s*/ /g' $(pwd)/configure_dfam39.sh
echo "########################################################"
echo ""
echo "Following these steps should enable you to use Dfam 3.9 with RepeatMasker and Earl Grey"
echo "You will not see this message again if configuration is successful"
echo "If you only want partition 0, you do not have to do anything except rerun your Earl Grey command"
echo "If you are still having trouble, please open an issue on the Earl Grey github and we will try to help!"
touch ${library_path}/.earlgrey.config.complete
exit 2
fi
}
# Main #
while getopts g:s:o:t:f:i:r:l:n:a:h option
do
case "${option}"
in
g) genome=${OPTARG};;
s) species=${OPTARG};;
o) directory=${OPTARG};;
t) ProcNum=${OPTARG};;
f) Flank=${OPTARG};;
i) num=${OPTARG};;
r) RepSpec=${OPTARG};;
l) startCust=${OPTARG};;
n) no_seq=${OPTARG};;
a) min_seq=${OPTARG};;
h) usage; exit 0;;
esac
done
SECONDS=0
# Step 1 - set up the directories and make sure all modules are present
stage="Checking Parameters" && runningTea
SCRIPT_DIR=/data/toby/mambaforge/envs/earlgrey_440/share/earlgrey-4.4.0-0/scripts/
dfamCheck
Checks
stage="Making Directories" && runningTea
makeDirectory
# Start Logs - only bother starting if everything is present and correct
exec > >(tee "${OUTDIR}/${species}EarlGrey.log") 2>&1
stage="Cleaning Genome" && runningTea
prepGenome
sleep 1
# Step 2 - initial annotation, either de novo, or if required an initial RepeatMasker followed by a de novo run
if [ -z "$RepSpec" ]; then
sleep 1
else
if [ ! -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
stage="Getting RepeatMasker Sequences for $RepSpec and Saving as Fasta" && runningTea
getRepeatMaskerFasta
stage="Running Initial Mask with Known Repeats" && runningTea
firstMask
else
stage="Genome has already been masked, skipping..." && runningTea
getRepeatMaskerFasta
fi
fi
if [ -z "$startCust" ]; then
sleep 1
else
if [ ! -f ${OUTDIR}/${species}_RepeatMasker/*.masked ]; then
stage="Running Initial Mask with Known Repeats in Custom Library" && runningTea
firstMaskCustomLib
else
stage="Genome has already been masked, skipping..." && runningTea
fi
fi
# Step 3 - make a database from the genome, and then run a de novo TE annotation using RepeatModeler2
if [ ! -f ${OUTDIR}/${species}_Database/${species}-families.fa ]; then
stage="Detecting Novel Repeats" && runningTea
buildDB
deNovo1
sleep 1
else
stage="De novo repeats have already been detected, skipping..." && runningTea
sleep 1
fi
# Step 4 - run TEstrainer, which runs a BLAST, extract, extent, trim, on the the de novo repeat library to refine consensus sequences
if [ ! -s ${OUTDIR}/${species}_strainer/${species}-families.fa.strained ]; then
stage="Straining TEs and Refining de novo Consensus Sequences" && runningTea
if [ -f ${OUTDIR}/${species}_strainer/${species}-families.fa.strained ]; then
rm -r ${OUTDIR}/${species}_strainer/${species}-families.fa.strained ${OUTDIR}/${species}_strainer/TS*
strainer
elif [ -d ${OUTDIR}/${species}_strainer/TS_${species}-families.fa_* ]; then
stage="Resuming TEstrainer" && runningTea
strainerResume
else
strainer
fi
if [ ! -s ${OUTDIR}/${species}_strainer/${species}-families.fa.strained ]; then
echo "WARNING: TEstrainer failed to produce a strain file, please check the log file for more information. If you have run an intial mask with known repeats, this could be due RepeatModeler2 failing to identify any new repeats. Please check if this is expected."
fi
sleep 1
else
stage="TEs already strained, skipping..." && runningTea
latestFile=${OUTDIR}/${species}_strainer/${species}-families.fa.strained
cp ${latestFile} ${OUTDIR}/${species}_summaryFiles/
sleep 1
fi
# Stage 5 - Tidy up!
stage="Tidying Directories and Organising Important Files" && runningTea
sweepUp
sleep 1
# Stage 6
time=$(convertsecs $SECONDS)
stage="Done in $time" && runningTea
sleep 5
# Stage 10
stage="TE library in Standard Format Can Be Found in ${OUTDIR}/${species}_summaryFiles/" && runningTea
sleep 5
================================================
FILE: legacyDocs.md
================================================
# This file contains installation instructions for legacy versions of Earl Grey
## We recommend installing via conda or docker for the latest stable releases of Earl Grey, the below are retained as a reference
#==============================================================================================================================================================================#
## Earl Grey Installation and Configuration (If you already have RepeatMasker and RepeatModeler)
If you do not currently have RepeatMasker and RepeatModeler installed, the instructions are provided further down this page. If you do have them installed, **ensure the executables are in your PATH environment, including the RepeatMasker/util/ directory!**
All of the scripts and associated modules are contained within this github repository. Earl Grey runs inside an anaconda environment to ensure all required packages are present and are the correct version. Therefore to run Earl Grey, you will require anaconda to be installed on your system.
**If anaconda is NOT installed on your system, please install it following these instructions:**
```
# Change to /tmp directory as we won't need the script after running it
cd /tmp
# Download the anaconda installation script
curl https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh --output anaconda.sh
# Run the script to install anaconda
bash anaconda.sh
# answer yes when asked, and install anaconda3 in the specified location (recommended) unless you want it to be installed elsewhere.
# When asked "Do you wish the installer to initialize Anaconda3 by running conda init?", answer "yes" for ease of use.
# Activate conda by refreshing terminal session
source ~/.bashrc
# If successful, you should now see (base) on the left of your username on the command line
```
**Now that Anaconda is installed, we can install Earl Grey**
Clone the Earl Grey github repo
```
# Clone into a home directory, or somewhere you want to install Earl Grey
git clone https://github.com/TobyBaril/EarlGrey
```
Enter the Earl Grey directory and configure the program
```
cd ./EarlGrey
chmod +x ./configure
./configure
```
Once this is complete, remember to activate the earlGrey conda environment before attempting to run the Earl Grey pipeline
```
conda activate earlGrey
earlGrey -g genome.fasta -s speciesName -o outputDirectory -t threads
```
For suggestions or questions, please use the discussion and issues functions in github.
Thank you for trying Earl Grey!
#==============================================================================================================================================================================#
## Earl Grey Installation and Configuration (If you DO NOT have RepeatMasker and RepeatModeler) - WITH SUDO PRIVILEGES
These instructions will guide you through configuring all required programs and scripts to run Earl Grey.
All of the scripts and associated modules are contained within this github repository. Earl Grey runs inside an anaconda environment to ensure all required packages are present and are the correct version. Therefore to run Earl Grey, you will require anaconda to be installed on your system.
**If anaconda is NOT installed on your system, please install it following these instructions:**
```
# Change to /tmp directory as we won't need the script after running it
cd /tmp
# Download the anaconda installation script
curl https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh --output anaconda.sh
# Run the script to install anaconda
bash anaconda.sh
# answer yes when asked, and install anaconda3 in the specified location (recommended) unless you want it to be installed elsewhere.
# When asked "Do you wish the installer to initialize Anaconda3 by running conda init?", answer "yes" for ease of use.
# Activate conda by refreshing terminal session
source ~/.bashrc
# If successful, you should now see (base) on the left of your username on the command line
```
**Now that Anaconda is installed, we can install Earl Grey**
### To install RepeatMasker
RepeatMasker can be downloaded from: http://www.repeatmasker.org/RepeatMasker/. Installation instructions can be found on the website. Alternatively, please use the code below:
You will need to download and install a couple of programs for RepeatMasker to work.
Download rmblast
```wget https://www.repeatmasker.org/rmblast/rmblast-2.13.0+-x64-linux.tar.gz```
Extract the rmblast package
```tar -zxvf rmblast-2.13.0+-x64-linux.tar.gz```
Make a note of the full path to rmblast-2.13.0/bin/ as you will need this in the RepeatMasker configuration
If you are not certain of the full path, please run the following command
```realpath ./rmblast-2.13.0/bin/```
Download RepeatMasker (this will download it to the current directory). NOTE some extra steps are required for Dfam 3.8, please refer to repeatmasker.org/RepeatMasker.
```wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.6.tar.gz```
Copy the RepeatMasker package to /usr/local/, or somewhere that all users will be able to access the installation.
Copying to /usr/local/ might require sudo privileges
```sudo cp RepeatMasker-4.1.6.tar.gz /usr/local/```
Change directory to /usr/local, and extract the RepeatMasker package. This might require sudo privileges.
```cd /usr/local/```
```sudo tar -zxvf RepeatMasker-4.1.6.tar.gz```
Install the required RepeatMasker libraries - Earl Grey has been tested with Dfam 3.4 onwards and RepBase.
Unfortunately, RepBase is now behind a paywall, but to ensure Earl Grey remains open it does not rely on RepBase, although inclusion of RepBase can improve classification of repeats by RepeatModeler. If you have access to this database, please include it in your configuration of RepeatMasker.
We recommend that you download Dfam 3.8 before using Earl Grey, and this is required for RepeatMasker 4.1.6. The Dfam library is large - this could take a while!
We recommend downloading Dfam into your home directory (~/) or a subdirectory of home
Change directory to home
```cd ~/```
Download lastest Dfam release - This may take a while
*** NOTE: There are many partitions for Dfam, each containing curated and uncurated TE consensi split by taxonomy. As a minimum, you need to configure RepeatMasker with the root partition, but you can also include others. Note: There may be erroneous annotations in the uncurated Dfam database, use at your own risk***
For whole Dfam library run all of the below, or run each part as required (minimum the root (0) partition):
```
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.0.h5.gz
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.1.h5.gz
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.2.h5.gz
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.3.h5.gz
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.4.h5.gz
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.5.h5.gz
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.6.h5.gz
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.7.h5.gz
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.8.h5.gz
```
Unzip the Dfam release - This may take a while with no indication that anything
gitextract_2xvj2bh0/
├── .gitpod.yml
├── CITATION.cff
├── CODE_OF_CONDUCT.md
├── Docker/
│ ├── Dockerfile
│ ├── README.md
│ ├── configInstructions.txt
│ └── getFiles.sh
├── LICENSE
├── README.md
├── Singularity/
│ └── README.md
├── conda/
│ ├── build.sh
│ ├── developmentNotes.md
│ └── meta.yaml
├── earlGrey
├── earlGreyAnnotationOnly
├── earlGreyLibConstruct
├── legacyDocs.md
├── modules/
│ └── trf409.linux64
└── scripts/
├── LTR_FINDER_parallel
├── TEstrainer/
│ ├── README.md
│ ├── TEstrainer
│ ├── TEstrainer.Rproj
│ ├── TEstrainer_for_earlGrey.sh
│ ├── data/
│ │ ├── acceptable_domains.tsv
│ │ ├── additional_domains.tsv
│ │ ├── exceptional_domains.tsv
│ │ ├── old_acceptable_domains.tsv
│ │ └── unacceptable_domains.tsv
│ └── scripts/
│ ├── Dfam_extractor.py
│ ├── TEtrim.py
│ ├── domain_plotter.R
│ ├── final_sorter.R
│ ├── indexer.py
│ ├── initial_domain_finder.R
│ ├── initial_mafft_setup.py
│ ├── mreps_parser.sh
│ ├── simple_repeat_filter_trim.R
│ ├── splitter.py
│ ├── strainer.R
│ └── trf_parser.py
├── autoLand.R
├── autoPie.R
├── autoPie.sh
├── backSwap.py
├── backSwapGFF.py
├── bin/
│ ├── LTR_FINDER.x86_64-1.0.7/
│ │ ├── ltr_finder
│ │ └── tRNA.list.txt
│ └── cut.pl
├── divergenceCalc/
│ ├── divergence_calc.py
│ └── divergence_plot.R
├── extract_align.py
├── faswap.py
├── filteringOverlappingRepeats.R
├── headSwap.sh
├── install_r_packages.R
├── makeGff.R
├── mergeRepeats.R
├── rcMergeRepeats
├── rcMergeRepeatsLoose
├── repeatCraft/
│ ├── Dockerfile
│ ├── LICENSE
│ ├── README.md
│ ├── example/
│ │ ├── example.rclabel.gff
│ │ ├── example.rmerge.gff
│ │ ├── example.summary.txt
│ │ ├── example_input.gff
│ │ ├── example_input.out
│ │ ├── example_ltrfinder.gff
│ │ ├── mapfile.tsv
│ │ ├── repeatcraft.cfg
│ │ └── repeatcraft_strict.cfg
│ ├── helper/
│ │ ├── combineGFFoverlapm.py
│ │ ├── extraFuseTEm.py
│ │ ├── extraTrueMergeTEm.py
│ │ ├── filtershortm.py
│ │ ├── fuseltr.py
│ │ ├── fusetem.py
│ │ ├── rcStatm.py
│ │ ├── reformatm.py
│ │ ├── repeatcraftHelper.py
│ │ ├── truemergeltrm.py
│ │ ├── truemergeltrmErrorManagement.py
│ │ └── truemergetem.py
│ ├── repeatcraft.py
│ ├── repeatcraftErrorManagement.py
│ └── test/
│ ├── ci.sh
│ └── requirements.txt
├── repeatcraft.py
└── repeatcraftErrorManagement.py
SYMBOL INDEX (43 symbols across 17 files) FILE: scripts/TEstrainer/scripts/TEtrim.py function single_trim (line 34) | def single_trim(aln_in): function con_maker (line 43) | def con_maker(aln_in): FILE: scripts/TEstrainer/scripts/initial_mafft_setup.py function file_check (line 26) | def file_check(file_name, debug): function size_check (line 53) | def size_check(var_name, size): function df_to_fasta (line 60) | def df_to_fasta(df, path, name, overwrite): function blast_to_bed (line 81) | def blast_to_bed(df): FILE: scripts/TEstrainer/scripts/trf_parser.py function parse_args (line 5) | def parse_args(): function main (line 17) | def main(): FILE: scripts/divergenceCalc/divergence_calc.py function file_check (line 36) | def file_check(repeat_library, in_gff, genome, out_gff, temp_dir): function splitter (line 50) | def splitter(in_seq, temp_dir): function parse_gff (line 58) | def parse_gff(in_gff): function file_name_generator (line 71) | def file_name_generator(): function Kimura80 (line 77) | def Kimura80(qseq, sseq): function outer_func (line 108) | def outer_func(genome_path, temp_dir, timeoutSeconds, chunk_path): function tmp_out_parser (line 191) | def tmp_out_parser(file_list, simple_gff, other_gff): FILE: scripts/extract_align.py function get_args (line 19) | def get_args(): function CREATE_TE_OUTFILES (line 47) | def CREATE_TE_OUTFILES(LIBRARY): function EXTRACT_BLAST_HITS (line 56) | def EXTRACT_BLAST_HITS(GENOME, BLAST, LBUFFER, RBUFFER, HITNUM): function MUSCLE (line 88) | def MUSCLE(TOALIGN): function CONSENSUSGEN (line 94) | def CONSENSUSGEN(ALIGNED): function DIRS (line 106) | def DIRS(DIR): function main (line 112) | def main(): FILE: scripts/repeatCraft/helper/combineGFFoverlapm.py function combineGff (line 7) | def combineGff(ltrgff,column,outfile): FILE: scripts/repeatCraft/helper/extraFuseTEm.py function truefusete (line 7) | def truefusete(gffp,gapsize,outfile): FILE: scripts/repeatCraft/helper/extraTrueMergeTEm.py function extratruemergete (line 6) | def extratruemergete(gffp,outfile): FILE: scripts/repeatCraft/helper/filtershortm.py function filtershortTE (line 3) | def filtershortTE(rmgff,m,tesize,mapfile,outfile): FILE: scripts/repeatCraft/helper/fuseltr.py function fuseltr (line 7) | def fuseltr(rmgff,ltrgff_p,ltr_maxlength,ltr_flank,outfile): FILE: scripts/repeatCraft/helper/fusetem.py function fusete (line 7) | def fusete(gffp,outfile,gapsize=150): FILE: scripts/repeatCraft/helper/rcStatm.py function rcstat (line 4) | def rcstat(rclabelp,rmergep,outfile,ltrgroup=True): FILE: scripts/repeatCraft/helper/reformatm.py function reformat (line 4) | def reformat(rmgff,rmout,outfile): FILE: scripts/repeatCraft/helper/repeatcraftHelper.py function reformat (line 11) | def reformat(rmgff,rmout,outfile): function filtershortTE (line 49) | def filtershortTE(rmgff,m,tesize,mapfile,outfile): function fuseltr (line 119) | def fuseltr(rmgff,ltrgff_p,ltr_maxlength,ltr_flank,outfile): function fusete (line 198) | def fusete(gffp,outfile,gapsize=150): function combineGff (line 342) | def combineGff(ltrgff,column,outfile): function truefusete (line 383) | def truefusete(gffp,gapsize,outfile): function trumergeLTR (line 549) | def trumergeLTR(rmgff,outfile): function truemergete (line 661) | def truemergete(rmgff,outfile): function extratruemergete (line 793) | def extratruemergete(gffp,outfile): function rcstat (line 890) | def rcstat(rclabelp,rmergep,outfile,ltrgroup=True): FILE: scripts/repeatCraft/helper/truemergeltrm.py function trumergeLTR (line 4) | def trumergeLTR(rmgff,outfile): FILE: scripts/repeatCraft/helper/truemergeltrmErrorManagement.py function trumergeLTR (line 4) | def trumergeLTR(rmgff,outfile): FILE: scripts/repeatCraft/helper/truemergetem.py function truemergete (line 4) | def truemergete(rmgff,outfile):
Copy disabled (too large)
Download .json
Condensed preview — 89 files, each showing path, character count, and a content snippet. Download the .json file for the full structured content (13,191K chars).
[
{
"path": ".gitpod.yml",
"chars": 1487,
"preview": "image: gitpod/workspace-base\n\ntasks:\n- name: install mamba and earlgrey\n init: |\n curl -L -O \"https://github.com/con"
},
{
"path": "CITATION.cff",
"chars": 729,
"preview": "# This CITATION.cff file was generated with cffinit.\n# Visit https://bit.ly/cffinit to generate yours today!\n\ncff-versio"
},
{
"path": "CODE_OF_CONDUCT.md",
"chars": 5199,
"preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participa"
},
{
"path": "Docker/Dockerfile",
"chars": 7536,
"preview": "# Based on Dockerfile from Dfam TETools https://github.com/Dfam-consortium/TETools, with added database configuration, c"
},
{
"path": "Docker/README.md",
"chars": 1401,
"preview": "# Docker Container \n\nA Docker container has been generated with none of Dfam 3.9, but with script generation to source r"
},
{
"path": "Docker/configInstructions.txt",
"chars": 1462,
"preview": "# Get Earl Grey to run in Docker\n\n# Download dependencies\n./getFiles.sh\n\n# NOTE: please make sure that Dfam.h5.gz and Re"
},
{
"path": "Docker/getFiles.sh",
"chars": 1961,
"preview": "#!/bin/sh\n\nset -eu\n\n# download src name\ndownload() {\n src=\"$1\"\n shift\n\n if [ $# -ge 1 ]; then\n name=\"$1\"\n else\n "
},
{
"path": "LICENSE",
"chars": 9860,
"preview": "Open Software License v. 2.1\n============================\nThis Open Software License (the \"License\") applies to any orig"
},
{
"path": "README.md",
"chars": 40620,
"preview": "\n\n"
},
{
"path": "Singularity/README.md",
"chars": 2752,
"preview": "# Singularity image for Earl Grey\n\nA Singularity image can be built from the Docker image hosted on Docker Hub. The imag"
},
{
"path": "conda/build.sh",
"chars": 3660,
"preview": "#!/bin/bash\n#Based on https://github.com/TobyBaril/EarlGrey/blob/main/configure\nset -x\n\n# Define paths\nPACKAGE_HOME=${PR"
},
{
"path": "conda/developmentNotes.md",
"chars": 33280,
"preview": "# Updating Earl Grey to remove Python 3.9 dependency\n\nBjorn wants me to update Earl Grey to remove the Python 3.9 depend"
},
{
"path": "conda/meta.yaml",
"chars": 1697,
"preview": "{% set name = \"EarlGrey\" %}\n{% set version = \"7.2.4\" %}\n\npackage:\n name: {{ name|lower }}\n version: {{ version }}\n\nsou"
},
{
"path": "earlGrey",
"chars": 33955,
"preview": "#!/bin/bash\nset -eo pipefail\n\nusage()\n{\n\techo \"\t#############################\n\tearlGrey version 7.2.4\n\tRequired Paramete"
},
{
"path": "earlGreyAnnotationOnly",
"chars": 22881,
"preview": "#!/bin/bash\nset -eo pipefail\n\nusage()\n{\n\techo \"\t#############################\n\tearlGrey version 7.2.4 (AnnotationOnly)\n\t"
},
{
"path": "earlGreyLibConstruct",
"chars": 22073,
"preview": "#!/bin/bash\nset -eo pipefail\n\nusage()\n{\n echo \" #############################\n earlGrey version 7.2.4 (Li"
},
{
"path": "legacyDocs.md",
"chars": 23090,
"preview": "# This file contains installation instructions for legacy versions of Earl Grey\n\n## We recommend installing via conda or"
},
{
"path": "scripts/LTR_FINDER_parallel",
"chars": 8415,
"preview": "#!/usr/bin/perl -w\nuse strict;\nuse threads;\nuse Thread::Queue;\nuse File::Basename;\nuse FindBin;\n\n# customized parameters"
},
{
"path": "scripts/TEstrainer/README.md",
"chars": 2836,
"preview": "# TEstrainer\nA pipeline to dramatically improve repeat libraries before genome annotation through curation of each repea"
},
{
"path": "scripts/TEstrainer/TEstrainer",
"chars": 10633,
"preview": "#!/bin/bash\n\nusage() { echo \"Usage: [-l Repeat library] [-g Genome ] [-t Threads (default 4) ] [-f Flank (default 1000) "
},
{
"path": "scripts/TEstrainer/TEstrainer.Rproj",
"chars": 205,
"preview": "Version: 1.0\n\nRestoreWorkspace: Default\nSaveWorkspace: Default\nAlwaysSaveHistory: Default\n\nEnableCodeIndexing: Yes\nUseSp"
},
{
"path": "scripts/TEstrainer/TEstrainer_for_earlGrey.sh",
"chars": 10396,
"preview": "#!/bin/bash\n\nusage() { echo \"Usage: [-l Repeat library] [-g Genome ] [-t Threads (default 4) ] [-f Flank (default 1000) "
},
{
"path": "scripts/TEstrainer/data/acceptable_domains.tsv",
"chars": 11309,
"preview": "ref\tabbrev\npfam17917\tRT_RNaseH\ncd09274\tRNase_HI_RT_Ty3\ncd01647\tRT_LTR\npfam17919\tRT_RNaseH_2\npfam00078\tRVT_1\npfam17921\tIn"
},
{
"path": "scripts/TEstrainer/data/additional_domains.tsv",
"chars": 3452,
"preview": "class\tref\tabbrev\tclade\nPao\tNF033838\tPspC_subgroup_1\tAmphibia\nPao\tPRK14475\tPRK14475\tAmphibia\nPao\tTIGR02499\ttranslocation_"
},
{
"path": "scripts/TEstrainer/data/exceptional_domains.tsv",
"chars": 1054,
"preview": "class\tref\tabbrev\tclade\nERV\tcd00096\tIg\tVertebrata\nERV\tcd05717\tIgV_1_Necl_like\tVertebrata\nERV\tcd05718\tIgV_1_PVR_like\tVerte"
},
{
"path": "scripts/TEstrainer/data/old_acceptable_domains.tsv",
"chars": 8285,
"preview": "ref\tabbrev\ncd09274\tRNase_HI_RT_Ty3\ncd01647\tRT_LTR\npfam17917\tRT_RNaseH\npfam17919\tRT_RNaseH_2\npfam00078\tRVT_1\npfam17921\tIn"
},
{
"path": "scripts/TEstrainer/data/unacceptable_domains.tsv",
"chars": 6112,
"preview": "ref\tabbrev\torganism\ncd20663\tCYP2D\tCeratotherium_simum_cottoni\ncd11026\tCYP2\tCeratotherium_simum_cottoni\npfam00067\tp450\tCe"
},
{
"path": "scripts/TEstrainer/scripts/Dfam_extractor.py",
"chars": 1706,
"preview": "#!/usr/bin/env python\n\nimport os\nfrom re import sub\nfrom os.path import exists\nimport sys\nimport argparse\nfrom Bio impor"
},
{
"path": "scripts/TEstrainer/scripts/TEtrim.py",
"chars": 8552,
"preview": "import os\nimport sys\nimport string\nimport statistics\nimport argparse\nimport re\nimport Bio\nfrom Bio import SeqIO, AlignIO"
},
{
"path": "scripts/TEstrainer/scripts/domain_plotter.R",
"chars": 2637,
"preview": "library(tidyverse)\n\nin_seq <- \"seq/Chiloscyllium_punctatum-families.fa\"\ndata_dir <- \"TS_Chiloscyllium_punctatum-families"
},
{
"path": "scripts/TEstrainer/scripts/final_sorter.R",
"chars": 1123,
"preview": "#!/usr/bin/Rscript\n\nlibrary(optparse)\n\noption_list <- list(\n make_option(c(\"-i\", \"--in_seq\"), default=NA, type = \"chara"
},
{
"path": "scripts/TEstrainer/scripts/indexer.py",
"chars": 263,
"preview": "#!/usr/bin/env python\n\nimport argparse\nfrom pyfaidx import Faidx\n\nparser = argparse.ArgumentParser()\nparser.add_argument"
},
{
"path": "scripts/TEstrainer/scripts/initial_domain_finder.R",
"chars": 6472,
"preview": "#!/usr/bin/Rscript\n\nlibrary(ORFik)\nlibrary(tidyverse)\nlibrary(plyranges)\nlibrary(BSgenome)\n\n# read in sequences\nrepbase_"
},
{
"path": "scripts/TEstrainer/scripts/initial_mafft_setup.py",
"chars": 12466,
"preview": "#!/usr/bin/env python\n\nimport argparse\nimport sys\nimport os\nfrom os.path import exists\n\nparser = argparse.ArgumentParser"
},
{
"path": "scripts/TEstrainer/scripts/mreps_parser.sh",
"chars": 820,
"preview": "#!/bin/bash\n# run mreps and parse to usable format\n\nusage() { echo \"Usage: [-i input sequence (fasta file)] [-h Print th"
},
{
"path": "scripts/TEstrainer/scripts/simple_repeat_filter_trim.R",
"chars": 11481,
"preview": "library(optparse)\n\noption_list <- list(\n make_option(c(\"-i\", \"--in_seq\"), default=NA, type = \"character\", help=\"Input s"
},
{
"path": "scripts/TEstrainer/scripts/splitter.py",
"chars": 1075,
"preview": "#!/usr/bin/env python\n\nimport os\nimport re\nfrom os.path import exists\nimport sys\nimport argparse\nfrom Bio import SeqIO\n\n"
},
{
"path": "scripts/TEstrainer/scripts/strainer.R",
"chars": 8556,
"preview": "#!/usr/bin/Rscript\n\nlibrary(optparse)\n\noption_list <- list(\n make_option(c(\"-i\", \"--in_seq\"), default=NA, type = \"chara"
},
{
"path": "scripts/TEstrainer/scripts/trf_parser.py",
"chars": 1366,
"preview": "#!/usr/bin/env python\n\nfrom argparse import (ArgumentParser, FileType)\n\ndef parse_args():\n \"Parse the input arguments"
},
{
"path": "scripts/autoLand.R",
"chars": 4166,
"preview": "# load libraries\n\nlibrary(tidyverse)\nlibrary(data.table)\nlibrary(magrittr)\n\n# set options\n\noptions(scipen = 100, strings"
},
{
"path": "scripts/autoPie.R",
"chars": 9322,
"preview": "# load libraries\n\nsuppressPackageStartupMessages(library(tidyverse))\nsuppressPackageStartupMessages(library(data.table))"
},
{
"path": "scripts/autoPie.sh",
"chars": 675,
"preview": "#!/bin/bash\n\n\nusage()\n{\n\techo \"autoPie.sh -i mergedRepeat.bed -t finalRepeatmasker.tbl -p plotName.pdf -o summaryTable.t"
},
{
"path": "scripts/backSwap.py",
"chars": 539,
"preview": "import pandas as pd\nimport sys\nimport csv\n\nargs = sys.argv\n\ninput = args[1]\ndictionary = args[2]\noutput = args[3]\n\norig "
},
{
"path": "scripts/backSwapGFF.py",
"chars": 557,
"preview": "import pandas as pd\nimport sys\nimport csv\n\nargs = sys.argv\n\ninput = args[1]\ndictionary = args[2]\noutput = args[3]\n\norig "
},
{
"path": "scripts/bin/LTR_FINDER.x86_64-1.0.7/tRNA.list.txt",
"chars": 12962,
"preview": ">Eukarya\nArabidopsis thaliana\tAthal-tRNAs.fa\nCaenorhabditis briggsae (cb25.agp8 July 2002)\tCbrig-tRNAs.fa\nCaenorhabditis"
},
{
"path": "scripts/bin/cut.pl",
"chars": 1427,
"preview": "#!usr/bin/perl -w\nuse strict;\n#usage: perl cut.pl seq [-s|-S|-l int]\n\nmy $length=5000000; #length of a sequence\nmy $sepa"
},
{
"path": "scripts/divergenceCalc/divergence_calc.py",
"chars": 13358,
"preview": "import os\nfrom os.path import exists, getsize\nimport sys\nimport argparse\nimport numpy as np\nimport pandas as pd\nimport m"
},
{
"path": "scripts/divergenceCalc/divergence_plot.R",
"chars": 13632,
"preview": "\nlibrary(optparse)\n\noption_list <- list(\n make_option(c(\"-s\", \"--species_name\"), default=NA, type = \"character\",\n "
},
{
"path": "scripts/extract_align.py",
"chars": 10235,
"preview": "import argparse\nimport shutil\nimport re\nimport os\nimport subprocess\nfrom Bio import SeqIO\nimport pandas as pd\nimport num"
},
{
"path": "scripts/faswap.py",
"chars": 285,
"preview": "#!/usr/bin/env python3\n\nimport sys\nimport csv\n\nargs = sys.argv\n\ndictionary = args[1]\nfastaIn = args[2]\n\na=dict(csv.reade"
},
{
"path": "scripts/filteringOverlappingRepeats.R",
"chars": 2519,
"preview": "# load libraries\nsuppressPackageStartupMessages(library(GenomicRanges))\nsuppressPackageStartupMessages(library(ape))\nsup"
},
{
"path": "scripts/headSwap.sh",
"chars": 505,
"preview": "#!/bin/bash\n\nwhile getopts i:f:o: option\ndo\n case \"${option}\" \n in\n i) input=${OPTARG};;\n o) out"
},
{
"path": "scripts/install_r_packages.R",
"chars": 197,
"preview": "# Install R Packages\r\n\r\ninstall.packages(c(\"BiocManager\",\"optparse\", \"ape\", \"optparse\"), repos=\"https://www.stats.bris.a"
},
{
"path": "scripts/makeGff.R",
"chars": 1653,
"preview": "suppressPackageStartupMessages(library(tidyverse))\nsuppressPackageStartupMessages(library(data.table))\nsuppressPackageSt"
},
{
"path": "scripts/mergeRepeats.R",
"chars": 14364,
"preview": "suppressPackageStartupMessages(library(tidyverse))\nsuppressPackageStartupMessages(library(plyr))\nsuppressPackageStartupM"
},
{
"path": "scripts/rcMergeRepeats",
"chars": 3845,
"preview": "#!/bin/bash \n\nusage()\n{\n echo \"Usage: rcMergeRepeats -f genome.fasta -s species -d output_directory -u RepeatMasker.o"
},
{
"path": "scripts/rcMergeRepeatsLoose",
"chars": 3985,
"preview": "#!/bin/bash \n\nusage()\n{\n echo \"Usage: rcMergeRepeats -f genome.fasta -s species -d output_directory -u RepeatMasker.o"
},
{
"path": "scripts/repeatCraft/Dockerfile",
"chars": 208,
"preview": "FROM python:3.6-alpine3.6\n\nRUN apk add --no-cache bash\n\nRUN mkdir /repeatcraft\nCOPY ./repeatcraft.py /repeatcraft\nRUN mk"
},
{
"path": "scripts/repeatCraft/LICENSE",
"chars": 1069,
"preview": "MIT License\n\nCopyright (c) 2018 Wai Yee Wong\n\nPermission is hereby granted, free of charge, to any person obtaining a co"
},
{
"path": "scripts/repeatCraft/README.md",
"chars": 7800,
"preview": "[](https://travis-ci.org/niccw/repeatcraftp)\n"
},
{
"path": "scripts/repeatCraft/example/example.rclabel.gff",
"chars": 608380,
"preview": "##gff-version 3\n##repeatcraft\nscaffold_1\tRepeatMasker\tDNA/Academ\t19\t162\t26.6\t-\t.\tTstart=2688;Tend=2832;ID=rnd-5_family-1"
},
{
"path": "scripts/repeatCraft/example/example.rmerge.gff",
"chars": 589654,
"preview": "##gff-version 3\nscaffold_1\tRepeatMasker\tDNA/Academ\t19\t162\t26.6\t-\t.\tTstart=2688;Tend=2832;ID=rnd-5_family-1942;shortTE=F\n"
},
{
"path": "scripts/repeatCraft/example/example.summary.txt",
"chars": 1723,
"preview": "#1. Number of repeats (by class) before and after merge\n=============================================================\nre"
},
{
"path": "scripts/repeatCraft/example/example_input.gff",
"chars": 545339,
"preview": "##gff-version 2\n##date 2018-07-22\n##sequence-region example.fasta\nscaffold_1\tRepeatMasker\tsimilarity\t19\t162\t26.6\t-\t.\tTar"
},
{
"path": "scripts/repeatCraft/example/example_input.out",
"chars": 771846,
"preview": " SW perc perc perc query position in query matching repeat position in r"
},
{
"path": "scripts/repeatCraft/example/example_ltrfinder.gff",
"chars": 8874260,
"preview": "scaffold_1\tltr_finder\tLTR_retrotransposon\t128032\t129833\t6\t-\t.\tltr_finder_1\nscaffold_1\tltr_finder\tfive_prime_LTR\t128032\t1"
},
{
"path": "scripts/repeatCraft/example/mapfile.tsv",
"chars": 111,
"preview": "Unknown\t0\nSINE\t50\nLINE\t200\nLTR\t200\nDNA\t300\nRC\t100\nrRNA\t30\nSimple_repeat\t0\nLow_complexity\t0\nSatellite\t0\nsnRNA\t0\n"
},
{
"path": "scripts/repeatCraft/example/repeatcraft.cfg",
"chars": 204,
"preview": "# Label short TEs\nshortTE_size: 100\nmapfile: None\n\n# LTR grouping (based on LTR_FINER result)\nltr_finder_gff: sample_ltr"
},
{
"path": "scripts/repeatCraft/example/repeatcraft_strict.cfg",
"chars": 188,
"preview": "# Label short TEs\nshortTE_size: 100\nmapfile: None\n\n# LTR grouping (based on LTR_FINER result)\nltr_finder_gff: None\nmax_L"
},
{
"path": "scripts/repeatCraft/helper/combineGFFoverlapm.py",
"chars": 1406,
"preview": "import sys\n\n# the gff need to be sorted by contig and start position\n# `sort -k1,1 -k4,4n -k5,5n .gff` before running th"
},
{
"path": "scripts/repeatCraft/helper/extraFuseTEm.py",
"chars": 7263,
"preview": "import subprocess\nimport sys\n\nfrom collections import defaultdict\nnested_dict = lambda: defaultdict(nested_dict)\n\ndef tr"
},
{
"path": "scripts/repeatCraft/helper/extraTrueMergeTEm.py",
"chars": 3379,
"preview": "import sys\nimport re\nfrom collections import defaultdict\nnested_dict = lambda: defaultdict(nested_dict)\n\ndef extratrueme"
},
{
"path": "scripts/repeatCraft/helper/filtershortm.py",
"chars": 1534,
"preview": "import sys\n\ndef filtershortTE(rmgff,m,tesize,mapfile,outfile):\n\n\tgff = rmgff\n\n\t# print track\n\tstdout = sys.stdout\n\tsys.s"
},
{
"path": "scripts/repeatCraft/helper/fuseltr.py",
"chars": 2285,
"preview": "import re\nimport sys\nimport os\nfrom collections import defaultdict\n\n\ndef fuseltr(rmgff,ltrgff_p,ltr_maxlength,ltr_flank,"
},
{
"path": "scripts/repeatCraft/helper/fusetem.py",
"chars": 4385,
"preview": "import sys\nimport subprocess\nfrom collections import defaultdict\n\nnested_dict = lambda: defaultdict(nested_dict)\n\ndef fu"
},
{
"path": "scripts/repeatCraft/helper/rcStatm.py",
"chars": 2680,
"preview": "import sys\nimport re\n\ndef rcstat(rclabelp,rmergep,outfile,ltrgroup=True):\n\n\trlabel = rclabelp\n\trmerge = rmergep\n\n\t# pri"
},
{
"path": "scripts/repeatCraft/helper/reformatm.py",
"chars": 1124,
"preview": "import sys\nimport re\n\ndef reformat(rmgff,rmout,outfile):\n\n\t# print track\n\tstdout = sys.stdout\n\tsys.stdout = open(outfile"
},
{
"path": "scripts/repeatCraft/helper/repeatcraftHelper.py",
"chars": 28761,
"preview": "#!/usr/bin/env python3\n\nimport sys\nimport subprocess\nimport re\nimport os\n\nfrom collections import defaultdict\nnested_dic"
},
{
"path": "scripts/repeatCraft/helper/truemergeltrm.py",
"chars": 2926,
"preview": "import sys\nimport re\n\ndef trumergeLTR(rmgff,outfile):\n\n\tgff = rmgff\n\n\t# print track\n\tstdout = sys.stdout\n\tsys.stdout = o"
},
{
"path": "scripts/repeatCraft/helper/truemergeltrmErrorManagement.py",
"chars": 2689,
"preview": "import sys\nimport re\n\ndef trumergeLTR(rmgff,outfile):\n\n\tgff = rmgff\n\n\t# print track\n\tstdout = sys.stdout\n\tsys.stdout = o"
},
{
"path": "scripts/repeatCraft/helper/truemergetem.py",
"chars": 3426,
"preview": "import sys\nimport re\n\ndef truemergete(rmgff,outfile):\n\n\tgff = rmgff\n\n\t# print track\n\tstdout = sys.stdout\n\tsys.stdout = o"
},
{
"path": "scripts/repeatCraft/repeatcraft.py",
"chars": 7943,
"preview": "#!/usr/bin/env python3\n\nimport argparse\nimport sys\nimport re\nimport os\nimport subprocess\nfrom subprocess import call\n\nsy"
},
{
"path": "scripts/repeatCraft/repeatcraftErrorManagement.py",
"chars": 7958,
"preview": "#!/usr/bin/env python3\n\nimport argparse\nimport sys\nimport re\nimport os\nimport subprocess\nfrom subprocess import call\n\nsy"
},
{
"path": "scripts/repeatCraft/test/ci.sh",
"chars": 316,
"preview": "#!/bin/bash\n#strict mode\npython repeatcraft.py -r example/example_input.gff -u example/example_input.out -c example/repe"
},
{
"path": "scripts/repeatCraft/test/requirements.txt",
"chars": 9,
"preview": "argparse\n"
},
{
"path": "scripts/repeatcraft.py",
"chars": 7943,
"preview": "#!/usr/bin/env python3\n\nimport argparse\nimport sys\nimport re\nimport os\nimport subprocess\nfrom subprocess import call\n\nsy"
},
{
"path": "scripts/repeatcraftErrorManagement.py",
"chars": 7958,
"preview": "#!/usr/bin/env python3\n\nimport argparse\nimport sys\nimport re\nimport os\nimport subprocess\nfrom subprocess import call\n\nsy"
}
]
// ... and 2 more files (download for full content)
About this extraction
This page contains the full source code of the TobyBaril/EarlGrey GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 89 files (11.4 MB), approximately 3.0M tokens, and a symbol index with 43 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.