Repository: SunPengChuan/wgdi
Branch: master
Commit: 00375818da64
Files: 115
Total size: 311.6 KB
Directory structure:
gitextract_p42u6yxa/
├── LICENSE
├── README.md
├── __init__.py
├── build/
│ └── lib/
│ └── wgdi/
│ ├── __init__.py
│ ├── align_dotplot.py
│ ├── ancestral_karyotype.py
│ ├── ancestral_karyotype_repertoire.py
│ ├── base.py
│ ├── block_correspondence.py
│ ├── block_info.py
│ ├── block_ks.py
│ ├── circos.py
│ ├── collinearity.py
│ ├── dotplot.py
│ ├── example/
│ │ ├── __init__.py
│ │ ├── align.conf
│ │ ├── alignmenttrees.conf
│ │ ├── ancestral_karyotype.conf
│ │ ├── ancestral_karyotype_repertoire.conf
│ │ ├── blockinfo.conf
│ │ ├── blockks.conf
│ │ ├── circos.conf
│ │ ├── collinearity.conf
│ │ ├── conf.ini
│ │ ├── corr.conf
│ │ ├── dotplot.conf
│ │ ├── fusion_positions_database.conf
│ │ ├── fusions_detection.conf
│ │ ├── karyotype.conf
│ │ ├── karyotype_mapping.conf
│ │ ├── ks.conf
│ │ ├── ks_fit_result.csv
│ │ ├── ksfigure.conf
│ │ ├── kspeaks.conf
│ │ ├── peaksfit.conf
│ │ ├── pindex.conf
│ │ ├── polyploidy_classification.conf
│ │ ├── retain.conf
│ │ └── shared_fusion.conf
│ ├── fusion_positions_database.py
│ ├── fusions_detection.py
│ ├── karyotype.py
│ ├── karyotype_mapping.py
│ ├── ks.py
│ ├── ks_peaks.py
│ ├── ksfigure.py
│ ├── peaksfit.py
│ ├── pindex.py
│ ├── polyploidy_classification.py
│ ├── retain.py
│ ├── run.py
│ ├── run_colliearity.py
│ ├── shared_fusion.py
│ └── trees.py
├── command.txt
├── dist/
│ └── wgdi-0.75-py3-none-any.whl
├── setup.py
├── wgdi/
│ ├── __init__.py
│ ├── align_dotplot.py
│ ├── ancestral_karyotype.py
│ ├── ancestral_karyotype_repertoire.py
│ ├── base.py
│ ├── block_correspondence.py
│ ├── block_info.py
│ ├── block_ks.py
│ ├── circos.py
│ ├── collinearity.py
│ ├── dotplot.py
│ ├── example/
│ │ ├── __init__.py
│ │ ├── align.conf
│ │ ├── alignmenttrees.conf
│ │ ├── ancestral_karyotype.conf
│ │ ├── ancestral_karyotype_repertoire.conf
│ │ ├── blockinfo.conf
│ │ ├── blockks.conf
│ │ ├── circos.conf
│ │ ├── collinearity.conf
│ │ ├── conf.ini
│ │ ├── corr.conf
│ │ ├── dotplot.conf
│ │ ├── fusion_positions_database.conf
│ │ ├── fusions_detection.conf
│ │ ├── karyotype.conf
│ │ ├── karyotype_mapping.conf
│ │ ├── ks.conf
│ │ ├── ks_fit_result.csv
│ │ ├── ksfigure.conf
│ │ ├── kspeaks.conf
│ │ ├── peaksfit.conf
│ │ ├── pindex.conf
│ │ ├── polyploidy_classification.conf
│ │ ├── retain.conf
│ │ └── shared_fusion.conf
│ ├── fusion_positions_database.py
│ ├── fusions_detection.py
│ ├── karyotype.py
│ ├── karyotype_mapping.py
│ ├── ks.py
│ ├── ks_peaks.py
│ ├── ksfigure.py
│ ├── peaksfit.py
│ ├── pindex.py
│ ├── polyploidy_classification.py
│ ├── retain.py
│ ├── run.py
│ ├── run_colliearity.py
│ ├── shared_fusion.py
│ └── trees.py
└── wgdi.egg-info/
├── PKG-INFO
├── SOURCES.txt
├── dependency_links.txt
├── entry_points.txt
├── requires.txt
├── top_level.txt
└── zip-safe
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
Copyright (c) 2018-2018, Pengchuan Sun
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list
of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: README.md
================================================
# WGDI
[Downloads](https://pepy.tech/project/wgdi) [Bioconda](http://bioconda.github.io/recipes/wgdi/README.html)
| | |
| --- | --- |
| Author | Pengchuan Sun ([sunpengchuan](https://github.com/sunpengchuan)) |
| Email | <sunpengchuan@gmail.com> |
| License | [BSD](http://creativecommons.org/licenses/BSD/) |
## Description
**WGDI (Whole-Genome Duplication Integrated analysis)** is a Python-based command-line tool designed to simplify the analysis of whole-genome duplications (WGD) and cross-species genome alignments. It offers three main workflows that enhance the detection and study of WGD events:
## Key Features
### 1. Polyploid Inference
- Identifies and confirms polyploid events with high accuracy.
### 2. Genomic Homology Inference
- Traces the evolutionary history of duplicated regions across species, with a focus on distinguishing subgenomes.
### 3. Ancestral Karyotyping
- Reconstructs protochromosomes and traces common chromosomal rearrangements to understand chromosome evolution.
## Installation
WGDI is a Python package and command-line interface (CLI) for the analysis of whole-genome duplications. It runs on Windows, Linux, and macOS and can be installed via pip or conda.
#### Bioconda
```
conda install -c bioconda wgdi
```
#### PyPI
```
pip3 install wgdi
```
Installation documentation, a user tutorial, a default parameter file, and test data are provided. Please consult the docs at <http://wgdi.readthedocs.io/en/latest/>.
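After installing with pip or conda, one way to confirm that the package is visible to the current Python environment is sketched below. It relies only on the standard-library `importlib.metadata` module; the helper name `installed_version` is hypothetical and is not part of WGDI.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "wgdi") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("wgdi is not installed in this environment")
    else:
        print(f"wgdi {version} is installed")
```

If the helper reports that the package is missing, the likely cause is that pip or conda installed it into a different environment than the one running the script.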
## Tips
Here are some videos with simple examples of WGDI.
###### [A simple guide to using WGDI (Part 1)](https://www.bilibili.com/video/BV1qK4y1U7eK) or https://youtu.be/k-S6FVcBIQw
###### [A simple guide to using WGDI (Part 2)](https://www.bilibili.com/video/BV195411P7L1) or https://youtu.be/QiZYFYGclyE
QQ chat group: 966612552
## Citing WGDI
If you use wgdi in your work, please cite:
> Sun P., Jiao B., Yang Y., Shan L., Li T., Li X., Xi Z., Wang X., and Liu J. (2022). WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant. doi: https://doi.org/10.1016/j.molp.2022.10.018.
## News
## 0.75
* Fixed some issues (-fpd, -km).
* Introduced a threads parameter for the iqtree command within alignmenttrees (-at).
## 0.74
* Improved the fusion positions dataset (-fpd).
* Fixed some issues (-pc).
## 0.7.1
* Added extraction of the fusion positions dataset (-fpd).
* Added detection of whether these fusion events occur in other genomes (-fd).
* Improved the karyotype_mapping (-km) effect.
* Fixed a problem caused by the Python version; WGDI is now compatible with version 3.12.
## 0.6.5
* Fixed some issues (-sf).
* Added new tips to avoid some errors.
## 0.6.4
* Fixed a problem caused by the Python version; WGDI is now compatible with version 3.11.3.
## 0.6.3
* Fixed some issues (-ks, -sf).
## 0.6.2
* Added detection of shared fusions between species (-sf).
## 0.6.1
* Fixed issue with alignment (-a). Only version 0.6.0 has this bug.
## 0.6.0
* Fixed issue with improved collinearity (-icl).
* Added a parameter 'tandem_ratio' to blockinfo (-bi).
## 0.5.9
* Updated the improved collinearity (-icl). It is faster than before, but still slower than MCScanX and JCVI.
* Fixed issue with ancestral karyotype repertoire (-akr).
## 0.5.8
* Fixed issue with gene names (-ks).
## 0.5.7
- Fixed issue with chromosome order (-ak).
- Fixed issue with gene names (-ks). The issue is not fully fixed in this version; please install the latest version.
## 0.5.5 and 0.5.6
* Added ancestral karyotype (-ak).
* Added ancestral karyotype repertoire (-akr).
## 0.5.4
* Improved the karyotype_mapping (-km) effect.
* Minor change (-at).
## 0.5.3
* Fixed a legend issue (-kf).
* Fixed a Ks calculation issue (-ks).
* Improved the karyotype_mapping (-km) effect.
* Improved the alignmenttrees (-at) effect.
## 0.5.2
* Fixed some bugs.
## 0.5.1
* Fixed the error of the command (-conf).
* Improved the karyotype_mapping (-km) effect.
* Added support for an additional input data set in alignmenttrees (-at): low-copy gene groups (for example, single-copy_groups.tsv from the sonicparanoid2 software).
## 0.4.9
* Added karyotype_mapping (-km) and karyotype (-k) display.
* Changed how the p-value is extracted from collinearity (-icl), making this parameter more sensitive. It is therefore recommended to set it to 0.2 instead of 0.05.
* Changed the rendering of ksfigure (-kf) to improve its appearance.
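
The effect of the p-value recommendation in the 0.4.9 notes can be sketched with a small filter. This is a minimal illustration, not WGDI's implementation: the function name `filter_blocks` and the sample records are hypothetical, while the `length` and `pvalue` field names and the defaults (`block_length = 5`, `pvalue = 0.2`) follow `block_correspondence.py` in this repository.

```python
def filter_blocks(blocks, block_length=5, pvalue=0.2):
    """Keep blocks with at least `block_length` gene pairs and a p-value <= `pvalue`."""
    return [
        b for b in blocks
        if b["length"] >= block_length and b["pvalue"] <= pvalue
    ]


# Made-up block records, for illustration only.
blocks = [
    {"id": 1, "length": 12, "pvalue": 0.01},
    {"id": 2, "length": 8, "pvalue": 0.12},
    {"id": 3, "length": 4, "pvalue": 0.03},
    {"id": 4, "length": 20, "pvalue": 0.35},
]

strict = [b["id"] for b in filter_blocks(blocks, pvalue=0.05)]
relaxed = [b["id"] for b in filter_blocks(blocks, pvalue=0.2)]
print(strict)   # [1]
print(relaxed)  # [1, 2]
```

With the stricter 0.05 cutoff, block 2 is discarded even though it is long enough; raising the cutoff to 0.2 retains it, which is the kind of difference the note about the more sensitive p-value refers to.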
================================================
FILE: __init__.py
================================================
================================================
FILE: build/lib/wgdi/__init__.py
================================================
================================================
FILE: build/lib/wgdi/align_dotplot.py
================================================
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base


class align_dotplot:
    def __init__(self, options):
        # Default values
        self.position = 'order'
        self.figsize = 'default'
        self.classid = 'class1'
        # Initialize from options
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{k} = {v}')
        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
        self.colors = [str(k) for k in getattr(self, 'colors', 'red,blue,green,black,orange').split(',')]
        self.ancestor_top = None if getattr(self, 'ancestor_top', 'none') == 'none' else self.ancestor_top
        self.ancestor_left = None if getattr(self, 'ancestor_left', 'none') == 'none' else self.ancestor_left
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)

    def pair_position(self, alignment, loc1, loc2, colors):
        alignment.index = alignment.index.map(loc1)
        data = []
        for i, k in enumerate(alignment.columns):
            df = alignment[k].map(loc2).dropna()
            for idx, row in df.items():
                data.append([idx, row, colors[i]])
        return pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])

    def run(self):
        axis = [0, 1, 1, 0]
        # Lens generation and figure size
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10
        plt.rcParams['ytick.major.pad'] = 0
        # Create plot
        fig, ax = plt.subplots(figsize=self.figsize)
        ax.xaxis.set_ticks_position('top')
        step1, step2 = 1 / float(lens1.sum()), 1 / float(lens2.sum())
        # Process Ancestor Data
        if self.ancestor_left:
            axis[0] = -0.02
            lens_ancestor_left = self.process_ancestor(self.ancestor_left, lens1.index)
        if self.ancestor_top:
            axis[3] = -0.02
            lens_ancestor_top = self.process_ancestor(self.ancestor_top, lens2.index)
        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,
                           self.genome1_name, self.genome2_name, [0, 1])
        # Process GFF files
        gff1, gff2 = base.newgff(self.gff1), base.newgff(self.gff2)
        gff1 = base.gene_location(gff1, lens1, step1, self.position)
        gff2 = base.gene_location(gff2, lens2, step2, self.position)
        if self.ancestor_top:
            self.ancestor_position(ax, gff2, lens_ancestor_top, 'top')
        if self.ancestor_left:
            self.ancestor_position(ax, gff1, lens_ancestor_left, 'left')
        # Process block info and alignment
        bkinfo = self.process_blockinfo(lens1, lens2)
        align = self.alignment(gff1, gff2, bkinfo)
        alignment = align[gff1.columns[-len(bkinfo[self.classid].drop_duplicates()):]]
        alignment.to_csv(self.savefile, header=False)
        # Create scatter plot
        df = self.pair_position(alignment, gff1['loc'], gff2['loc'], self.colors)
        plt.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'],
                    alpha=0.5, edgecolors=None, linewidths=0, marker='o')
        ax.axis(axis)
        plt.subplots_adjust(left=0.07, right=0.97, top=0.93, bottom=0.03)
        plt.savefig(self.savefig, dpi=500)
        plt.show()

    def process_ancestor(self, ancestor_file, lens_index):
        df = pd.read_csv(ancestor_file, sep="\t", header=None)
        df[0] = df[0].astype(str)
        df[3] = df[3].astype(str)
        df[4] = df[4].astype(int)
        df[4] = df[4] / df[4].max()
        return df[df[0].isin(lens_index)]

    def process_blockinfo(self, lens1, lens2):
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        if self.blockinfo_reverse == True:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo[self.classid] = bkinfo[self.classid].astype(str)
        return bkinfo[bkinfo['chr1'].isin(lens1.index) & (bkinfo['chr2'].isin(lens2.index))]

    def alignment(self, gff1, gff2, bkinfo):
        gff1['uid'] = gff1['chr'] + 'g' + gff1['order'].astype(str)
        gff2['uid'] = gff2['chr'] + 'g' + gff2['order'].astype(str)
        gff1['id'] = gff1.index
        gff2['id'] = gff2.index
        for cl, group in bkinfo.groupby(self.classid):
            name = f'l{cl}'
            gff1[name] = ''
            group = group.sort_values(by=['length'], ascending=True)
            for _, row in group.iterrows():
                block = self.create_block_dataframe(row)
                if block.empty:
                    continue
                block1_min, block1_max = block['block1'].agg(['min', 'max'])
                area = gff1[(gff1['chr'] == row['chr1']) &
                            (gff1['order'] >= block1_min) &
                            (gff1['order'] <= block1_max)].index
                block['id1'] = (row['chr1'] + 'g' + block['block1'].astype(str)).map(
                    dict(zip(gff1['uid'], gff1.index)))
                block['id2'] = (row['chr2'] + 'g' + block['block2'].astype(str)).map(
                    dict(zip(gff2['uid'], gff2.index)))
                gff1.loc[block['id1'].values, name] = block['id2'].values
                gff1.loc[gff1.index.isin(area) & gff1[name].eq(''), name] = '.'
        return gff1

    def create_block_dataframe(self, row):
        b1, b2, ks = row['block1'].split('_'), row['block2'].split('_'), row['ks'].split('_')
        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))
        block = pd.DataFrame(np.array([b1, b2, ks]).T, columns=['block1', 'block2', 'ks'])
        block['block1'] = block['block1'].astype(int)
        block['block2'] = block['block2'].astype(int)
        block['ks'] = block['ks'].astype(float)
        return block[(block['ks'] <= self.ks_area[1]) &
                     (block['ks'] >= self.ks_area[0])].drop_duplicates(subset=['block1'], keep='first')

    def ancestor_position(self, ax, gff, lens, mark):
        for _, row in lens.iterrows():
            loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index
            loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2]))].index
            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']
            if mark == 'top':
                width = abs(loc1-loc2)
                loc = [min(loc1, loc2), 0]
                height = -0.02
            if mark == 'left':
                height = abs(loc1-loc2)
                loc = [-0.02, min(loc1, loc2), ]
                width = 0.02
            base.Rectangle(ax, loc, height, width, row[3], row[4])
================================================
FILE: build/lib/wgdi/ancestral_karyotype.py
================================================
import pandas as pd
from Bio import SeqIO
import wgdi.base as base


class ancestral_karyotype:
    def __init__(self, options):
        self.mark = 'aak'
        # Set attributes from options
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")

    def run(self):
        # Load and filter data
        gff = base.newgff(self.gff)
        ancestor = base.read_classification(self.ancestor)
        gff = gff[gff['chr'].isin(ancestor[0].values.tolist())]
        # Create new gff copy and initialize required variables
        newgff = gff.copy()
        data, num = [], 1
        # Create dictionary mapping chromosome to order
        chr_arr = ancestor[3].drop_duplicates().to_list()
        chr_dict = {chr: idx + 1 for idx, chr in enumerate(chr_arr)}
        ancestor['order'] = ancestor[3].map(chr_dict)
        dict1, dict2 = {}, {}
        # Process ancestor and gff information
        for (cla, order), group in ancestor.groupby([4, 'order'], sort=[False, False]):
            for index, row in group.iterrows():
                index1 = gff[(gff['chr'] == row[0]) & (gff['order'] >= row[1]) & (gff['order'] <= row[2])].index
                newgff.loc[index1, 'chr'] = str(num)
                # Store results in data
                for k in index1:
                    data.append(newgff.loc[k, :].values.tolist() + [k])
            dict1[str(num)] = cla
            dict2[str(num)] = group[3].values[0]
            num += 1
        # Create dataframe from the data collected
        df = pd.DataFrame(data)
        # Filter based on peptide file
        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta"))
        df = df[df[6].isin(pep.keys())]
        # Assign new names and order
        for name, group in df.groupby(0):
            df.loc[group.index, 'order'] = range(1, len(group) + 1)
            df.loc[group.index, 'newname'] = [f"{self.mark}{name}g{i:05d}" for i in range(1, len(group) + 1)]
        # Set data types and sort
        df['order'] = df['order'].astype(int)
        df = df[[0, 'newname', 1, 2, 3, 'order', 6]].sort_values(by=[0, 'order'])
        # Save output files
        df.to_csv(self.ancestor_gff, sep="\t", index=False, header=None)
        lens = df.groupby(0).max()[[2, 'order']]
        lens.to_csv(self.ancestor_lens, sep="\t", header=None)
        # Add extra columns and save final results
        lens[1] = 1
        lens['color'] = lens.index.map(dict2)
        lens['class'] = lens.index.map(dict1)
        lens[[1, 'order', 'color', 'class']].to_csv(self.ancestor_file, sep="\t", header=None)
        # Update peptide sequences with new IDs and save
        id_dict = df.set_index(6).to_dict()['newname']
        seqs = []
        for seq_record in SeqIO.parse(self.pep_file, "fasta"):
            if seq_record.id in id_dict:
                seq_record.id = id_dict[seq_record.id]
                seqs.append(seq_record)
        SeqIO.write(seqs, self.ancestor_pep, "fasta")
================================================
FILE: build/lib/wgdi/ancestral_karyotype_repertoire.py
================================================
import numpy as np
import pandas as pd
from Bio import SeqIO
import wgdi.base as base


class ancestral_karyotype_repertoire():
    def __init__(self, options):
        self.gap = 5
        self.direction = 0.01
        self.mark = 'aak1s'
        self.blockinfo_reverse = False
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)

    def run(self):
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        if self.blockinfo_reverse == True:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        for index, row in bkinfo.iterrows():
            block1, block2 = row['block1'].split('_'), row['block2'].split('_')
            block1, block2 = [int(k) for k in block1], [int(k) for k in block2]
            if int(block1[1])-int(block1[0]) < 0:
                self.direction = -0.01
            for i in range(1, len(block2)):
                if abs(block1[i]-block1[i-1]) == 1 and abs(block2[i]-block2[i-1]) < int(self.gap):
                    gff1_id = gff1[(gff1['chr'] == str(row['chr1'])) & (
                        gff1['order'] == block1[i])].index[0]
                    order = gff1.loc[gff1_id, 'order']
                    gff1_row = gff1.loc[gff1_id, :].copy()
                    for num in range(block2[i-1], block2[i]):
                        order = order + self.direction
                        id = gff2[(gff2['chr'] == str(row['chr2']))
                                  & (gff2['order'] == num)].index[0]
                        gff1_row['order'] = order
                        gff1.loc[id, :] = gff1_row
        df = gff1.copy()
        df = df.sort_values(by=['chr', 'order'])
        # Group by the column name (not a one-element list) so that `name` is a
        # plain string rather than a 1-tuple on pandas >= 2.0.
        for name, group in df.groupby('chr'):
            df.loc[group.index, 'order'] = list(range(1, len(group)+1))
            df.loc[group.index, 'newname'] = list(
                [str(self.mark)+str(name)+'g'+str(i).zfill(5) for i in range(1, len(group)+1)])
        df['order'] = df['order'].astype(int)
        df['oldname'] = df.index
        columns = ['chr', 'newname', 'start',
                   'end', 'strand', 'order', 'oldname']
        df[columns].to_csv(self.ancestor_gff, sep="\t",
                           index=False, header=None)
        lens = df.groupby('chr').max()[['end', 'order']]
        lens['end'] = lens['end'].astype(np.int64)
        lens.to_csv(self.ancestor_lens, sep="\t", header=None)
        ancestor = base.read_classification(self.ancestor)
        for index, row in ancestor.iterrows():
            ancestor.at[index, 1] = 1
            ancestor.at[index, 2] = lens.at[str(row[0]), 'order']
        ancestor.to_csv(self.ancestor_new, sep="\t", index=False, header=None)
        id_dict = df['newname'].to_dict()
        seqs = []
        for seq_record in SeqIO.parse(self.ancestor_pep, "fasta"):
            if seq_record.id in id_dict:
                seq_record.id = id_dict[seq_record.id]
            else:
                continue
            seq_record.description = ''
            seqs.append(seq_record)
        SeqIO.write(seqs, self.ancestor_pep_new, "fasta")
================================================
FILE: build/lib/wgdi/base.py
================================================
import configparser
import hashlib
import os
import re
import matplotlib
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
from Bio import SeqIO
import wgdi


def gen_md5_id(item):
    """Generate MD5 hash for the given item."""
    return hashlib.md5(item.encode('utf-8')).hexdigest()


def config():
    """Read configuration from the example conf.ini file."""
    conf = configparser.ConfigParser()
    conf.read(os.path.join(wgdi.__path__[0], 'example/conf.ini'))
    return conf.items('ini')


def load_conf(file, section):
    """Load configuration items from the specified section."""
    conf = configparser.ConfigParser()
    conf.read(file)
    return conf.items(section)


def rewrite(file, section):
    """Rewrite the configuration file to keep only the specified section."""
    conf = configparser.ConfigParser()
    conf.read(file)
    if conf.has_section(section):
        for k in conf.sections():
            if k != section:
                conf.remove_section(k)
        # Use a context manager so the file handle is closed after writing.
        with open(os.path.join(wgdi.__path__[0], 'example/conf.ini'), 'w') as f:
            conf.write(f)
        print('Option ini has been modified')
    else:
        print('Option ini no change')


def read_colinearscan(file):
    """Read colinearscan output and parse into data structure."""
    data, b, flag, num = [], [], 0, 1
    with open(file) as f:
        for line in f:
            line = line.strip()
            if re.match(r"the", line):
                num = re.search(r'\d+', line).group()
                b = []
                flag = 1
                continue
            if re.match(r"\>LOCALE", line):
                flag = 0
                p = re.split(':', line)
                if b:
                    data.append([num, b, p[1]])
                b = []
                continue
            if flag == 1:
                a = re.split(r"\s", line)
                b.append(a)
    if b:
        data.append([num, b, p[1]])
    return data


def read_mcscanx(fn):
    """Read mcscanx output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        flag, num = 0, 0
        for line in f1:
            line = line.strip()
            if re.match(r"## Alignment", line):
                flag = 1
                if not b:
                    arr = re.findall(r"[\d+\.]+", line)[0]
                    continue
                data.append([num, b, 0])
                b = []
                num = re.findall(r"\d+", line)[0]
                continue
            if flag == 0:
                continue
            a = re.split(r"\:", line)
            c = re.split(r"\s+", a[1])
            b.append([c[1], c[1], c[2], c[2]])
        if b:
            data.append([num, b, 0])
    return data


def read_jcvi(fn):
    """Read jcvi output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        num = 1
        for line in f1:
            line = line.strip()
            if re.match(r"###", line):
                if b:
                    data.append([num, b, 0])
                    b = []
                    num += 1
                continue
            a = re.split(r"\t", line)
            b.append([a[0], a[0], a[1], a[1]])
        if b:
            data.append([num, b, 0])
    return data


def read_collinearity(fn):
    """Read collinearity output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        flag, arr = 0, []
        for line in f1:
            line = line.strip()
            if re.match(r"# Alignment", line):
                flag = 1
                if not b:
                    arr = re.findall(r'[\.\d+]+', line)
                    continue
                data.append([arr[0], b, arr[2]])
                b = []
                arr = re.findall(r'[\.\d+]+', line)
                continue
            if flag == 0:
                continue
            b.append(re.split(r"\s", line))
        if b:
            data.append([arr[0], b, arr[2]])
    return data


def read_ks(file, col):
    """Read Ks values from file and select specified column."""
    ks = pd.read_csv(file, sep='\t')
    ks.drop_duplicates(subset=['id1', 'id2'], keep='first', inplace=True)
    ks[col] = ks[col].astype(float)
    ks = ks[ks[col] >= 0]
    ks.index = ks['id1'] + ',' + ks['id2']
    return ks[col]


def get_median(data):
    """Calculate the median of the data list."""
    if not data:
        return 0
    data_sorted = sorted(data)
    half = len(data_sorted) // 2
    return (data_sorted[half] + data_sorted[-(half + 1)]) / 2


def cds_to_pep(cds_file, pep_file, fmt='fasta'):
    """Translate CDS sequences to peptide sequences and write to file."""
    records = list(SeqIO.parse(cds_file, fmt))
    for rec in records:
        rec.seq = rec.seq.translate()
    SeqIO.write(records, pep_file, 'fasta')
    return True


def newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):
    """Filter BLAST results based on score, evalue, and gene locations."""
    blast = pd.read_csv(file, sep="\t", header=None)
    if reverse == 'true':
        blast[[0, 1]] = blast[[1, 0]]
    blast = blast[(blast[11] >= score) & (blast[10] < evalue) & (blast[1] != blast[0])]
    blast = blast[(blast[0].isin(gene_loc1.index)) & (blast[1].isin(gene_loc2.index))]
    blast.drop_duplicates(subset=[0, 1], keep='first', inplace=True)
    blast[0] = blast[0].astype(str)
    blast[1] = blast[1].astype(str)
    return blast


def newgff(file):
    """Read GFF file and rename columns with appropriate data types."""
    gff = pd.read_csv(file, sep="\t", header=None, index_col=1)
    gff.rename(columns={0: 'chr', 2: 'start', 3: 'end', 4: 'strand', 5: 'order'}, inplace=True)
    gff['chr'] = gff['chr'].astype(str)
    gff['start'] = gff['start'].astype(np.int64)
    gff['end'] = gff['end'].astype(np.int64)
    gff['strand'] = gff['strand'].astype(str)
    gff['order'] = gff['order'].astype(int)
    return gff


def newlens(file, position):
    """Read lens file and select position based on 'order' or 'end'."""
    lens = pd.read_csv(file, sep="\t", header=None, index_col=0)
    lens.index = lens.index.astype(str)
    if position == 'order':
        lens = lens[2]
    elif position == 'end':
        lens = lens[1]
    return lens


def read_classification(file):
    """Read classification data and convert columns to appropriate types."""
    classification = pd.read_csv(file, sep="\t", header=None)
    classification[0] = classification[0].astype(str)
    classification[1] = classification[1].astype(int)
    classification[2] = classification[2].astype(int)
    classification[3] = classification[3].astype(str)
    classification[4] = classification[4].astype(int)
    return classification


def gene_location(gff, lens, step, position):
    """Calculate gene locations based on lens and step."""
    gff = gff[gff['chr'].isin(lens.index)].copy()
    if gff.empty:
        print('Stopped! \n\nChromosomes in gff file and lens file do not correspond.')
        exit(0)
    dict_chr = dict(zip(lens.index, np.append(np.array([0]), lens.cumsum()[:-1].values)))
    gff['loc'] = ''
    for name, group in gff.groupby('chr'):
        gff.loc[group.index, 'loc'] = (dict_chr[name] + group[position]) * step
    return gff


def dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, genome2_name, arr, pad=0):
    """Set up the dotplot frame with grid lines and labels."""
    for k in lens1.cumsum()[:-1] * step1:
        ax.axhline(y=k, alpha=0.8, color='black', lw=0.5)
    for k in lens2.cumsum()[:-1] * step2:
        ax.axvline(x=k, alpha=0.8, color='black', lw=0.5)
    align = dict(family='DejaVu Sans', style='italic', horizontalalignment="center", verticalalignment="center")
    yticks = lens1.cumsum() * step1 - 0.5 * lens1 * step1
    ax.set_yticks(yticks)
    ax.set_yticklabels(lens1.index, fontsize=13, family='DejaVu Sans', style='normal')
    ax.tick_params(axis='y', which='major', pad=pad)
    ax.tick_params(axis='x', which='major', pad=pad)
    xticks = lens2.cumsum() * step2 - 0.5 * lens2 * step2
    ax.set_xticks(xticks)
    ax.set_xticklabels(lens2.index, fontsize=13, family='DejaVu Sans', style='normal')
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    if arr[0] <= 0:
        ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)
    else:
        ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)
    if arr[1] < 0:
        ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)
    else:
        ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)


def Bezier3(plist, t):
    """Evaluate a quadratic Bezier curve (three control points) at parameter t."""
    p0, p1, p2 = plist
    return p0 * (1 - t) ** 2 + 2 * p1 * t * (1 - t) + p2 * t ** 2


def Bezier4(plist, t):
    """Evaluate a quartic Bezier curve (five control points) at parameter t."""
    p0, p1, p2, p3, p4 = plist
    return p0 * (1 - t) ** 4 + 4 * p1 * t * (1 - t) ** 3 + 6 * p2 * t ** 2 * (1 - t) ** 2 + 4 * p3 * (1 - t) * t ** 3 + p4 * t ** 4


def Rectangle(ax, loc, height, width, color, alpha):
    """Draw a rectangle on the axes with specified properties."""
    p = mpatches.Rectangle(loc, width, height, edgecolor=None, facecolor=color, alpha=alpha)
    ax.add_patch(p)


def str_to_bool(s):
    """Return s unchanged if it is a bool; otherwise True only for the string 'true' (case-insensitive)."""
    if isinstance(s, bool):
        return s
    return str(s).strip().lower() == 'true'
================================================
FILE: build/lib/wgdi/block_correspondence.py
================================================
import re
import numpy as np
import pandas as pd
import wgdi.base as base
class block_correspondence():
def __init__(self, options):
# Default values
self.tandem = True
self.pvalue = 0.2
self.position = 'order'
self.block_length = 5
self.tandem_length = 200
self.tandem_ratio = 1
self.ks_hit = 0.5
# Set user-defined options
for k, v in options:
setattr(self, str(k), v)
print(k, ' = ', v)
# Parse ks_area and homo if present
self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
self.homo = [float(k) for k in self.homo.split(',')]
self.tandem_ratio = float(self.tandem_ratio)
self.tandem = base.str_to_bool(self.tandem)
def run(self):
lens1 = base.newlens(self.lens1, self.position)
lens2 = base.newlens(self.lens2, self.position)
# Load block information from CSV
bkinfo = pd.read_csv(self.blockinfo)
bkinfo = self.preprocess_blockinfo(bkinfo, lens1, lens2)
# Initialize correspondence DataFrame
cor = self.initialize_correspondence(lens1, lens2)
# If no tandem allowed, remove tandem regions
if not self.tandem:
bkinfo = self.remove_tandem(bkinfo)
# Remove low KS hits
bkinfo = self.remove_ks_hit(bkinfo)
# Find collinearity regions and save results
collinear_indices = self.collinearity_region(cor, bkinfo, lens1)
bkinfo.loc[bkinfo.index.isin(collinear_indices), :].to_csv(self.savefile, index=False)
def preprocess_blockinfo(self, bkinfo, lens1, lens2):
bkinfo['chr1'] = bkinfo['chr1'].astype(str)
bkinfo['chr2'] = bkinfo['chr2'].astype(str)
# Filter by length, chromosome indices, and p-value
bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) &
(bkinfo['chr1'].isin(lens1.index)) &
(bkinfo['chr2'].isin(lens2.index)) &
(bkinfo['pvalue'] <= float(self.pvalue))]
# Filter by tandem ratio if the column exists
if 'tandem_ratio' in bkinfo.columns:
bkinfo = bkinfo[bkinfo['tandem_ratio'] <= self.tandem_ratio]
return bkinfo
def initialize_correspondence(self, lens1, lens2):
# Create correspondence DataFrame with initial values
cor = [[k, i, 0, lens1[i], j, 0, lens2[j], float(self.homo[0]), float(self.homo[1])]
for k in range(1, int(self.multiple) + 1)
for i in lens1.index
for j in lens2.index]
cor = pd.DataFrame(cor, columns=['sub', 'chr1', 'start1', 'end1', 'chr2', 'start2', 'end2', 'homo1', 'homo2'])
cor['chr1'] = cor['chr1'].astype(str)
cor['chr2'] = cor['chr2'].astype(str)
return cor
def remove_tandem(self, bkinfo):
# Remove tandem regions from the DataFrame
group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
group['start'] = group['start1'] - group['start2']
group['end'] = group['end1'] - group['end2']
tandem_condition = (group['start'].abs() <= int(self.tandem_length)) | (group['end'].abs() <= int(self.tandem_length))
index_to_remove = group[tandem_condition].index
return bkinfo.drop(index_to_remove)
def remove_ks_hit(self, bkinfo):
# Remove records with insufficient KS hits
for index, row in bkinfo.iterrows():
ks = self.get_ks_value(row['ks'])
ks_ratio = len([k for k in ks if self.ks_area[0] <= k <= self.ks_area[1]]) / len(ks)
if ks_ratio < self.ks_hit:
bkinfo.drop(index, inplace=True)
return bkinfo
def get_ks_value(self, ks_str):
# Extract and return KS values as floats
ks = ks_str.split('_')
ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))
return ks
def collinearity_region(self, cor, bkinfo, lens):
collinear_indices = []
for (chr1, chr2), group in bkinfo.groupby(['chr1', 'chr2']):
group = group.sort_values(by=['length'], ascending=False)
df = pd.Series(0, index=range(1, int(lens[str(chr1)]) + 1))
for index, row in group.iterrows():
# Check homology conditions
if not self.is_valid_homo(row):
continue
# Update the block series and compute ratio
b1 = [int(k) for k in row['block1'].split('_')]
df1 = df.copy()
df1[b1] += 1
ratio = (len(df1[df1 > 0]) - len(df[df > 0])) / len(b1)
if ratio < 0.5:
continue
df[b1] += 1
collinear_indices.append(index)
return collinear_indices
def is_valid_homo(self, row):
# Check if the homology values are within the specified range
        # f-string works whether self.multiple is stored as a str or an int
        return self.homo[0] <= row[f"homo{self.multiple}"] <= self.homo[1]
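
The Ks filtering done by `get_ks_value` and `remove_ks_hit` above can be tried in isolation. The following is a minimal sketch, not part of wgdi: the helper names `parse_ks` and `ks_hit_ratio` and the sample string are hypothetical, and it assumes the `ks` column holds underscore-joined values that may start with a leading `_`.

```python
def parse_ks(ks_str):
    """Split an underscore-joined Ks string into floats, ignoring a leading '_'."""
    parts = ks_str.split('_')
    if parts and parts[0] == '':
        parts = parts[1:]
    return [float(p) for p in parts]


def ks_hit_ratio(ks_str, low, high):
    """Fraction of Ks values inside [low, high]; 0.0 when the block has no values."""
    values = parse_ks(ks_str)
    if not values:
        return 0.0
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


if __name__ == "__main__":
    # Two of the four values (0.1 and 0.5) fall inside [0, 1]
    print(ks_hit_ratio("_0.1_0.5_2.0_-1", 0.0, 1.0))
```

A block would then be discarded when this ratio is below the configured `ks_hit` threshold.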
================================================
FILE: build/lib/wgdi/block_info.py
================================================
import numpy as np
import pandas as pd
import wgdi.base as base
class block_info:
def __init__(self, options):
self.repeat_number = 20
self.ks_col = 'ks_NG86'
self.blast_reverse = False
for k, v in options:
setattr(self, str(k), v)
print(f"{k} = {v}")
self.repeat_number = int(self.repeat_number)
self.blast_reverse = base.str_to_bool(self.blast_reverse)
def block_position(self, collinearity, blast, gff1, gff2, ks):
data = []
for block in collinearity:
blk_homo, blk_ks = [], []
# Skip blocks with missing gene coordinates in GFF files
if block[1][0][0] not in gff1.index or block[1][0][2] not in gff2.index:
continue
# Extract chromosome info
chr1, chr2 = gff1.at[block[1][0][0], 'chr'], gff2.at[block[1][0][2], 'chr']
# Extract start and end positions
array1, array2 = [float(i[1]) for i in block[1]], [float(i[3]) for i in block[1]]
start1, end1 = array1[0], array1[-1]
start2, end2 = array2[0], array2[-1]
block1, block2 = [], []
for k in block[1]:
block1.append(int(float(k[1])))
block2.append(int(float(k[3])))
# Check for KS values
pair_ks = self.get_ks_value(ks, k)
blk_ks.append(pair_ks)
# Retrieve blast homo data
if k[0]+","+k[2] in blast.index:
blk_homo.append(blast.loc[k[0]+","+k[2], [f'homo{i}' for i in range(1, 6)]].values.tolist())
ks_median, ks_average = self.calculate_ks_statistics(blk_ks)
homo = self.calculate_homo_statistics(blk_homo)
blkks = '_'.join([str(k) for k in blk_ks])
block1 = '_'.join([str(k) for k in block1])
block2 = '_'.join([str(k) for k in block2])
# Calculate tandem ratio
tandem_ratio = self.tandem_ratio(blast, gff2, block[1])
# Store the results
data.append([
block[0], chr1, chr2, start1, end1, start2, end2, block[2], len(block[1]),
ks_median, ks_average, *homo, block1, block2, blkks, tandem_ratio
])
# Create a DataFrame with the results
data_df = pd.DataFrame(data, columns=[
'id', 'chr1', 'chr2', 'start1', 'end1', 'start2', 'end2', 'pvalue', 'length',
'ks_median', 'ks_average', 'homo1', 'homo2', 'homo3', 'homo4', 'homo5',
'block1', 'block2', 'ks', 'tandem_ratio'
])
# Calculate density
data_df['density1'] = data_df['length'] / ((data_df['end1'] - data_df['start1']).abs() + 1)
data_df['density2'] = data_df['length'] / ((data_df['end2'] - data_df['start2']).abs() + 1)
return data_df
def get_ks_value(self, ks, k):
"""Return KS value for the given pair of genes."""
pair = f"{k[0]},{k[2]}"
if pair in ks.index:
return ks[pair]
pair_rev = f"{k[2]},{k[0]}"
if pair_rev in ks.index:
return ks[pair_rev]
return -1
def calculate_ks_statistics(self, blk_ks):
"""Calculate KS statistics: median and average."""
ks_arr = [k for k in blk_ks if k >= 0]
if len(ks_arr) == 0:
return -1, -1
ks_median = base.get_median(ks_arr)
ks_average = sum(ks_arr) / len(ks_arr)
return ks_median, ks_average
    def calculate_homo_statistics(self, blk_homo):
        """Calculate homo statistics by averaging homo1..homo5 over the gene pairs of one block."""
df = pd.DataFrame(blk_homo)
homo = df.mean().values if len(df) > 0 else [-1, -1, -1, -1, -1]
return homo
def blast_homo(self, blast, gff1, gff2, repeat_number):
"""Assign homo values based on blast data."""
index = [group.sort_values(by=11, ascending=False)[:repeat_number].index.tolist() for name, group in blast.groupby([0])]
blast = blast.loc[np.concatenate([k[:repeat_number] for k in index], dtype=object), [0, 1]]
blast = blast.assign(homo1=np.nan, homo2=np.nan, homo3=np.nan, homo4=np.nan, homo5=np.nan)
# Assign homo values
for i in range(1, 6):
bluenum = i + 5
redindex = np.concatenate([k[:i] for k in index], dtype=object)
blueindex = np.concatenate([k[i:bluenum] for k in index], dtype=object)
grayindex = np.concatenate([k[bluenum:repeat_number] for k in index], dtype=object)
blast.loc[redindex, f'homo{i}'] = 1
blast.loc[blueindex, f'homo{i}'] = 0
blast.loc[grayindex, f'homo{i}'] = -1
blast['chr1_order'] = blast[0].map(gff1['order'])
blast['chr2_order'] = blast[1].map(gff2['order'])
return blast
def tandem_ratio(self, blast, gff2, block):
"""Calculate tandem ratio for a block."""
block = pd.DataFrame(block)[[0, 2]].rename(columns={0: 'id1', 2: 'id2'})
block['order2'] = block['id2'].map(gff2['order'])
# Filter block_blast data
block_blast = blast[(blast[0].isin(block['id1'].values)) & (blast[1].isin(block['id2'].values))].copy()
block_blast = pd.merge(block_blast, block, left_on=0, right_on='id1', how='left')
block_blast['difference'] = (block_blast['chr2_order'] - block_blast['order2']).abs()
# Filter based on difference and calculate ratio
block_blast = block_blast[(block_blast['difference'] <= self.repeat_number) & (block_blast['difference'] > 0)]
return len(block_blast[0].unique()) / len(block) * len(block_blast) / (len(block) + len(block_blast))
def run(self):
"""Main function to run the analysis."""
# Initialize required datasets
lens1 = base.newlens(self.lens1, self.position)
lens2 = base.newlens(self.lens2, self.position)
gff1 = base.newgff(self.gff1)
gff2 = base.newgff(self.gff2)
# Filter GFF files based on chromosome indices
gff1 = gff1[gff1['chr'].isin(lens1.index)]
gff2 = gff2[gff2['chr'].isin(lens2.index)]
# Load blast data
blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)
blast = self.blast_homo(blast, gff1, gff2, self.repeat_number)
blast.index = blast[0] + ',' + blast[1]
# Get collinearity data
collinearity = self.auto_file(gff1, gff2)
# Load ks data if necessary
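        # Read the attribute defensively: testing hasattr after self.ks was already accessed could never help
        ks_path = getattr(self, 'ks', 'none')
        ks = pd.Series(dtype=float) if ks_path in ('none', '') else base.read_ks(ks_path, self.ks_col)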
# Get the block position data
data = self.block_position(collinearity, blast, gff1, gff2, ks)
data['class1'] = 0
data['class2'] = 0
# Save results
data.to_csv(self.savefile, index=None)
def auto_file(self, gff1, gff2):
"""Auto-detect and read collinearity file."""
with open(self.collinearity) as f:
p = ' '.join(f.readlines()[0:30])
# Handle different file formats
if 'path length' in p or 'MAXIMUM GAP' in p:
return base.read_colinearscan(self.collinearity)
elif 'MATCH_SIZE' in p or '## Alignment' in p:
return self.process_mcscanx(gff1, gff2)
elif '# Alignment' in p:
return base.read_collinearity(self.collinearity)
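        elif '###' in p:
            return self.process_jcvi(gff1, gff2)
        # Fail loudly instead of returning None, which would break block_position later
        raise ValueError(f"Unrecognized collinearity file format: {self.collinearity}")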
def process_mcscanx(self, gff1, gff2):
"""Process MCScanX format collinearity data."""
col = base.read_mcscanx(self.collinearity)
collinearity = []
for block in col:
newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]
if newblock:
for k in newblock:
k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']
collinearity.append([block[0], newblock, block[2]])
return collinearity
def process_jcvi(self, gff1, gff2):
"""Process JCVI format collinearity data."""
col = base.read_jcvi(self.collinearity)
collinearity = []
for block in col:
newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]
if newblock:
for k in newblock:
k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']
collinearity.append([block[0], newblock, block[2]])
return collinearity
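
The homo1..homo5 values assigned in `blast_homo` above depend only on a hit's rank among the best hits of its query gene. The following is a minimal sketch of that rule under stated assumptions: the function name `homo_values` is hypothetical, `rank` is 0-based, and hits beyond `repeat_number` are treated as already discarded.

```python
def homo_values(rank, repeat_number=20):
    """Return [homo1, ..., homo5] for a hit at 0-based `rank` among a query's best hits.

    For homo_i: the top i hits score 1, the next 5 hits score 0,
    and the remaining retained hits score -1.
    """
    if not 0 <= rank < repeat_number:
        raise ValueError("rank is outside the retained hits")
    values = []
    for i in range(1, 6):
        if rank < i:
            values.append(1)
        elif rank < i + 5:
            values.append(0)
        else:
            values.append(-1)
    return values


if __name__ == "__main__":
    for rank in (0, 1, 7):
        print(rank, homo_values(rank))
```

Averaging these per-pair values over a block, as `calculate_homo_statistics` does, gives a number between -1 and 1 for each of the five columns.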
================================================
FILE: build/lib/wgdi/block_ks.py
================================================
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base
class block_ks:
def __init__(self, options):
# Default parameters
self.markersize = 0.8
self.figsize = 'default'
self.tandem_length = 200
self.blockinfo_reverse = False
self.tandem = False
self.area = [0, 3]
self.position = 'order'
self.ks_col = 'ks_NG86'
self.pvalue = 0.01
# Overriding default parameters with options
for k, v in options:
setattr(self, str(k), v)
print(f"{k} = {v}")
# Parsing area as a float list
self.area = [float(k) for k in str(self.area).split(',')]
self.markersize = float(self.markersize)
self.tandem_length = int(self.tandem_length)
self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
        # Convert the 'tandem' option; assigning to self.remove_tandem would shadow the method of that name
        self.tandem = base.str_to_bool(self.tandem)
def block_position(self, bkinfo, lens1, lens2, step1, step2):
pos, pairs = [], []
# Create mappings for chromosome positions
dict_y_chr = dict(zip(lens1.index, np.append([0], lens1.cumsum()[:-1].values)))
dict_x_chr = dict(zip(lens2.index, np.append([0], lens2.cumsum()[:-1].values)))
# Iterate through block information
for _, row in bkinfo.iterrows():
block1 = row['block1'].split('_')
block2 = row['block2'].split('_')
ks = row['ks'].split('_')
locy_median = (dict_y_chr[row['chr1']] + 0.5 * (row['end1'] + row['start1'])) * step1
locx_median = (dict_x_chr[row['chr2']] + 0.5 * (row['end2'] + row['start2'])) * step2
pos.append([locx_median, locy_median, row['ks_median']])
# Ensure ks length matches block length
if len(block1) != len(ks):
ks = ks[1:]
for i in range(len(block1)):
locy = (dict_y_chr[row['chr1']] + float(block1[i])) * step1
locx = (dict_x_chr[row['chr2']] + float(block2[i])) * step2
pairs.append([locx, locy, float(ks[i])])
return pos, pairs
def remove_tandem(self, bkinfo):
# Filter for same-chromosome blocks
group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
# Calculate block start and end differences
group['start'] = group['start1'] - group['start2']
group['end'] = group['end1'] - group['end2']
# Remove tandems based on threshold
index = group[(group['start'].abs() <= self.tandem_length) |
(group['end'].abs() <= self.tandem_length)].index
return bkinfo.drop(index)
def run(self):
# Initialize axis and chromosome lens
axis = [0, 1, 1, 0]
lens1 = base.newlens(self.lens1, self.position)
lens2 = base.newlens(self.lens2, self.position)
# Parse figsize
if re.search(r'\d', self.figsize):
self.figsize = [float(k) for k in self.figsize.split(',')]
else:
self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10
# Calculate step sizes
step1 = 1 / float(lens1.sum())
step2 = 1 / float(lens2.sum())
# Create figure and axes
fig, ax = plt.subplots(figsize=self.figsize)
plt.rcParams['ytick.major.pad'] = 0
ax.xaxis.set_ticks_position('top')
# Plot dotplot frame
base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,
self.genome1_name, self.genome2_name, [0, 1])
# Load block information
bkinfo = pd.read_csv(self.blockinfo)
# Handle reverse block information
if self.blockinfo_reverse == True:
bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
# Filter block information
bkinfo['chr1'] = bkinfo['chr1'].astype(str)
bkinfo['chr2'] = bkinfo['chr2'].astype(str)
bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) &
(bkinfo['chr1'].isin(lens1.index)) &
(bkinfo['chr2'].isin(lens2.index)) &
(bkinfo['pvalue'] < float(self.pvalue))]
# Remove tandem duplicates if required
if self.tandem == False:
bkinfo = self.remove_tandem(bkinfo)
# Calculate positions and pairs
pos, pairs = self.block_position(bkinfo, lens1, lens2, step1, step2)
# Filter pairs by ks value
df = pd.DataFrame(pairs, columns=['loc1', 'loc2', 'ks'])
df = df[(df['ks'] >= self.area[0]) & (df['ks'] <= self.area[1])]
df.drop_duplicates(inplace=True)
# Plot scatter
        # plt.cm.get_cmap was removed in recent Matplotlib releases; plt.get_cmap is still available
        cm = plt.get_cmap('gist_rainbow')
sc = plt.scatter(df['loc1'], df['loc2'], s=self.markersize, c=df['ks'],
alpha=0.9, edgecolors=None, linewidths=0, marker='o',
vmin=self.area[0], vmax=self.area[1], cmap=cm)
# Add colorbar
cbar = fig.colorbar(sc, shrink=0.5, pad=0.03, fraction=0.1)
align = dict(family='DejaVu Sans', style='normal',
horizontalalignment="center", verticalalignment="center")
cbar.set_label('Ks', labelpad=12.5, fontsize=16, **align)
# Set axis and save figure
ax.axis(axis)
plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.03)
plt.savefig(self.savefig, dpi=500)
plt.show()
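
The coordinate mapping used in `block_position` above places each gene at its chromosome's cumulative offset plus its position, scaled by the total length. The following is a minimal sketch in plain Python, without pandas; the helper names `chromosome_offsets` and `to_axis_fraction` and the sample lengths are hypothetical.

```python
from itertools import accumulate


def chromosome_offsets(lengths):
    """Map each chromosome name to the summed length of the chromosomes before it."""
    names = list(lengths)
    cumulative = list(accumulate(lengths[name] for name in names))
    starts = [0] + cumulative[:-1]
    return dict(zip(names, starts))


def to_axis_fraction(lengths, chrom, position):
    """Convert a position on `chrom` to a fraction of the whole axis (0 to 1)."""
    total = sum(lengths.values())
    return (chromosome_offsets(lengths)[chrom] + position) / total


if __name__ == "__main__":
    lens = {"1": 100, "2": 300}
    print(chromosome_offsets(lens))
    print(to_axis_fraction(lens, "2", 100))
```

The factor `1 / total` plays the role of `step1` and `step2` in the class above.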
================================================
FILE: build/lib/wgdi/circos.py
================================================
import re
import sys
import matplotlib as mpl
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base
class circos():
def __init__(self, options):
self.figsize = '10,10'
self.position = 'order'
self.label_size = 9
self.label_radius = 0.015
self.column_names = [None]*100
for k, v in options:
setattr(self, str(k), v)
print(k, ' = ', v)
self.figsize = [float(k) for k in self.figsize.split(',')]
self.ring_width = float(self.ring_width)
if hasattr(self, 'legend_square'):
self.legend_square = [float(k)
for k in self.legend_square.split(',')]
else:
self.legend_square = 0.04, 0.04
def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, linestyle='-'):
for k in loc_chr:
start, end = loc_chr[k]
t = np.arange(start, end, 0.005)
x, y = (radius) * np.cos(t), (radius) * np.sin(t)
plt.plot(x, y, linestyle=linestyle,
color=color, lw=lw, alpha=alpha)
def plot_labels(self, root, labels, loc_chr, radius, horizontalalignment="center", verticalalignment="center", fontsize=6,
color='black'):
for k in loc_chr:
loc = sum(loc_chr[k]) * 0.5
x, y = radius * np.cos(loc), radius * np.sin(loc)
self.Wedge(root, (x, y), self.label_radius, 0,
360, self.label_radius, 'white', 1)
if 1 * np.pi < loc < 2 * np.pi:
loc += np.pi
plt.text(x, y, labels[k], horizontalalignment=horizontalalignment, verticalalignment=verticalalignment,
fontsize=fontsize, color=color, rotation=0)
def Wedge(self, ax, loc, radius, start, end, width, color, alpha):
p = mpatches.Wedge(loc, radius, start, end, width=width,
edgecolor=None, facecolor=color, alpha=alpha)
ax.add_patch(p)
def plot_bar(self, df, radius, length, lw, color, alpha):
for k in df[df.columns[0]].drop_duplicates().values:
if str(k) not in color.keys():
color[str(k)] = 'black'
if k in ['', np.nan]:
continue
df_chr = df.groupby(df.columns[0]).get_group(k)
x1, y1 = radius * \
np.cos(df_chr['rad']), radius * np.sin(df_chr['rad'])
x2, y2 = (radius + length) * \
np.cos(df_chr['rad']), (radius + length) * \
np.sin(df_chr['rad'])
x = np.array(
[x1.values, x2.values, [np.nan] * x1.size]).flatten('F')
y = np.array(
[y1.values, y2.values, [np.nan] * x1.size]).flatten('F')
plt.plot(x, y, linestyle='-',
color=color[str(k)], lw=lw, alpha=alpha)
def chr_location(self, lens, angle_gap, angle):
start, end, loc_chr = 0, 0.2*angle_gap, {}
for k in lens.index:
end += angle_gap + angle * (float(lens[k]))
start = end - angle * (float(lens[k]))
loc_chr[k] = [float(start), float(end)]
return loc_chr
def deal_alignment(self, alignment, gff, lens, loc_chr, angle):
        # regex=True is required for the whitespace pattern to be treated as a regular expression
        alignment.replace(r'\s+', '', regex=True, inplace=True)
        alignment.replace('.', '', inplace=True)
newalignment = alignment.copy()
for i in range(len(alignment.columns)):
alignment[i] = alignment[i].astype(str)
newalignment[i] = alignment[i].map(gff['chr'].to_dict())
newalignment['loc'] = alignment[0].map(gff[self.position].to_dict())
newalignment[0] = newalignment[0].astype('str')
newalignment['loc'] = newalignment['loc'].astype('float')
newalignment = newalignment[newalignment[0].isin(lens.index) == True]
newalignment['rad'] = np.nan
for name, group in newalignment.groupby(0):
if str(name) not in loc_chr:
continue
            newalignment.loc[group.index, 'rad'] = loc_chr[str(
                name)][0]+angle * group['loc']
        return newalignment
def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al):
alignment.replace('\s+', '', inplace=True)
alignment.replace('.', np.nan, inplace=True)
newalignment = pd.merge(alignment, gff, left_on=0, right_on=gff.index)
newalignment['rad'] = np.nan
for name, group in newalignment.groupby('chr'):
if str(name) not in loc_chr:
continue
newalignment.loc[group.index, 'rad'] = loc_chr[str(
name)][0]+angle * group[self.position]
newalignment.index = newalignment[0]
newalignment[0] = newalignment[0].map(newalignment['rad'].to_dict())
data = []
for index_al, row_al in al.iterrows():
for k in alignment.columns[1:]:
alignment[k] = alignment[k].astype(str)
group = newalignment[(newalignment['chr'] == row_al['chr']) & (
newalignment['order'] >= row_al['start']) & (newalignment['order'] <= row_al['end'])].copy()
group.loc[:, k] = group.loc[:, k].map(
newalignment['rad']).values
group.dropna(subset=[k], inplace=True)
group.index = group.index.map(newalignment['rad'].to_dict())
group['color'] = row_al['color']
group = group[group[k].notnull()]
data += group[[0, k, 'color']].values.tolist()
df = pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])
return df
def plot_collinearity(self, data, radius, lw=0.02, alpha=1):
for name, group in data.groupby('color'):
x, y = np.array([]), np.array([])
for index, row in group.iterrows():
ex1x, ex1y = radius * \
np.cos(row['loc1']), radius*np.sin(row['loc1'])
ex2x, ex2y = radius * \
np.cos(row['loc2']), radius*np.sin(row['loc2'])
ex3x, ex3y = radius * (1-abs(row['loc1']-row['loc2'])/np.pi) * np.cos((row['loc1']+row['loc2'])*0.5), radius * (
1-abs(row['loc1']-row['loc2'])/np.pi) * np.sin((row['loc1']+row['loc2'])*0.5)
x1 = [ex1x, 0.5*ex3x, ex2x]
y1 = [ex1y, 0.5*ex3y, ex2y]
step = .002
t = np.arange(0, 1+step, step)
xt = base.Bezier3(x1, t)
yt = base.Bezier3(y1, t)
x = np.hstack((x, xt, np.nan))
y = np.hstack((y, yt, np.nan))
plt.plot(x, y, color=name, lw=lw, alpha=alpha)
def plot_legend(self, ax, chr_color, width, height):
(x1, x2) = ax.get_xlim()
(y1, y2) = ax.get_ylim()
a = 1000
for k, v in enumerate(chr_color.keys(), 0):
h = y1-k//a*height*2
k = k % a
if x1 + width * k > x2-width:
a = k
h = y1-k//a*height*2
k = k % a
loc = [x1 + width * k, h]
base.Rectangle(ax, loc, height, width, chr_color[v], 1)
plt.text(loc[0] + width*0.382, h-0.618*height, v, fontsize=12)
ax.set_ylim(h-2*height, y2)
def run(self):
fig, ax = plt.subplots(figsize=self.figsize)
mpl.rcParams['agg.path.chunksize'] = 100000000
lens = base.newlens(self.lens, self.position)
radius, angle_gap = float(self.radius), float(self.angle_gap)
angle = (2 * np.pi - (int(len(lens))+1.5)
* angle_gap) / (int(lens.sum()))
loc_chr = self.chr_location(lens, angle_gap, angle)
list_colors = [str(k).strip() for k in re.split(',|:', self.colors)]
chr_color = dict(zip(list_colors[::2], list_colors[1::2]))
gff = base.newgff(self.gff)
if hasattr(self, 'ancestor'):
ancestor = pd.read_csv(self.ancestor, header=None)
al = pd.read_csv(self.ancestor_location, sep='\t', header=None)
al.rename(columns={0: 'chr', 1: 'start',
2: 'end', 3: 'color'}, inplace=True)
al['chr'] = al['chr'].astype(str)
data = self.deal_ancestor(ancestor, gff, lens, loc_chr, angle, al)
self.plot_collinearity(data, radius, lw=0.1, alpha=0.8)
if hasattr(self, 'alignment'):
            alignment = pd.read_csv(self.alignment, header=None)
            newalignment = self.deal_alignment(
                alignment, gff, lens, loc_chr, angle)
            # Accept a single name as well as a comma-separated list, and pad so every ring has an entry
            if isinstance(self.column_names, str):
                names = [str(k).strip() for k in self.column_names.split(',')]
                names += [None]*len(newalignment.columns)
            else:
                names = [None]*len(newalignment.columns)
n = 0
align = dict(family='Arial', verticalalignment="center",
horizontalalignment="center")
            for k, v in enumerate(newalignment.columns[1:-2]):
                r = radius + self.ring_width*(k+1)
self.plot_circle(loc_chr, r, lw=0.5, alpha=1, color='grey')
self.plot_bar(newalignment[[v, 'rad']], r + self.ring_width *
0.15, self.ring_width*0.7, 0.15, chr_color, 1)
if n % 2 == 0:
loc = 0.05
x, y = (r+self.ring_width*0.5) * \
np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)
plt.text(x, y, names[n], rotation=loc *
180 / np.pi, fontsize=self.label_size, **align)
else:
loc = -0.08
x, y = (r+self.ring_width*0.5) * \
np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)
plt.text(x, y, names[n], fontsize=self.label_size,
rotation=loc * 180 / np.pi, **align)
n += 1
if hasattr(self, 'ancestor'):
colors = al['color'].drop_duplicates().values.tolist()
ancestor_chr_color = dict(zip(range(1, len(colors)+1), colors))
self.plot_legend(ax, ancestor_chr_color,
self.legend_square[0], self.legend_square[1])
if hasattr(self, 'alignment'):
            # pop avoids a KeyError when no 'nan' entry was added while plotting
            chr_color.pop('nan', None)
self.plot_legend(
ax, chr_color, self.legend_square[0], self.legend_square[1])
labels = self.chr_label + lens.index
labels = dict(zip(lens.index, labels))
self.plot_labels(ax, labels, loc_chr, radius +
self.ring_width*0.3, fontsize=self.label_size)
plt.axis('off')
a = (ax.get_ylim()[1]-ax.get_ylim()[0]) / \
(ax.get_xlim()[1]-ax.get_xlim()[0])
fig.set_size_inches(self.figsize[0], self.figsize[0]*a, forward=True)
plt.savefig(self.savefig, dpi=500)
plt.show()
sys.exit(0)
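
The angular layout computed by `chr_location` and the `angle` expression in `run` above can be checked on its own. The following is a minimal sketch with the same arithmetic in plain Python; the function name `chr_arcs` and the sample lengths are hypothetical, and angles are in radians.

```python
import math


def chr_arcs(lens, angle_gap):
    """Return {chromosome: (start_angle, end_angle)} around the circle.

    The radians left after reserving (n + 1.5) gaps are shared out
    in proportion to chromosome length.
    """
    total = sum(lens.values())
    angle = (2 * math.pi - (len(lens) + 1.5) * angle_gap) / total
    arcs = {}
    end = 0.2 * angle_gap
    for name, length in lens.items():
        end += angle_gap + angle * length
        start = end - angle * length
        arcs[name] = (start, end)
    return arcs


if __name__ == "__main__":
    for name, (start, end) in chr_arcs({"1": 100, "2": 300}, 0.1).items():
        print(name, round(start, 3), round(end, 3))
```

Consecutive arcs are separated by exactly `angle_gap`, and the last arc ends before a full turn, which leaves room for the ring labels drawn near angle 0.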
================================================
FILE: build/lib/wgdi/collinearity.py
================================================
import numpy as np
import pandas as pd
class collinearity:
def __init__(self, options, points):
# Default values
self.gap_penalty = -1
self.over_length = 0
self.mg1 = 40
self.mg2 = 40
self.pvalue = 1
self.over_gap = 3
self.points = points
self.p_value = 0
self.coverage_ratio = 0.8
# Set user-defined options
for k, v in options:
setattr(self, str(k), v)
# Initialize grading and mg values
self.grading = [50, 40, 25] if not hasattr(self, 'grading') else [int(k) for k in self.grading.split(',')]
self.mg1, self.mg2 = [40, 40] if not hasattr(self, 'mg') else [int(k) for k in self.mg.split(',')]
# Convert string values to floats
self.pvalue = float(self.pvalue)
self.coverage_ratio = float(self.coverage_ratio)
def get_matrix(self):
"""Initialize the matrix for the collinearity points."""
self.points['usedtimes1'] = 0
self.points['usedtimes2'] = 0
self.points['times'] = 1
self.points['score1'] = self.points['grading']
self.points['score2'] = self.points['grading']
self.points['path1'] = self.points.index.to_numpy().reshape(len(self.points), 1).tolist()
self.points['path2'] = self.points['path1']
self.points_init = self.points.copy()
self.mat_points = self.points
def run(self):
"""Run the main collinearity processing."""
self.get_matrix()
self.score_matrix()
data = []
# Process points for maxPath in the positive direction
points1 = self.points[['loc1', 'loc2', 'score1', 'path1', 'usedtimes1']].sort_values(by=['score1'], ascending=False)
points1.drop(index=points1[points1['usedtimes1'] < 1].index, inplace=True)
points1.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']
while (self.over_length >= self.over_gap or len(points1) >= self.over_gap):
if self.max_path(points1):
if self.p_value > self.pvalue:
continue
data.append([self.path, self.p_value, self.score])
# Process points for maxPath in the negative direction
points2 = self.points[['loc1', 'loc2', 'score2', 'path2', 'usedtimes2']].sort_values(by=['score2'], ascending=False)
points2.drop(index=points2[points2['usedtimes2'] < 1].index, inplace=True)
points2.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']
while (self.over_length >= self.over_gap) or (len(points2) >= self.over_gap):
if self.max_path(points2):
if self.p_value > self.pvalue:
continue
data.append([self.path, self.p_value, self.score])
return data
def score_matrix(self):
"""Calculate the scoring matrix for the points."""
for index, row, col in self.points[['loc1', 'loc2']].itertuples():
# Get points within a certain range
points = self.points[(self.points['loc1'] > row) &
(self.points['loc2'] > col) &
(self.points['loc1'] < row + self.mg1) &
(self.points['loc2'] < col + self.mg2)]
row_i_old, gap = row, self.mg2
for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():
if col_j - col > gap and row_i > row_i_old:
break
score = grading + (row_i - row + col_j - col) * self.gap_penalty
score1 = score + self.points.at[index, 'score1']
if score > 0 and self.points.at[index_ij, 'score1'] < score1:
self.points.at[index_ij, 'score1'] = score1
self.points.at[index, 'usedtimes1'] += 1
self.points.at[index_ij, 'usedtimes1'] += 1
self.points.at[index_ij, 'path1'] = self.points.at[index, 'path1'] + [index_ij]
gap = min(col_j - col, gap)
row_i_old = row_i
# Reverse processing to handle negative direction
points_reverse = self.points.sort_values(by=['loc1', 'loc2'], ascending=[False, True])
for index, row, col in points_reverse[['loc1', 'loc2']].itertuples():
points = points_reverse[(points_reverse['loc1'] < row) &
(points_reverse['loc2'] > col) &
(points_reverse['loc1'] > row - self.mg1) &
(points_reverse['loc2'] < col + self.mg2)]
row_i_old, gap = row, self.mg2
for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():
if col_j - col > gap and row_i < row_i_old:
break
score = grading + (row - row_i + col_j - col) * self.gap_penalty
score2 = score + self.points.at[index, 'score2']
if score > 0 and self.points.at[index_ij, 'score2'] < score2:
self.points.at[index_ij, 'score2'] = score2
self.points.at[index, 'usedtimes2'] += 1
self.points.at[index_ij, 'usedtimes2'] += 1
self.points.at[index_ij, 'path2'] = self.points.at[index, 'path2'] + [index_ij]
gap = min(col_j - col, gap)
row_i_old = row_i
def max_path(self, points):
"""Find the maximum path for the given points."""
if len(points) == 0:
self.over_length = 0
return False
# Initialize path score and index
self.score, self.path_index = points.loc[points.index[0], ['score', 'path']]
self.path = points[points.index.isin(self.path_index)]
self.over_length = len(self.path_index)
# Check if the block overlaps with other blocks
if self.over_length >= self.over_gap and len(self.path) / self.over_length > self.coverage_ratio:
points.drop(index=self.path.index, inplace=True)
[loc1_min, loc2_min], [loc1_max, loc2_max] = self.path[['loc1', 'loc2']].agg(['min', 'max']).to_numpy()
# Calculate p-value
gap_init = self.points_init[(loc1_min <= self.points_init['loc1']) &
(self.points_init['loc1'] <= loc1_max) &
(loc2_min <= self.points_init['loc2']) &
(self.points_init['loc2'] <= loc2_max)].copy()
self.p_value = self.p_value_estimated(gap_init, loc1_max - loc1_min + 1, loc2_max - loc2_min + 1)
self.path = self.path.sort_values(by=['loc1'], ascending=[True])[['loc1', 'loc2']]
return True
else:
points.drop(index=points.index[0], inplace=True)
return False
def p_value_estimated(self, gap, L1, L2):
"""Estimate p-value based on the given gap and lengths."""
N1 = gap['times'].sum()
N = len(gap)
self.points_init.loc[gap.index, 'times'] += 1
m = len(self.path)
a = (1 - self.score / m / self.grading[0]) * (N1 - m + 1) / N * (L1 - m + 1) * (L2 - m + 1) / L1 / L2
return round(a, 4)
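
The expression in `p_value_estimated` above can be evaluated as a plain function to see how each term behaves. The following is a minimal sketch that copies the arithmetic only; the function name `block_p_value` and its parameter names are hypothetical, and it leaves out the side effect on `points_init['times']`.

```python
def block_p_value(score, m, top_grade, n1, n, l1, l2):
    """Evaluate the block significance estimate used above.

    score      path score of the block
    m          number of points on the path
    top_grade  highest grading value (self.grading[0] in the class)
    n1         sum of the 'times' counters of the points inside the bounding box
    n          number of points inside the bounding box
    l1, l2     side lengths of the bounding box
    """
    a = (1 - score / m / top_grade) * (n1 - m + 1) / n \
        * (l1 - m + 1) * (l2 - m + 1) / l1 / l2
    return round(a, 4)


if __name__ == "__main__":
    # A path made only of top-graded points with no gap penalty evaluates to 0.0
    print(block_p_value(score=500, m=10, top_grade=50, n1=10, n=10, l1=10, l2=10))
```

The first factor shrinks as the mean score per point approaches the top grade, and the last two factors shrink as the path fills more of its bounding box.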
================================================
FILE: build/lib/wgdi/dotplot.py
================================================
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base
class dotplot():
def __init__(self, options):
self.multiple = 1
self.score = 100
self.evalue = 1e-5
self.repeat_number = 20
self.markersize = 0.5
self.figsize = 'default'
self.position = 'order'
self.ancestor_top = None
self.ancestor_left = None
self.blast_reverse = False
for k, v in options:
setattr(self, str(k), v)
print(k, ' = ', v)
if self.ancestor_top == 'none' or self.ancestor_top == '':
self.ancestor_top = None
if self.ancestor_left == 'none' or self.ancestor_left == '':
self.ancestor_left = None
        # Store the converted value; the bare call discarded its result
        self.blast_reverse = base.str_to_bool(self.blast_reverse)
def pair_positon(self, blast, gff1, gff2, rednum, repeat_number):
blast['color'] = ''
blast['loc1'] = blast[0].map(gff1['loc'])
blast['loc2'] = blast[1].map(gff2['loc'])
bluenum = 5+rednum
index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()
for name, group in blast.groupby([0])]
reddata = np.array([k[:rednum] for k in index], dtype=object)
bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)
graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)
if len(reddata):
redindex = np.concatenate(reddata)
else:
redindex = []
if len(bluedata):
blueindex = np.concatenate(bluedata)
else:
blueindex = []
if len(graydata):
grayindex = np.concatenate(graydata)
else:
grayindex = []
blast.loc[redindex, 'color'] = 'red'
blast.loc[blueindex, 'color'] = 'blue'
blast.loc[grayindex, 'color'] = 'gray'
return blast[blast['color'].str.contains(r'\w')]
def run(self):
axis = [0, 1, 1, 0]
left, right, top, bottom = 0.07, 0.97, 0.93, 0.03
lens1 = base.newlens(self.lens1, self.position)
lens2 = base.newlens(self.lens2, self.position)
step1 = 1 / float(lens1.sum())
step2 = 1 / float(lens2.sum())
if self.ancestor_left != None:
axis[0] = -0.02
lens_ancestor_left = pd.read_csv(
self.ancestor_left, sep="\t", header=None)
lens_ancestor_left[0] = lens_ancestor_left[0].astype(str)
lens_ancestor_left[3] = lens_ancestor_left[3].astype(str)
lens_ancestor_left[4] = lens_ancestor_left[4].astype(int)
lens_ancestor_left[4] = lens_ancestor_left[4] / lens_ancestor_left[4].max()
lens_ancestor_left = lens_ancestor_left[lens_ancestor_left[0].isin(
lens1.index)]
if self.ancestor_top != None:
axis[3] = -0.02
lens_ancestor_top = pd.read_csv(
self.ancestor_top, sep="\t", header=None)
lens_ancestor_top[0] = lens_ancestor_top[0].astype(str)
lens_ancestor_top[3] = lens_ancestor_top[3].astype(str)
lens_ancestor_top[4] = lens_ancestor_top[4].astype(int)
lens_ancestor_top[4] = lens_ancestor_top[4] / lens_ancestor_top[4].max()
lens_ancestor_top = lens_ancestor_top[lens_ancestor_top[0].isin(
lens2.index)]
if re.search(r'\d', self.figsize):
self.figsize = [float(k) for k in self.figsize.split(',')]
else:
self.figsize = np.array(
[1, float(lens1.sum())/float(lens2.sum())])*10
plt.rcParams['ytick.major.pad'] = 0
fig, ax = plt.subplots(figsize=self.figsize)
ax.xaxis.set_ticks_position('top')
base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,
self.genome1_name, self.genome2_name, [axis[0], axis[3]])
gff1 = base.newgff(self.gff1)
gff2 = base.newgff(self.gff2)
gff1 = base.gene_location(gff1, lens1, step1, self.position)
gff2 = base.gene_location(gff2, lens2, step2, self.position)
        if self.ancestor_top is not None:
            self.area_top = self.ancestor_posion(ax, gff2, lens_ancestor_top, 'top')
        if self.ancestor_left is not None:
            self.area_left = self.ancestor_posion(ax, gff1, lens_ancestor_left, 'left')
print('read gffs')
blast = base.newblast(self.blast, int(self.score),
float(self.evalue), gff1, gff2, self.blast_reverse)
        if len(blast) == 0:
            # Exit with a non-zero status and the message on stderr
            raise SystemExit('Stopped!\n\nThe gene IDs in the blast file do not match those in gff1 and gff2.')
print('read blast')
df = self.pair_positon(blast, gff1, gff2,
int(self.multiple), int(self.repeat_number))
print('deal blast')
ax.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'],
alpha=0.5, edgecolors=None, linewidths=0, marker='o')
ax.axis(axis)
plt.subplots_adjust(left=left, right=right, top=top, bottom=bottom)
plt.savefig(self.savefig, dpi=300)
plt.show()
def ancestor_posion(self, ax, gff, lens, mark):
data = []
for index, row in lens.iterrows():
loc1 = gff[(gff['chr'] == row[0]) & (
gff['order'] == int(row[1]))].index
loc2 = gff[(gff['chr'] == row[0]) & (
gff['order'] == int(row[2])-1)].index
loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']
if mark == 'top':
width = abs(loc1-loc2)
loc = [min(loc1, loc2), 0]
height = -0.02
base.Rectangle(ax, loc, height, width, row[3], row[4])
if mark == 'left':
height = abs(loc1-loc2)
                loc = [-0.02, min(loc1, loc2)]
width = 0.02
base.Rectangle(ax, loc, height, width, row[3], row[4])
data.append([loc, height, width, row[3], row[4]])
return data
================================================
FILE: build/lib/wgdi/example/__init__.py
================================================
================================================
FILE: build/lib/wgdi/example/align.conf
================================================
[alignment]
blockinfo = block information file (.csv)
blockinfo_reverse = false
classid = class1
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
genome1_name = Genome1 name
genome2_name = Genome2 name
markersize = 0.5
ks_area = -1,3
position = order
colors = red,blue,green
figsize = 10,10
savefile = savefile(.csv)
savefig = save image (.png, .pdf, .svg)
================================================
FILE: build/lib/wgdi/example/alignmenttrees.conf
================================================
[alignmenttrees]
alignment = alignment file (.csv)
gff = gff file (reference genome; delete this line if the alignment has no reference species)
lens = lens file (delete this line if the alignment has no reference species)
dir = output folder
sequence_file = sequence file (.fa)
cds_file = cds file (.fa)
codon_positon = 1,2,3 (1,2 mean codon1&2; 1,2,3 mean no codon removed)
trees_file = trees (.nwk)
align_software = (mafft,muscle)
tree_software = (iqtree,fasttree)
threads = 1 (Number,AUTO)
model = MFP
trimming = (trimal,divvier)
minimum = 4
delete_detail = true
================================================
FILE: build/lib/wgdi/example/ancestral_karyotype.conf
================================================
[ancestral_karyotype]
gff = gff file (cat the relevant 'gff' files into a file)
pep_file = pep file (cat the relevant 'pep.fa' files into a file)
ancestor = ancestor file (you must provide this file)
mark = aak
ancestor_gff = result file
ancestor_lens = result file
ancestor_pep = result file
ancestor_file = result file
================================================
FILE: build/lib/wgdi/example/ancestral_karyotype_repertoire.conf
================================================
[ancestral_karyotype_repertoire]
blockinfo = block information (*.csv)
# blockinfo: processed *.csv
blockinfo_reverse = False
gff1 = gff1 file (ancestor's gff)
gff2 = gff2 file (the other species's gff)
gap = 5
mark = aak1s
ancestor = ancestor file
#current ancestor file
ancestor_new = result file
ancestor_pep = ancestor pep file
#cat all pep files together
ancestor_pep_new = result file
ancestor_gff = result file
ancestor_lens = result file
================================================
FILE: build/lib/wgdi/example/blockinfo.conf
================================================
[blockinfo]
blast = blast file
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
collinearity = collinearity file
score = 100
evalue = 1e-5
repeat_number = 20
position = order
ks = ks file
ks_col = ks_NG86
savefile = block information (*.csv)
================================================
FILE: build/lib/wgdi/example/blockks.conf
================================================
[blockks]
lens1 = lens1 file
lens2 = lens2 file
genome1_name = Genome1 name
genome2_name = Genome2 name
blockinfo = block information (*.csv)
pvalue = 0.2
tandem = true
tandem_length = 200
markersize = 1
area = 0,2
block_length = minimum length
figsize = 8,8
savefig = save image(.png, .pdf, .svg)
================================================
FILE: build/lib/wgdi/example/circos.conf
================================================
[circos]
gff = gff file
lens = lens file
radius = 0.2
angle_gap = 0.05
ring_width = 0.015
colors = 1:c,2:m,3:blue,4:gold,5:red,6:lawngreen,7:darkgreen,8:k,9:darkred,10:gray
alignment = alignment file
chr_label = chr
ancestor = ancestor alignment file
ancestor_location = ancestor file
figsize = 10,10
label_size = 9
position = order
legend_square = 0.04, 0.04
column_names = 1,2,3,4,5
savefig = result(.png, .pdf, .svg)
================================================
FILE: build/lib/wgdi/example/collinearity.conf
================================================
[collinearity]
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
blast = blast file
blast_reverse = false
comparison = genomes
multiple = 1
process = 8
evalue = 1e-5
score = 100
grading = 50,30,25
mg = 25,25
pvalue = 1
repeat_number = 20
position = order
savefile = collinearity file
================================================
FILE: build/lib/wgdi/example/conf.ini
================================================
[ini]
mafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft
pal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2nal.pl
yn00_path = /home/sunpc/micromamba/envs/wgdi/bin/yn00
muscle_path = /home/sunpc/micromamba/envs/wgdi/bin/muscle
iqtree_path = /home/sunpc/micromamba/envs/wgdi/bin/iqtree
trimal_path = /home/sunpc/micromamba/envs/wgdi/bin/trimal
fasttree_path = /home/sunpc/micromamba/envs/wgdi/bin/fasttree
divvier_path = /home/sunpc/micromamba/envs/wgdi/bin/divvier
================================================
FILE: build/lib/wgdi/example/corr.conf
================================================
[correspondence]
blockinfo = blockinfo file(.csv)
lens1 = lens1 file
lens2 = lens2 file
tandem = true
tandem_length = 200
pvalue = 0.2
block_length = 5
tandem_ratio = 0.5
multiple = 1
homo = -1,1
savefile = savefile(.csv)
================================================
FILE: build/lib/wgdi/example/dotplot.conf
================================================
[dotplot]
blast = blast file
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
genome1_name = Genome1 name
genome2_name = Genome2 name
multiple = 1
score = 100
evalue = 1e-5
repeat_number = 10
position = order
blast_reverse = false
ancestor_left = ancestor file or none
ancestor_top = ancestor file or none
markersize = 0.5
figsize = 10,10
savefig = savefile(.png, .pdf, .svg)
================================================
FILE: build/lib/wgdi/example/fusion_positions_database.conf
================================================
[fusion_positions_database]
pep = pep file
gff = gff file
fusion_positions = fusion_positions file
# Number of gene sets on each side of the breakpoint
ancestor_gff = result file
ancestor_lens = result file
ancestor_pep = result file
ancestor_file = result file
================================================
FILE: build/lib/wgdi/example/fusions_detection.conf
================================================
[fusions_detection]
blockinfo = block information (*.csv)
ancestor = ancestor file
#The number of genes spanned by a synteny block on both sides of a breakpoint.
min_genes_per_side = 5
density = 0.3
filtered_blockinfo = result blockinfo (.csv)
================================================
FILE: build/lib/wgdi/example/karyotype.conf
================================================
[karyotype]
ancestor = ancestor chromosome file
width = 0.5
figsize = 10,6.18
savefig = save image(.png, .pdf, .svg)
================================================
FILE: build/lib/wgdi/example/karyotype_mapping.conf
================================================
[karyotype_mapping]
blast = blast file
blast_reverse = false
gff1 = gff1 file
gff2 = gff2 file
score = 100
evalue = 1e-5
repeat_number = 5
ancestor_left = ancestor location file (keep only one of ancestor_left and ancestor_top)
ancestor_top = ancestor location file
the_other_lens = the other lens file
blockinfo = block information (*.csv)
blockinfo_reverse = false
limit_length = 5
the_other_ancestor_file = result file
================================================
FILE: build/lib/wgdi/example/ks.conf
================================================
[ks]
cds_file = cds file
#cat all cds files together
pep_file = pep file
#cat all pep files together
align_software = muscle
pairs_file = gene pairs file
ks_file = ks result
================================================
FILE: build/lib/wgdi/example/ks_fit_result.csv
================================================
,color,linewidth,linestyle,,,,,,
csa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862
vvi_vvi,blue,2,-,3.00367275,1.288717936,0.177816426,,,
vvi_oin_gamma,orange,2,-,1.910418336,1.328469514,0.262257112,,,
vvi_oin,orange,2,--,4.948194212,0.882608858,0.10426873,,,
vvi_csa,green,2,--,2.470770292464022,1.4131842495219498,0.21391959288821544,,,
================================================
FILE: build/lib/wgdi/example/ksfigure.conf
================================================
[ksfigure]
ksfit = ksfit result(*.csv)
labelfontsize = 15
legendfontsize = 15
xlabel = none
ylabel = none
title = none
area = 0,2
figsize = 10,6.18
shadow = true (true/false)
savefig = save image(.png, .pdf, .svg)
================================================
FILE: build/lib/wgdi/example/kspeaks.conf
================================================
[kspeaks]
blockinfo = block information (*.csv)
pvalue = 0.2
tandem = true
block_length = int number
ks_area = 0,10
multiple = 1
homo = 0,1
fontsize = 9
area = 0,3
figsize = 10,6.18
savefig = saving image(.png,.pdf)
savefile = ks median savefile
================================================
FILE: build/lib/wgdi/example/peaksfit.conf
================================================
[peaksfit]
blockinfo = block information (*.csv)
mode = median
bins_number = 200
ks_area = 0,10
fontsize = 9
area = 0,3
figsize = 10,6.18
shadow = true
savefig = saving image(.png,.pdf,.svg)
================================================
FILE: build/lib/wgdi/example/pindex.conf
================================================
[pindex]
alignment = alignment file (.csv)
gff = gff file
lens = lens file
gap = 50
retention = 0.05
diff = 0.05
remove_delta = (true/false)
savefile = result file(.csv)
================================================
FILE: build/lib/wgdi/example/polyploidy_classification.conf
================================================
[polyploidy classification]
blockinfo = block information (*.csv)
ancestor_left = ancestor file
ancestor_top = ancestor file
classid = class1,class2
same_protochromosome = False
same_subgenome = False
savefile = result file(.csv)
================================================
FILE: build/lib/wgdi/example/retain.conf
================================================
[retain]
alignment = alignment file
gff = gff file
lens = lens file
colors = red,blue,green
refgenome = shorthand
figsize = 10,12
step = 50
ylabel = y label
savefile = retain file (result)
savefig = result(.png, .pdf, .svg)
================================================
FILE: build/lib/wgdi/example/shared_fusion.conf
================================================
[shared_fusion]
blockinfo = block information (*.csv)
# The new lens file is the output filtered by lens file.
lens1 = lens file, new lens file
lens2 = lens file, new lens file
ancestor_left = ancestor file
ancestor_top = ancestor file
classid = class1,class2
limit_length = 5
filtered_blockinfo = result blockinfo (.csv)
================================================
FILE: build/lib/wgdi/fusion_positions_database.py
================================================
import pandas as pd
import os
from Bio import SeqIO
class fusion_positions_database:
def __init__(self, options):
for k, v in options:
setattr(self, k, v)
print(f'{k} = {v}')
def run(self):
# Load and remove duplicates from data
gff = pd.read_csv(self.gff, sep="\t", header=None, dtype={0: str, 5: int}).drop_duplicates()
pep = SeqIO.to_dict(SeqIO.parse(self.pep, "fasta"))
df = pd.read_csv(self.fusion_positions, sep="\t", header=None, dtype={0: str, 1: int, 2:int, 3:str}).drop_duplicates()
# Load ancestral sequence file if it exists
seqs = SeqIO.to_dict(SeqIO.parse(self.ancestor_pep, "fasta")) if os.path.exists(self.ancestor_pep) else {}
sf_gff, sf_lens = [], []
# Process fusion positions
for _, row in df.iterrows():
newchr = row[3]
newgff = gff[(gff[0] == row[0]) &
(gff[5] >= row[1] - row[2]) &
(gff[5] < row[1] + row[2])].copy()
newgff['id'] = [f"{newchr}s{str(row[0]).zfill(2)}g{str(i).zfill(3)}" for i in range(1, len(newgff) + 1)]
sf_position = row[1] - newgff.iloc[0, 5]
sf_lens.append([newchr, sf_position, len(newgff)])
# For each gene in the filtered GFF region
for _, gff_row in newgff.iterrows():
if gff_row[1] in pep and gff_row['id'] not in seqs:
gene = pep[gff_row[1]][:]
gene.id, gene.description = gff_row['id'], ''
seqs[gff_row['id']] = gene
# Collect data for the final GFF output
sf_gff.append([gff_row['id'], newchr, sf_position, gff_row[2], gff_row[3], gff_row[4], gff_row[1]])
# Write sequences to FASTA file
SeqIO.write(seqs.values(), self.ancestor_pep, 'fasta')
# Save filtered GFF data
if sf_gff:
sf_gff = pd.DataFrame(sf_gff)
sf_gff.rename(columns={3: 'start', 4: 'end', 5: 'strand'}, inplace=True)
sf_gff['order'] = sf_gff[0].str[-3:].astype(int)
sf_gff[[1, 0, 'start', 'end', 'strand', 'order', 6]].to_csv(self.ancestor_gff, sep="\t", mode='a', index=False, header=None)
sf_lens = pd.DataFrame(sf_lens).drop_duplicates()
sf_lens.to_csv(self.ancestor_lens, sep="\t", mode='a', index=False, header=None)
# Generate ancestral sequence data
ancestor = []
for _, row in sf_lens.iterrows():
ancestor.append([row[0], 1, row[1], 'red', 1])
ancestor.append([row[0], row[1] + 1, row[2], 'blue', 1])
pd.DataFrame(ancestor).to_csv(self.ancestor_file, sep="\t", mode='a', index=False, header=None)
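# Remove duplicate lines from the output files.
# Each file is read as one column of whole lines (no commas in the data), and
# to_csv() returns None, so its result is not assigned.
for file in [self.ancestor_gff, self.ancestor_lens, self.ancestor_file]:
pd.read_csv(file, header=None).drop_duplicates().to_csv(file, index=False, header=None)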
================================================
FILE: build/lib/wgdi/fusions_detection.py
================================================
import pandas as pd
from tabulate import tabulate
class fusions_detection:
def __init__(self, options):
self.min_genes_per_side = 5
self.density = 0.3
for k, v in options:
setattr(self, k, v)
print(f"{k} = {v}")
self.min_genes_per_side = int(self.min_genes_per_side)
self.density = float(self.density)
def run(self):
# Load the ancestor file and process the positions
ancestor = pd.read_csv(self.ancestor, sep='\t', header=None)
position = ancestor.groupby(0)[2].unique().apply(pd.Series)
bkinfo = pd.read_csv(self.blockinfo)
newbkinfo = bkinfo.head(0)
# Iterate over each row in the position dataframe
for index, row in position.iterrows():
# Filter the bkinfo dataframe based on chr2 and density
filtered_group = bkinfo[(bkinfo['chr2'] == index) & (bkinfo['density2'] >= self.density)].copy()
# Split the block2 column and stack the resulting series
df = filtered_group['block2'].str.split('_', expand=True).stack().astype(int)
# Count the number of genes greater and less than the current position
filtered_group['greater'] = (df > row[0]).groupby(level=0).sum()
filtered_group['less'] = (df < row[0]).groupby(level=0).sum()
# Filter the group based on the minimum number of genes per side
filtered_group = filtered_group[(filtered_group['greater'] >= self.min_genes_per_side) & (filtered_group['less'] >= self.min_genes_per_side)]
# Concatenate the filtered group with the newbkinfo dataframe
newbkinfo = pd.concat([newbkinfo, filtered_group])
if len(newbkinfo) == 0:
print("\nNo shared fusion breakpoints detected")
exit(0)
# Get and print the shared fusion positions
newbkinfo.to_csv(self.filtered_blockinfo, header=True, index=False)
non_overlap_counts = newbkinfo.groupby('chr2').apply(self.count_non_overlapping)
data = [(chr2, count) for chr2, count in non_overlap_counts.items()]
print("\nThe following are the shared fusion breakpoints and counts:")
print(tabulate(data, headers=["Fusion Breakpoint", "Count"], tablefmt="github"))
def count_non_overlapping(self, group):
if len(group) == 1:
return 1
grouped = group.groupby('chr1')
total_count = 0
for chr1, chr_group in grouped:
chr_group = chr_group.sort_values(by='start1').reset_index(drop=True)
count = 0
current_end = -1
for _, row in chr_group.iterrows():
start1, end1 = row['start1'], row['end1']
if start1 > current_end:
count += 1
current_end = end1
total_count += count
return total_count
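# Illustrative sketch, not part of wgdi: the per-chromosome loop in
# count_non_overlapping() is a greedy count over blocks sorted by start.
# A block is counted only if it starts after the end of the last counted block.
# The function name and the tuple layout (start, end) are assumptions made for
# this example; the real method works on DataFrame rows with start1/end1 columns.
```python
def count_non_overlapping_intervals(intervals):
    """Greedy count of blocks that start after the last counted block ends."""
    count = 0
    current_end = -1
    for start, end in sorted(intervals, key=lambda iv: iv[0]):
        if start > current_end:
            count += 1
            current_end = end  # only updated when a block is counted
    return count


# Two clusters of overlapping blocks give a count of two.
blocks = [(1, 10), (5, 15), (20, 30), (22, 40)]
result = count_non_overlapping_intervals(blocks)
```
# As in the method above, current_end moves only when a block is counted, so
# an overlapping block that extends further right does not widen the window.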
================================================
FILE: build/lib/wgdi/karyotype.py
================================================
import matplotlib.pyplot as plt
import pandas as pd
import wgdi.base as base
class karyotype():
def __init__(self, options):
self.width = 0.5
for k, v in options:
setattr(self, str(k), v)
print(str(k), ' = ', v)
if hasattr(self, 'figsize'):
self.figsize = [float(k) for k in self.figsize.split(',')]
else:
self.figsize = 10, 6.18
if hasattr(self, 'width'):
self.width = float(self.width)
else:
self.width = 0.5
def run(self):
fig, ax = plt.subplots(figsize=self.figsize)
ancestor_lens = pd.read_csv(
self.ancestor, sep="\t", header=None)
ancestor_lens[0] = ancestor_lens[0].astype(str)
ancestor_lens[3] = ancestor_lens[3].astype(str)
ancestor_lens[4] = ancestor_lens[4].astype(int)
ancestor_lens[4] = ancestor_lens[4] / ancestor_lens[4].max()
chrs = ancestor_lens[0].drop_duplicates().to_list()
ax.bar(chrs, 10, color='white', alpha=0)
for index, row in ancestor_lens.iterrows():
base.Rectangle(ax, [chrs.index(row[0])-self.width*0.5,
row[1]], row[2]-row[1], self.width, row[3], row[4])
ax.tick_params(labelsize=15)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.set_xticks([])
ax.set_yticks([])
plt.savefig(self.savefig, dpi=500)
plt.show()
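# Illustrative sketch, not part of wgdi: the classes in this package receive
# their options as (key, value) pairs and store them with setattr(), so every
# value arrives as a string and must be converted explicitly. The example below
# assumes the pairs come from configparser; the Options class, CONF_TEXT, and
# the file names in it are hypothetical.
```python
import configparser

CONF_TEXT = """
[karyotype]
ancestor = ancestor.txt
width = 0.5
figsize = 10,6.18
savefig = karyotype.png
"""


class Options:
    """Hypothetical holder mirroring the setattr loop used by the classes above."""

    def __init__(self, options):
        # Defaults are kept as strings so the conversion below is uniform.
        self.width = "0.5"
        self.figsize = "10,6.18"
        for key, value in options:
            setattr(self, key, value)
        # Convert the string values to the types the plotting code needs.
        self.width = float(self.width)
        self.figsize = [float(x) for x in self.figsize.split(",")]


parser = configparser.ConfigParser()
parser.read_string(CONF_TEXT)
opts = Options(parser.items("karyotype"))
```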
================================================
FILE: build/lib/wgdi/karyotype_mapping.py
================================================
import numpy as np
import pandas as pd
import wgdi.base as base
class karyotype_mapping:
def __init__(self, options):
# Initialize default attributes
self.blast_reverse = False
self.blockinfo_reverse = False
self.position = 'order'
self.block_length = 5
self.limit_length = 5
self.repeat_number = 20
self.score = 100
self.evalue = 1e-5
# Update attributes with provided keyword arguments and print them
for k, v in options:
setattr(self, k, v)
print(f"{k} = {v}")
self.blast_reverse = base.str_to_bool(self.blast_reverse)
self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
self.limit_length = int(self.limit_length)
def karyotype_left(self, pairs, ancestor, gff1, gff2):
# Loop through each row in ancestor to set color and classification in gff1
for _, row in ancestor.iterrows():
loc_min, loc_max = sorted([row[1], row[2]])
index1 = gff1[(gff1['chr'] == row[0]) &
(gff1['order'] >= loc_min) &
(gff1['order'] <= loc_max)].index
gff1.loc[index1, ['color', 'classification']] = row[3], row[4]
# Merge pairs with gff1 and update gff2 with color and classification
data = pd.merge(pairs, gff1, left_on=0, right_index=True, how='left')
data.drop_duplicates(subset=[1], inplace=True)
data.set_index(1, inplace=True)
gff2.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]
return gff2
def karyotype_top(self, pairs, ancestor, gff1, gff2):
# Loop through each row in ancestor to set color and classification in gff2
for _, row in ancestor.iterrows():
loc_min, loc_max = sorted([row[1], row[2]])
index1 = gff2[(gff2['chr'] == row[0]) &
(gff2['order'] >= loc_min) &
(gff2['order'] <= loc_max)].index
gff2.loc[index1, ['color', 'classification']] = row[3], row[4]
# Merge pairs with gff2 and update gff1 with color and classification
data = pd.merge(pairs, gff2, left_on=1, right_index=True, how='left')
data.drop_duplicates(subset=[0], inplace=True)
data.set_index(0, inplace=True)
gff1.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]
return gff1
def karyotype_map(self, gff, lens):
# Filter gff based on lens index and non-null color
gff = gff[gff['chr'].isin(lens.index) & gff['color'].notnull()]
ancestor = []
# Group by chromosome and process each group to create ancestor records
for chr, group in gff.groupby('chr'):
color, class_id, arr = '', 1, []
for _, row in group.iterrows():
if color == row['color'] and class_id == row['classification']:
arr.append(row['order'])
else:
if len(arr) >= self.limit_length:
ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])
color, class_id = row['color'], row['classification']
arr = []
if len(ancestor) >= 1 and color == ancestor[-1][3] and class_id == ancestor[-1][4] and chr == ancestor[-1][0]:
arr.append(ancestor[-1][1])
arr += np.random.randint(ancestor[-1][1], ancestor[-1][2], size=ancestor[-1][5]-1).tolist()
ancestor.pop()
arr.append(row['order'])
if len(arr) >= self.limit_length:
ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])
ancestor = pd.DataFrame(ancestor)
# Adjust min and max positions for each chromosome group
for chr, group in ancestor.groupby(0):
ancestor.loc[group.index[0], 1] = 1
ancestor.loc[group.index[-1], 2] = lens[chr]
ancestor[4] = ancestor[4].astype(int)
return ancestor[[0, 1, 2, 3, 4, 5]]
def colinear_gene_pairs(self, bkinfo, gff1, gff2):
gff1 = gff1.reset_index()
gff2 = gff2.reset_index()
gff1_indexed = gff1.set_index(['chr', 'order'])
gff2_indexed = gff2.set_index(['chr', 'order'])
data = []
for _, row in bkinfo.iterrows():
b1 = list(map(int, row['block1'].split('_')))
b2 = list(map(int, row['block2'].split('_')))
for order1, order2 in zip(b1, b2):
a = gff1_indexed.loc[(row['chr1'], order1), 1]
b = gff2_indexed.loc[(row['chr2'], order2), 1]
data.append([a, b])
return pd.DataFrame(data)
def new_ancestor(self, ancestor, gff1, gff2, blast):
# Iterate through ancestor rows to adjust positions based on neighboring rows
for i in range(1, len(ancestor)):
if ancestor.iloc[i, 0] == ancestor.iloc[i-1, 0]:
area = ancestor.iloc[i, 1] - ancestor.iloc[i-1, 2]
if area <= 5:
ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1
else:
index1 = gff1[(gff1['chr'] == ancestor.iloc[i, 0]) &
(gff1['order'] >= ancestor.iloc[i-1, 2]+1) &
(gff1['order'] <= ancestor.iloc[i, 1]-1)].index
index2 = gff2[gff2['color'] == ancestor.iloc[i-1, 3]].index
index3 = gff2[gff2['color'] == ancestor.iloc[i, 3]].index
newblast1 = blast[(blast[0].isin(index1)) & (blast[1].isin(index2))]
newblast2 = blast[(blast[0].isin(index1)) & (blast[1].isin(index3))]
if len(newblast1) >= len(newblast2):
ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1
else:
ancestor.iloc[i, 1] = ancestor.iloc[i-1, 2] + 1
for chr, group in ancestor.groupby(0):
if len(group) == 1:
continue
newgff1 = gff1[gff1['chr'] == chr]
for i in range(1, len(group)):
if group.iloc[i, 5] > 200:
continue
index_left = newgff1[(newgff1['order'] >= group.iloc[i, 1]) &
(newgff1['order'] <= group.iloc[i, 2])].index
blast_left = blast[blast[0].isin(index_left)]
index_prev = gff2[gff2['color'] == group.iloc[i-1, 3]].index
blast_prev = blast_left[blast_left[1].isin(index_prev)]
index_curr = gff2[gff2['color'] == group.iloc[i, 3]].index
blast_curr = blast_left[blast_left[1].isin(index_curr)]
if len(blast_curr) <= len(blast_prev):
ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]-1,3]
if i < len(group)-1:
index_next = gff2[gff2['color'] == group.iloc[i+1, 3]].index
blast_next = blast_left[blast_left[1].isin(index_next)]
if len(blast_next) > max(len(blast_prev),len(blast_curr)):
ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]+1,3]
ancestor['group'] = (ancestor[0].shift(1) != ancestor[0]) | (ancestor[3].shift(1) != ancestor[3]) | (ancestor[4].shift(1) != ancestor[4])
ancestor['group'] = ancestor['group'].cumsum()
result = ancestor.groupby('group').agg({
0: 'first',
1: 'min',
2: 'max',
3: 'first',
4: 'first',
}).reset_index(drop=True)
return result
def run(self):
# Read and process block information
bkinfo = pd.read_csv(self.blockinfo, index_col='id')
bkinfo['chr1'] = bkinfo['chr1'].astype(str)
bkinfo['chr2'] = bkinfo['chr2'].astype(str)
if self.blockinfo_reverse:
bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
bkinfo = bkinfo[bkinfo['length'] > int(self.block_length)]
# Read GFF and lens data
gff1 = base.newgff(self.gff1)
gff2 = base.newgff(self.gff2)
lens = base.newlens(self.the_other_lens, self.position)
blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)
# blast.drop_duplicates(subset=[0], keep='first', inplace=True)
# Find colinear gene pairs
pairs = self.colinear_gene_pairs(bkinfo, gff1, gff2)
# Depending on available attributes, call either karyotype_top or karyotype_left
if hasattr(self, 'ancestor_top'):
ancestor = base.read_classification(self.ancestor_top)
data = self.karyotype_top(pairs, ancestor, gff1, gff2)
elif hasattr(self, 'ancestor_left'):
ancestor = base.read_classification(self.ancestor_left)
data = self.karyotype_left(pairs, ancestor, gff1, gff2)
gff1, gff2 = gff2, gff1
blast.iloc[:, :2] = blast.iloc[:, [1, 0]].to_numpy()
else:
print('Missing ancestor file.')
exit(0)
# Map the data and create the final ancestor file
the_other_ancestor_file = self.karyotype_map(data, lens)
the_other_ancestor_file = self.new_ancestor(the_other_ancestor_file, gff1, gff2, blast)
the_other_ancestor_file.to_csv(self.the_other_ancestor_file, sep='\t', header=False, index=False)
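# Illustrative sketch, not part of wgdi: the last step of new_ancestor() merges
# consecutive rows that share chromosome, color, and class, keeping the smallest
# start and the largest end. The pandas version does this with shift() and
# cumsum(); the same idea is shown here with itertools.groupby on plain tuples.
# The function name and tuple layout (chr, start, end, color, class_id) are
# assumptions made for this example.
```python
from itertools import groupby


def merge_runs(segments):
    """Merge consecutive segments that share chr, color and class_id."""
    merged = []
    for key, run in groupby(segments, key=lambda s: (s[0], s[3], s[4])):
        run = list(run)
        start = min(s[1] for s in run)
        end = max(s[2] for s in run)
        merged.append((key[0], start, end, key[1], key[2]))
    return merged


segments = [
    ("1", 1, 100, "red", 1),
    ("1", 101, 250, "red", 1),
    ("1", 251, 400, "blue", 1),
    ("2", 1, 300, "blue", 1),
]
merged = merge_runs(segments)
```
# Only adjacent rows are merged, so the blue segment on chromosome 2 stays
# separate from the blue segment on chromosome 1.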
================================================
FILE: build/lib/wgdi/ks.py
================================================
import os
import sys
import numpy as np
import pandas as pd
from Bio import SeqIO
import subprocess
from Bio.Phylo.PAML import yn00
import wgdi.base as base
class ks:
def __init__(self, options):
base_conf = base.config()
self.pair_pep_file = 'pair.pep'
self.pair_cds_file = 'pair.cds'
self.prot_align_file = 'prot.aln'
self.mrtrans = 'pair.mrtrans'
self.pair_yn = 'pair.yn'
for k, v in base_conf:
setattr(self, str(k), v)
for k, v in options:
setattr(self, str(k), v)
print(f'{str(k)} = {v}')
def auto_file(self):
pairs = []
with open(self.pairs_file) as f:
p = ' '.join(f.readlines()[:30])
# Detect file format and process accordingly
if 'path length' in p or 'MAXIMUM GAP' in p:
collinearity = base.read_colinearscan(self.pairs_file)
pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
elif 'MATCH_SIZE' in p or '## Alignment' in p:
collinearity = base.read_mcscanx(self.pairs_file)
pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
elif '# Alignment' in p:
collinearity = base.read_collinearity(self.pairs_file)
pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
elif '###' in p:
collinearity = base.read_jcvi(self.pairs_file)
pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
elif ',' in p:
collinearity = pd.read_csv(self.pairs_file, header=None)
pairs = collinearity.values.tolist()
else:
collinearity = pd.read_csv(self.pairs_file, header=None, sep='\t')
pairs = collinearity.values.tolist()
df = pd.DataFrame(pairs).drop_duplicates()
df[0] = df[0].astype(str)
df[1] = df[1].astype(str)
df.index = df[0] + ',' + df[1]
return df
def run(self):
# Load sequence data
cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta"))
pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta"))
df_pairs = self.auto_file()
# Check if ks file exists and load it, otherwise create a new one
if os.path.exists(self.ks_file):
ks = pd.read_csv(self.ks_file, sep='\t').drop_duplicates()
kscopy = ks.copy()
names = ks.columns.tolist()
names[0], names[1] = names[1], names[0]
kscopy.columns = names
ks = pd.concat([ks, kscopy])
ks['id'] = ks['id1'] + ',' + ks['id2']
df_pairs.drop(np.intersect1d(df_pairs.index, ks['id'].to_numpy()), inplace=True)
ks_file = open(self.ks_file, 'a+')
else:
ks_file = open(self.ks_file, 'w')
ks_file.write('\t'.join(['id1', 'id2', 'ka_NG86', 'ks_NG86', 'ka_YN00', 'ks_YN00']) + '\n')
# Filter valid pairs based on sequence data
df_pairs = df_pairs[
(df_pairs[0].isin(cds.keys())) & (df_pairs[1].isin(cds.keys())) &
(df_pairs[0].isin(pep.keys())) & (df_pairs[1].isin(pep.keys()))
]
pairs = df_pairs[[0, 1]].to_numpy()
if len(pairs) > 0 and pairs[0][0][:3] == pairs[0][1][:3]:
allpairs = []
pair_hash = {}
for k in pairs:
if k[0] + ',' + k[1] in pair_hash or k[1] + ',' + k[0] in pair_hash:
continue
else:
pair_hash[k[0] + ',' + k[1]] = 1
pair_hash[k[1] + ',' + k[0]] = 1
allpairs.append(k)
pairs = allpairs
for k in pairs:
cds_gene1, cds_gene2 = cds[k[0]], cds[k[1]]
cds_gene1.id, cds_gene2.id = 'gene1', 'gene2'
pep_gene1, pep_gene2 = pep[k[0]], pep[k[1]]
pep_gene1.id, pep_gene2.id = 'gene1', 'gene2'
# Write sequences to files
SeqIO.write([cds[k[0]], cds[k[1]]], self.pair_cds_file, "fasta")
SeqIO.write([pep[k[0]], pep[k[1]]], self.pair_pep_file, "fasta")
# Compute Ka/Ks values
kaks = self.pair_kaks(['gene1', 'gene2'])
if kaks is None:
continue
ks_file.write('\t'.join([str(i) for i in list(k) + list(kaks)]) + '\n')
ks_file.close()
# Clean up temporary files
for file in [
self.pair_pep_file, self.pair_cds_file, self.mrtrans, self.pair_yn,
self.prot_align_file, '2YN.dN', '2YN.dS', '2YN.t', 'rst', 'rst1', 'yn00.ctl', 'rub'
]:
try:
os.remove(file)
except OSError:
pass
def pair_kaks(self, k):
self.align()
pal = self.pal2nal()
if not pal:
return []
kaks = self.run_yn00()
if kaks is None:
return []
kaks_new = [
kaks[k[0]][k[1]]['NG86']['dN'], kaks[k[0]][k[1]]['NG86']['dS'],
kaks[k[0]][k[1]]['YN00']['dN'], kaks[k[0]][k[1]]['YN00']['dS']
]
return kaks_new
def align(self):
if self.align_software == 'mafft':
try:
command = [self.mafft_path, '--quiet', self.pair_pep_file, '>', self.prot_align_file]
subprocess.run(" ".join(command), shell=True, check=True)
except subprocess.CalledProcessError as e:
print(f"Error while running MAFFT: {e}")
elif self.align_software == 'muscle':
try:
command = [self.muscle_path, '-align', self.pair_pep_file, '-output', self.prot_align_file, '-quiet']
subprocess.run(" ".join(command), shell=True, check=True)
except subprocess.CalledProcessError as e:
print(f"Error while running Muscle: {e}")
def pal2nal(self):
args = ['perl', self.pal2nal_path, self.prot_align_file, self.pair_cds_file, '-output paml -nogap', '>' + self.mrtrans]
command = ' '.join(args)
# os.system() does not raise when the command fails, so check its exit status.
return os.system(command) == 0
def run_yn00(self):
yn = yn00.Yn00()
yn.alignment = self.mrtrans
yn.out_file = self.pair_yn
yn.set_options(icode=0, commonf3x4=0, weighting=0, verbose=1)
try:
run_result = yn.run(command=self.yn00_path)
except Exception:
# yn00 failed or its output could not be parsed; report no result.
run_result = None
return run_result
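# Illustrative sketch, not part of wgdi: align() builds a shell string with a
# '>' redirect. One alternative, shown here under stated assumptions, is to pass
# the arguments as a list and redirect stdout to a file handle, which avoids
# shell quoting issues with unusual paths. run_to_file is a hypothetical helper,
# and the Python one-liner stands in for an aligner so the example runs anywhere.
```python
import os
import subprocess
import sys
import tempfile


def run_to_file(args, out_path):
    """Run args without a shell, write stdout to out_path, return True on success."""
    with open(out_path, "w") as handle:
        completed = subprocess.run(args, stdout=handle, check=False)
    return completed.returncode == 0


# Stand-in command; with MAFFT the list might be [mafft_path, '--quiet', 'pair.pep'].
out_path = os.path.join(tempfile.mkdtemp(), "prot.aln")
ok = run_to_file([sys.executable, "-c", "print('aligned')"], out_path)
```
# Returning the exit status lets the caller skip a gene pair when the aligner
# fails instead of continuing with an empty alignment file.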
================================================
FILE: build/lib/wgdi/ks_peaks.py
================================================
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde
import wgdi.base as base
class kspeaks:
def __init__(self, options):
# Default values
self.tandem_length = 200
self.figsize = 10, 6.18
self.fontsize = 9
self.block_length = 3
self.area = 0, 3
self.tandem = True
# Set options passed in
for k, v in options:
setattr(self, str(k), v)
print(f'{str(k)} = {v}')
# Convert string values to lists of floats
self.homo = [float(k) for k in self.homo.split(',')]
self.ks_area = [float(k) for k in self.ks_area.split(',')]
self.figsize = [float(k) for k in self.figsize.split(',')]
self.area = [float(k) for k in self.area.split(',')]
self.pvalue = float(self.pvalue)
self.block_length = int(self.block_length)
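self.tandem = base.str_to_bool(self.tandem)
# Options arrive as strings, so tandem_length must be converted before comparison
self.tandem_length = int(self.tandem_length)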
def remove_tandem(self, bkinfo):
"""
Remove tandem duplications based on start and end position differences.
"""
group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
group.loc[:, 'start'] = group.loc[:, 'start1'] - group.loc[:, 'start2']
group.loc[:, 'end'] = group.loc[:, 'end1'] - group.loc[:, 'end2']
# Drop rows where start or end difference is within tandem length
index = group[(group['start'].abs() <= self.tandem_length) |
(group['end'].abs() <= self.tandem_length)].index
bkinfo = bkinfo.drop(index)
return bkinfo
def ks_kde(self, df):
"""
Perform kernel density estimation (KDE) on Ks data.
"""
# Clean up 'ks' column by removing leading underscores
df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:]
ks = df['ks'].str.split('_')
arr = []
ks_ave = []
# Collect individual Ks values and calculate average Ks per row
for v in ks.values:
v = [float(k) for k in v if float(k) >= 0]
if len(v) == 0:
continue
arr.extend(v)
ks_ave.append(sum(v) / len(v)) # Mean of each row's Ks values
# KDE for three distributions: median, average, total
kdemedian = gaussian_kde(df['ks_median'].values)
kdemedian.set_bandwidth(bw_method=kdemedian.factor / 3.)
kdeaverage = gaussian_kde(ks_ave)
kdeaverage.set_bandwidth(bw_method=kdeaverage.factor / 3.)
kdetotal = gaussian_kde(arr)
kdetotal.set_bandwidth(bw_method=kdetotal.factor / 3.)
return [kdemedian, kdeaverage, kdetotal]
def run(self):
"""
Main method to process the data, perform KDE, and generate the plot.
"""
plt.rcParams['ytick.major.pad'] = 0
fig, ax = plt.subplots(figsize=self.figsize)
# Read the block info file
bkinfo = pd.read_csv(self.blockinfo)
bkinfo['chr1'] = bkinfo['chr1'].astype(str)
bkinfo['chr2'] = bkinfo['chr2'].astype(str)
bkinfo['length'] = bkinfo['length'].astype(int)
# Filter based on block length and p-value
bkinfo = bkinfo[(bkinfo['length'] > self.block_length) &
(bkinfo['pvalue'] < self.pvalue)]
# Remove tandem duplications if needed
if not self.tandem:
bkinfo = self.remove_tandem(bkinfo)
# Further filtering based on the homo{multiple} score range and the Ks area
bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] >= self.homo[0]]
bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] <= self.homo[1]]
bkinfo = bkinfo[bkinfo['ks_median'] >= self.ks_area[0]]
bkinfo = bkinfo[bkinfo['ks_median'] <= self.ks_area[1]]
# Perform KDE on the Ks data
kdemedian, kdeaverage, kdetotal = self.ks_kde(bkinfo)
# Define the range for the x-axis (Ks values)
dist_space = np.linspace(self.area[0], self.area[1], 500)
# Plot the KDE results
ax.plot(dist_space, kdemedian(dist_space), color='red', label='block median')
ax.plot(dist_space, kdeaverage(dist_space), color='black', label='block average')
ax.plot(dist_space, kdetotal(dist_space), color='blue', label='all pairs')
# Set plot labels, grid, and limits
ax.grid()
ax.set_xlabel(r'${K_{s}}$', fontsize=20)
ax.set_ylabel('Frequency', fontsize=20)
ax.tick_params(labelsize=18)
ax.set_xlim(self.area)
ax.legend(fontsize=20)
# Adjust layout for better display
plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)
# Save the figure
plt.savefig(self.savefig, dpi=500)
plt.show()
# Save the filtered data to CSV
bkinfo.to_csv(self.savefile, index=False)
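# --- Illustrative sketch (not part of the original module) ---
# ks_kde() above expects each block's 'ks' field to be an underscore-joined
# string such as '_0.5_1.5_-1'. It strips one leading underscore, drops
# negative values, and skips blocks with no usable value. The helper below
# mirrors that parsing with the standard library only. Its name is
# hypothetical, and reading negative entries as failed Ks estimates is an
# assumption. Unlike the original, it also ignores empty tokens.
def _example_parse_block_ks(ks_field):
    """Return (non-negative Ks values, their mean or None) for one block."""
    if ks_field.startswith('_'):
        ks_field = ks_field[1:]
    values = [float(k) for k in ks_field.split('_') if k]
    values = [v for v in values if v >= 0]
    mean = sum(values) / len(values) if values else None
    return values, mean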
================================================
FILE: build/lib/wgdi/ksfigure.py
================================================
import re
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base
from scipy import stats
class ksfigure():
def __init__(self, options):
self.figsize = '10,6.18'
self.legendfontsize = 30
self.labelfontsize = 9
self.area = '0,3'
self.shadow = True
self.mode = 'median'
for k, v in options:
setattr(self, str(k), v)
print(str(k), ' = ', v)
if self.xlabel == 'none' or self.xlabel == '':
self.xlabel = r'Synonymous nucleotide substitution (${K_{s}}$)'
if self.ylabel == 'none' or self.ylabel == '':
self.ylabel = 'kernel density of syntenic blocks'
if self.title == 'none' or self.title == '':
self.title = ''
self.figsize = [float(k) for k in self.figsize.split(',')]
self.area = [float(k) for k in self.area.split(',')]
self.shadow = base.str_to_bool(self.shadow)
def Gaussian_distribution(self, t, k):
y = np.zeros(len(t))
for i in range(0, int((len(k) - 1) / 3)+1):
if np.isnan(k[3 * i + 2]):
continue
k[3 * i + 2] = float(k[3 * i + 2])/np.sqrt(2)
k[3 * i + 0] = float(k[3 * i + 0]) * \
np.sqrt(2*np.pi)*float(k[3 * i + 2])
y1 = stats.norm.pdf(
t, float(k[3 * i + 1]), float(k[3 * i + 2])) * float(k[3 * i + 0])
y = y+y1
return y
def run(self):
plt.rcParams['ytick.major.pad'] = 0
fig, ax = plt.subplots(figsize=self.figsize)
ksfit = pd.read_csv(self.ksfit, index_col=0)
t = np.arange(self.area[0], self.area[1], 0.0005)
col = [k for k in ksfit.columns if re.match('Unnamed:', k)]
for index, row in ksfit.iterrows():
ax.plot(t, self.Gaussian_distribution(
t, row[col].values), linestyle=row['linestyle'], color=row['color'],alpha=0.8, label=index, linewidth=row['linewidth'])
if self.shadow == True:
ax.fill_between(t, 0, self.Gaussian_distribution(t, row[col].values), color=row['color'], alpha=0.15, interpolate=True, edgecolor=None, label=index,)
align = dict(family='Arial', verticalalignment="center",
horizontalalignment="center")
ax.set_xlabel(self.xlabel, fontsize=self.labelfontsize,
labelpad=20, **align)
ax.set_ylabel(self.ylabel, fontsize=self.labelfontsize,
labelpad=20, **align)
ax.set_title(self.title, weight='bold',
fontsize=self.labelfontsize, **align)
plt.tick_params(labelsize=10)
handles,labels = ax.get_legend_handles_labels()
df = pd.DataFrame({ 'handles': handles, 'labels': labels})
df.drop_duplicates(subset='labels', keep='first', inplace=True)
handles, labels = df['handles'].tolist(), df['labels'].tolist()
plt.legend(handles=handles, labels=labels, loc='upper right', prop={
'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.savefig(self.savefig, dpi=500)
plt.show()
sys.exit(0)
================================================
FILE: build/lib/wgdi/peaksfit.py
================================================
import re
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.stats import gaussian_kde, linregress
import wgdi.base as base
class peaksfit():
def __init__(self, options):
self.figsize = '10,6.18'
self.fontsize = 9
self.area = '0,3'
self.mode = 'median'
self.histogram_only = False
for k, v in options:
setattr(self, str(k), v)
print(str(k), ' = ', v)
self.figsize = [float(k) for k in self.figsize.split(',')]
self.area = [float(k) for k in self.area.split(',')]
self.bins_number = int(self.bins_number)
self.peaks = 1
self.histogram_only = base.str_to_bool(self.histogram_only)
def ks_values(self, df):
df.loc[df['ks'].str.startswith('_'),'ks']= df.loc[df['ks'].str.startswith('_'),'ks'].str[1:]
ks = df['ks'].str.split('_')
ks_total = []
ks_average = []
for v in ks.values:
ks_total.extend([float(k) for k in v])
ks_average = df['ks_average'].values
ks_median = df['ks_median'].values
return [ks_median, ks_average, ks_total]
def gaussian_fuc(self, x, *params):
y = np.zeros_like(x)
for i in range(0, len(params), 3):
amp = float(params[i])
ctr = float(params[i+1])
wid = float(params[i+2])
y = y + amp * np.exp(-((x - ctr)/wid)**2)
return y
def kde_fit(self, data, x):
kde = gaussian_kde(data)
kde.set_bandwidth(bw_method=kde.factor/3.)
p = kde(x)
guess = [1,1, 1]*self.peaks
popt, pcov = curve_fit(self.gaussian_fuc, x, p, guess, maxfev = 80000)
popt = [abs(k) for k in popt]
data = []
y = self.gaussian_fuc(x, *popt)
for i in range(0, len(popt), 3):
array = [popt[i], popt[i+1], popt[i+2]]
data.append(self.gaussian_fuc(x, *array))
slope, intercept, r_value, p_value, std_err = linregress(p, y)
print("\nR-square: "+str(r_value**2))
print("The gaussian fitting curve parameters are :")
print(' | '.join([str(k) for k in popt]))
return y, data
def run(self):
plt.rcParams['ytick.major.pad'] = 0
fig, ax = plt.subplots(figsize=self.figsize)
bkinfo = pd.read_csv(self.blockinfo)
ks_median, ks_average, ks_total = self.ks_values(bkinfo)
data = {'median': ks_median, 'average': ks_average, 'total': ks_total}[self.mode]
data = [k for k in data if self.area[0] <= k <= self.area[1]]
x = np.linspace(self.area[0], self.area[1], self.bins_number)
n, bins, patches = ax.hist(data, int(
self.bins_number), density=True, facecolor='blue', alpha=0.3, label='Histogram')
if not self.histogram_only:
y, fit = self.kde_fit(data, x)
ax.plot(x, y, color='black', linestyle='-', label='Gaussian fitting')
ax.grid()
align = dict(family='Arial', verticalalignment="center",
horizontalalignment="center")
ax.set_xlabel(r'${K_{s}}$', fontsize=20)
ax.set_ylabel('Frequency', fontsize=20)
ax.tick_params(labelsize=18)
ax.legend(fontsize=20)
ax.set_xlim(self.area)
plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)
plt.savefig(self.savefig, dpi=500)
plt.show()
sys.exit(0)
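# --- Illustrative sketch (not part of the original module) ---
# gaussian_fuc() above sums terms of the form amp * exp(-((x - ctr) / wid)**2).
# That is a normal curve with sigma = wid / sqrt(2), which appears to be why
# ksfigure.Gaussian_distribution divides the width by sqrt(2) and rescales the
# amplitude before calling stats.norm.pdf. The two hypothetical helpers below
# use only the standard library to show that both forms give the same value.
import math


def _example_gaussian_sum(x, params):
    """Evaluate the peaksfit-style model at a single point x."""
    total = 0.0
    for i in range(0, len(params), 3):
        amp, ctr, wid = params[i:i + 3]
        total += amp * math.exp(-((x - ctr) / wid) ** 2)
    return total


def _example_gaussian_sum_via_pdf(x, params):
    """Same model, written via the normal PDF as ksfigure does."""
    total = 0.0
    for i in range(0, len(params), 3):
        amp, ctr, wid = params[i:i + 3]
        sigma = wid / math.sqrt(2)
        pdf = math.exp(-0.5 * ((x - ctr) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        total += amp * math.sqrt(2 * math.pi) * sigma * pdf
    return total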
================================================
FILE: build/lib/wgdi/pindex.py
================================================
import os
import sys
import numpy as np
import pandas as pd
import wgdi.base as base
class pindex():
def __init__(self, options):
self.remove_delta = True
self.position = 'order'
self.retention = 0.05
self.diff = 0.05
self.gap = 50
for k, v in options:
setattr(self, str(k), v)
print(k, ' = ', v)
self.gap = int(self.gap)
self.retention = float(self.retention)
self.diff = float(self.diff)
self.remove_delta = base.str_to_bool(self.remove_delta)
def Pindex(self, sub1, sub2):
r1 = self.retain(sub1)
r2 = self.retain(sub2)
r = []
for i in range(len(r2)):
if(r1[i] < self.retention or r2[i] < self.retention):
r.append(0)
continue
d = (r1[i]-r2[i])/(r1[i]+r2[i])*0.5
if d > self.diff:
r.append(1)
elif -d > self.diff:
r.append(-1)
else:
r.append(0)
a, b, c = len([i for i in r if i == 1]), len(
[i for i in r if i == -1]), len([i for i in r if i == 0])
return [a, -b, c, len(r)]
def retain(self, arr):
a = []
for i in range(0, len(arr), 2*self.gap):
start, end = i-self.gap, i+self.gap
genenum, retainnum = 0, 0
for j in range(start, end):
if((j >= int(len(arr))) or (j < 0)):
continue
else:
retainnum += arr[j]
genenum += 1
a.append(float(retainnum/genenum))
return a
def run(self):
alignment = pd.read_csv(self.alignment, header=None, index_col=0)
alignment.replace(r'\w+', 1, regex=True, inplace=True)
alignment.replace('.', 0, inplace=True)
alignment.fillna(0, inplace=True)
gff = base.newgff(self.gff)
lens = base.newlens(self.lens, self.position)
gff = gff[gff['chr'].isin(lens.index)]
alignment = alignment.join(gff[['chr', self.position]], how='left')
alignment.dropna(axis=0, how='any', inplace=True)
p = self.cal_pindex(alignment)
print('Polyploidy-index: ', p)
sys.exit(0)
def cal_pindex(self, alignment):
data, df = [], []
columns = alignment.columns[:-2].tolist()
for i in range(len(columns)-1):
for j in range(i+1, len(columns)):
b = []
for chr, group in alignment.groupby('chr'):
sub1 = group.loc[:, columns[i]].tolist()
sub2 = group.loc[:, columns[j]].tolist()
p = self.Pindex(sub1, sub2)
b.append(p)
df.append([i, j, chr]+p)
sub_diver = sum([abs(k[0]+k[1]) for k in b])
if self.remove_delta == True:
sub_total = sum([abs(k[1])+abs(k[0]) for k in b])
if sub_total == 0:
c = 0
else:
c = sub_diver/sub_total
else:
sub_total = sum([abs(k[1])+abs(k[0])+abs(k[2]) for k in b])
c = sub_diver/sub_total if sub_total else 0
data.append(c)
df = pd.DataFrame(df, columns=[
'sub1', 'sub2', 'chr', 'sub1_high', 'sub2_high', 'No_diff', 'Total'])
df['sub2_high'] = df['sub2_high'].abs()
self.information(df)
print('\nPolyploidy-index values between subgenome pairs: ', data)
return sum(data)/len(data)
def turn_percentage(self, x):
return '(%.2f%%)' % (x * 100)
def information(self, df):
data = []
for names, group in df.groupby(['sub1', 'sub2']):
newgroup = pd.concat([group.head(1), group],
axis=0, ignore_index=True)
cols = ['sub1_high', 'sub2_high', 'No_diff', 'Total']
newgroup.loc[0, cols] = group.loc[:, cols].sum()
group1 = newgroup.copy()
group1[cols] = group1[cols].astype(str)
newgroup['sub1_high'] = (
newgroup['sub1_high'] / newgroup['Total']).apply(self.turn_percentage)
newgroup['sub2_high'] = (
newgroup['sub2_high'] / newgroup['Total']).apply(self.turn_percentage)
newgroup['No_diff'] = (
newgroup['No_diff'] / newgroup['Total']).apply(self.turn_percentage)
newgroup['Total'] = (
newgroup['Total'] / group['Total'].sum()).apply(self.turn_percentage)
newgroup[cols] = group1[cols]+newgroup[cols]
group_list = []
a = newgroup[['chr']+cols].columns.to_numpy()
a[0] = 'Chromosome'
a[1], a[2] = 'Sub_'+str(names[0]+1), 'Sub_'+str(names[1]+1)
group_list.append(a)
b = newgroup[['chr']+cols].to_numpy()
b[0][0] = 'Total'
for k in b:
group_list.append(k)
group_list = np.array(group_list).T
for k in group_list:
data.append(k)
data = pd.DataFrame(data)
data.to_csv(self.savefile, header=None, index=None)
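# --- Illustrative sketch (not part of the original module) ---
# Pindex() above compares two subgenomes window by window: a window counts for
# subgenome 1 (+1) or subgenome 2 (-1) when the normalised retention difference
# exceeds `diff`, and as undecided (0) otherwise or when either retention is
# below `retention`. The hypothetical helpers below restate that rule for one
# window and then combine the calls for a single chromosome. cal_pindex() does
# the same after summing the counts over all chromosomes.
def _example_window_call(r1, r2, retention=0.05, diff=0.05):
    """Return 1, -1 or 0 for one window, following the rule in Pindex()."""
    if r1 < retention or r2 < retention:
        return 0
    d = (r1 - r2) / (r1 + r2) * 0.5
    if d > diff:
        return 1
    if -d > diff:
        return -1
    return 0


def _example_pindex(calls, remove_delta=True):
    """Combine window calls; undecided windows are ignored if remove_delta."""
    up = sum(1 for c in calls if c == 1)
    down = sum(1 for c in calls if c == -1)
    same = sum(1 for c in calls if c == 0)
    total = up + down if remove_delta else up + down + same
    return abs(up - down) / total if total else 0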
================================================
FILE: build/lib/wgdi/polyploidy_classification.py
================================================
import pandas as pd
import wgdi.base as base
class polyploidy_classification:
def __init__(self, options):
self.same_protochromosome = False
self.same_subgenome = False
for k, v in options:
setattr(self, str(k), v)
print(f"{k} = {v}")
self.same_protochromosome = base.str_to_bool(self.same_protochromosome)
self.same_subgenome = base.str_to_bool(self.same_subgenome)
# Initialize classid with a default value if not provided
self.classid = [str(k) for k in getattr(self, 'classid', 'class1,class2').split(',')]
def run(self):
# Read input files
ancestor_left = base.read_classification(self.ancestor_left)
ancestor_top = base.read_classification(self.ancestor_top)
bkinfo = pd.read_csv(self.blockinfo)
# Ensure chr1 and chr2 are treated as strings
bkinfo['chr1'] = bkinfo['chr1'].astype(str)
bkinfo['chr2'] = bkinfo['chr2'].astype(str)
# Filter rows where chr1 and chr2 match ancestor values
bkinfo = bkinfo[bkinfo['chr1'].isin(ancestor_left[0].values) & bkinfo['chr2'].isin(ancestor_top[0].values)].copy()
# Initialize additional columns
bkinfo[self.classid[0]] = 0
bkinfo[self.classid[1]] = 0
bkinfo[self.classid[0] + '_color'] = ''
bkinfo[self.classid[1] + '_color'] = ''
bkinfo['diff'] = 0.0
# Processing the first classification (ancestor_left vs chr1)
for name, group in bkinfo.groupby('chr1'):
d1 = ancestor_left[ancestor_left[0] == name]
for index1, row1 in group.iterrows():
a, b = sorted([row1['start1'], row1['end1']])
a, b = int(a), int(b)
for index2, row2 in d1.iterrows():
c, d = sorted([row2[1], row2[2]])
h = max(0, min(b, d) - max(a, c)) / (b - a) if b > a else 0.0
if h > bkinfo.loc[index1, 'diff']:
bkinfo.loc[index1, 'diff'] = float(h)
bkinfo.loc[index1, self.classid[0]] = row2[4]
bkinfo.loc[index1, self.classid[0] + '_color'] = row2[3]
# Reset 'diff' and process the second classification (ancestor_top vs chr2)
bkinfo['diff'] = 0.0
for name, group in bkinfo.groupby('chr2'):
d2 = ancestor_top[ancestor_top[0] == name]
for index1, row1 in group.iterrows():
a, b = sorted([row1['start2'], row1['end2']])
a, b = int(a), int(b)
for index2, row2 in d2.iterrows():
c, d = sorted([row2[1], row2[2]])
h = max(0, min(b, d) - max(a, c)) / (b - a) if b > a else 0.0
if h > bkinfo.loc[index1, 'diff']:
bkinfo.loc[index1, 'diff'] = float(h)
bkinfo.loc[index1, self.classid[1]] = row2[4]
bkinfo.loc[index1, self.classid[1] + '_color'] = row2[3]
# Optionally keep only rows whose two classifications share a color (same_protochromosome) or a class (same_subgenome)
if self.same_protochromosome == True:
bkinfo = bkinfo[bkinfo[self.classid[1] + '_color'] == bkinfo[self.classid[0] + '_color']]
if self.same_subgenome == True:
bkinfo = bkinfo[bkinfo[self.classid[1]] == bkinfo[self.classid[0]]]
# Save the result to a CSV file
bkinfo.to_csv(self.savefile, index=False)
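# --- Illustrative sketch (not part of the original module) ---
# run() above assigns each block to the ancestral region that covers the
# largest share of the block's half-open interval [a, b). For integer
# coordinates that share can be computed from the interval end points alone.
# The helper name is hypothetical, and returning 0.0 for an empty block
# interval is an assumption made here to avoid dividing by zero.
def _example_overlap_fraction(a, b, c, d):
    """Fraction of [a, b) covered by [c, d), with end points in any order."""
    a, b = sorted((int(a), int(b)))
    c, d = sorted((int(c), int(d)))
    if b == a:
        return 0.0
    return max(0, min(b, d) - max(a, c)) / (b - a)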
================================================
FILE: build/lib/wgdi/retain.py
================================================
import matplotlib.pyplot as plt
import pandas as pd
import wgdi.base as base
class retain:
def __init__(self, options):
self.position = 'order'
# Initialize the options by setting attributes dynamically
for k, v in options:
setattr(self, str(k), v)
print(f"{str(k)} = {v}")
# Handle the ylim parameter, which defines the y-axis limits
self.ylim = [float(k) for k in self.ylim.split(',')] if hasattr(self, 'ylim') else [0, 1]
# Handle the colors and figsize parameters
self.colors = [str(k) for k in self.colors.split(',')]
self.figsize = [float(k) for k in self.figsize.split(',')]
def run(self):
# Load GFF and lens data
gff = base.newgff(self.gff)
lens = base.newlens(self.lens, self.position)
# Filter GFF data based on lens chromosome index
gff = gff[gff['chr'].isin(lens.index)]
# Load alignment data and join with GFF
alignment = pd.read_csv(self.alignment, header=None, index_col=0)
alignment = alignment.join(gff[['chr', self.position]], how='left')
# Perform alignment processing
self.retain = self.align_chr(alignment)
# Save the processed data to a file
self.retain[self.retain.columns[:-2]].to_csv(self.savefile, sep='\t', header=None)
# Create a figure for plotting
fig, axs = plt.subplots(len(lens), 1, sharex=True, sharey=True, figsize=tuple(self.figsize))
fig.add_subplot(111, frameon=False)
align = dict(family='DejaVu Sans', verticalalignment="center", horizontalalignment="center")
# Hide all the spines and ticks on the plot
for spine in plt.gca().spines.values():
spine.set_visible(False)
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
# Group the retain data by chromosome and plot each chromosome's data
groups = self.retain.groupby('chr')
for i, chr_name in enumerate(lens.index):
group = groups.get_group(chr_name)
if len(lens) == 1:
for j, col in enumerate(self.retain.columns[:-2]):
axs.plot(group[self.position].values, group[col].values,
linestyle='-', color=self.colors[j], linewidth=1)
axs.spines['right'].set_visible(False)
axs.spines['top'].set_visible(False)
axs.set_ylim(self.ylim)
axs.tick_params(labelsize=12)
else:
# Plot each column's data for the current chromosome
for j, col in enumerate(self.retain.columns[:-2]):
axs[i].plot(group[self.position].values, group[col].values,
linestyle='-', color=self.colors[j], linewidth=1)
# Hide the right and top spines for each subplot
axs[i].spines['right'].set_visible(False)
axs[i].spines['top'].set_visible(False)
axs[i].set_ylim(self.ylim)
axs[i].tick_params(labelsize=12)
for i, chr_name in enumerate(lens.index):
if len(lens) == 1:
x, y = axs.get_xlim()[1] * 0.90, axs.get_ylim()[1] * 0.8
axs.text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align)
else:
# Add a label for the reference genome and chromosome
x, y = axs[i].get_xlim()[1] * 0.90, axs[i].get_ylim()[1] * 0.8
axs[i].text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align)
# Adjust layout and save the figure as an image
plt.ylabel(f"{self.ylabel}\n\n\n\n", fontsize=18, **align)
plt.subplots_adjust(left=0.1, right=0.95, top=0.95, bottom=0.05)
plt.savefig(self.savefig, dpi=500)
plt.show()
def align_chr(self, alignment):
"""
Perform the alignment processing for each chromosome by updating the values.
"""
for i in alignment.columns[:-2]:
# Update values: set '1' for valid values, '0' for invalid, and fill NaN with 0
alignment.loc[alignment[i].str.contains(r'\w', na=False), i] = 1
alignment.loc[alignment[i] == '.', i] = 0
alignment.loc[alignment[i] == ' ', i] = 0
alignment[i] = alignment[i].astype('float64').fillna(0)
# Apply the moving average function to each group by chromosome
for chr_name, group in alignment.groupby('chr'):
a = self.moving_average(group[i].values.tolist())
alignment.loc[group.index, i] = a
return alignment
def moving_average(self, arr):
"""
Calculate a moving average over the half-open window [i - step, i + step),
clipped to the array bounds, to smooth the input array.
"""
a = []
for i in range(len(arr)):
# Define the window range
start, end = max(0, i - int(self.step)), min(len(arr), i + int(self.step))
ave = sum(arr[start:end]) / (end - start)
a.append(ave)
return a
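# --- Illustrative sketch (not part of the original module) ---
# moving_average() above averages arr[i - step : i + step], so the window
# includes `step` values before position i but only `step - 1` after it, and
# it shrinks at both ends of the chromosome. The standalone copy below uses a
# hypothetical name and takes `step` as an explicit argument. Rejecting
# step < 1 is an addition made here because it would give an empty window.
def _example_moving_average(arr, step):
    """Smooth arr with the same clipped window as retain.moving_average."""
    if step < 1:
        raise ValueError('step must be at least 1')
    out = []
    for i in range(len(arr)):
        start, end = max(0, i - step), min(len(arr), i + step)
        out.append(sum(arr[start:end]) / (end - start))
    return out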
================================================
FILE: build/lib/wgdi/run.py
================================================
import argparse
import os
import shutil
import sys
import wgdi
import wgdi.base as base
from wgdi.align_dotplot import align_dotplot
from wgdi.block_correspondence import block_correspondence
from wgdi.block_info import block_info
from wgdi.block_ks import block_ks
from wgdi.circos import circos
from wgdi.dotplot import dotplot
from wgdi.karyotype import karyotype
from wgdi.karyotype_mapping import karyotype_mapping
from wgdi.ks import ks
from wgdi.ks_peaks import kspeaks
from wgdi.ksfigure import ksfigure
from wgdi.peaksfit import peaksfit
from wgdi.pindex import pindex
from wgdi.polyploidy_classification import polyploidy_classification
from wgdi.retain import retain
from wgdi.run_colliearity import mycollinearity
from wgdi.trees import trees
from wgdi.ancestral_karyotype import ancestral_karyotype
from wgdi.ancestral_karyotype_repertoire import ancestral_karyotype_repertoire
from wgdi.shared_fusion import shared_fusion
from wgdi.fusion_positions_database import fusion_positions_database
from wgdi.fusions_detection import fusions_detection
# Argument parser setup
parser = argparse.ArgumentParser(
prog='wgdi', usage='%(prog)s [options]', epilog="",
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.description = '''\
WGDI(Whole-Genome Duplication Integrated): A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes.
https://wgdi.readthedocs.io/en/latest/
--------------------------------------
'''
parser.add_argument("-v", "--version", action='version', version='0.75')
parser.add_argument("-d", dest="dotplot", help="Show homologous gene dotplot")
parser.add_argument("-icl", dest="improvedcollinearity", help="Improved version of ColinearScan")
parser.add_argument("-ks", dest="calks", help="Calculate Ka/Ks for homologous gene pairs by YN00")
parser.add_argument("-bk", dest="blockks", help="Show Ks of blocks in a dotplot")
parser.add_argument("-bi", dest="blockinfo", help="Summarize collinear blocks and their Ks to infer whole-genome duplications")
parser.add_argument("-c", dest="correspondence", help="Extract event-related genomic alignment")
parser.add_argument("-kp", dest="kspeaks", help="A simple way to get ks peaks")
parser.add_argument("-kf", dest="ksfigure", help="A simple way to draw ks distribution map")
parser.add_argument("-pf", dest="peaksfit", help="Gaussian fitting of ks distribution")
parser.add_argument("-pc", dest="polyploidy_classification", help="Distinguish among the subgenomes of a polyploid")
parser.add_argument("-a", dest="alignment", help="Show event-related genomic alignment in a dotplot")
parser.add_argument("-k", dest="karyotype", help="Show genome evolution from reconstructed ancestors")
parser.add_argument("-ak", dest="ancestral_karyotype", help="Generation of ancestral karyotypes from chromosomes that retain same structures in genomes")
parser.add_argument("-akr", dest="ancestral_karyotype_repertoire", help="Incorporate genes from collinearity blocks into the ancestral karyotype repertoire")
parser.add_argument("-km", dest="karyotype_mapping", help="Mapping from the known karyotype result to this species")
parser.add_argument("-fpd", dest="fusion_positions_database", help="Extract the fusion positions dataset")
parser.add_argument("-fd", dest="fusions_detection", help="Determine whether these fusion events occur in other genomes")
parser.add_argument("-sf", dest="shared_fusion", help="Quickly find shared fusions between species")
parser.add_argument("-at", dest="alignmenttrees", help="Construct phylogenetic trees from collinear genes")
parser.add_argument("-p", dest="pindex", help="Polyploidy-index: characterize the degree of divergence between subgenomes of a polyploid")
parser.add_argument("-r", dest="retain", help="Show subgenomes in gene retention or genome fractionation")
parser.add_argument("-ci", dest="circos", help="A simple way to run circos")
parser.add_argument("-conf", dest="configure", help="Display and modify the environment variable")
args = parser.parse_args()
# Function to run subprograms based on options
def run_subprogram(program, conf, name):
options = base.load_conf(conf, name)
r = program(options)
r.run()
# Function to configure environment
def run_configure():
base.rewrite(args.configure, 'ini')
# Main function to decide which module to run based on input arguments
def module_to_run(argument, conf):
switcher = {
'dotplot': (dotplot, conf, 'dotplot'),
'correspondence': (block_correspondence, conf, 'correspondence'),
'alignment': (align_dotplot, conf, 'alignment'),
'retain': (retain, conf, 'retain'),
'blockks': (block_ks, conf, 'blockks'),
'blockinfo': (block_info, conf, 'blockinfo'),
'calks': (ks, conf, 'ks'),
'circos': (circos, conf, 'circos'),
'kspeaks': (kspeaks, conf, 'kspeaks'),
'peaksfit': (peaksfit, conf, 'peaksfit'),
'ksfigure': (ksfigure, conf, 'ksfigure'),
'pindex': (pindex, conf, 'pindex'),
'alignmenttrees': (trees, conf, 'alignmenttrees'),
'improvedcollinearity': (mycollinearity, conf, 'collinearity'),
'configure': run_configure,
'polyploidy_classification': (polyploidy_classification, conf, 'polyploidy classification'),
'karyotype': (karyotype, conf, 'karyotype'),
'ancestral_karyotype': (ancestral_karyotype, conf, 'ancestral_karyotype'),
'karyotype_mapping': (karyotype_mapping, conf, 'karyotype_mapping'),
'ancestral_karyotype_repertoire': (ancestral_karyotype_repertoire, conf, 'ancestral_karyotype_repertoire'),
'shared_fusion': (shared_fusion, conf, 'shared_fusion'),
'fusion_positions_database': (fusion_positions_database, conf, 'fusion_positions_database'),
'fusions_detection': (fusions_detection, conf, 'fusions_detection'),
}
if argument == 'configure':
run_configure()
else:
entry = switcher.get(argument)
if entry:
program, conf, name = entry
run_subprogram(program, conf, name)
# Main entry point
def main():
path = wgdi.__path__[0]
options = {
'dotplot': 'dotplot.conf',
'correspondence': 'corr.conf',
'alignment': 'align.conf',
'retain': 'retain.conf',
'blockks': 'blockks.conf',
'blockinfo': 'blockinfo.conf',
'calks': 'ks.conf',
'circos': 'circos.conf',
'kspeaks': 'kspeaks.conf',
'ksfigure': 'ksfigure.conf',
'pindex': 'pindex.conf',
'alignmenttrees': 'alignmenttrees.conf',
'peaksfit': 'peaksfit.conf',
'configure': 'conf.ini',
'improvedcollinearity': 'collinearity.conf',
'polyploidy_classification': 'polyploidy_classification.conf',
'karyotype': 'karyotype.conf',
'ancestral_karyotype': 'ancestral_karyotype.conf',
'ancestral_karyotype_repertoire': 'ancestral_karyotype_repertoire.conf',
'karyotype_mapping': 'karyotype_mapping.conf',
'shared_fusion': 'shared_fusion.conf',
'fusion_positions_database': 'fusion_positions_database.conf',
'fusions_detection': 'fusions_detection.conf',
}
for arg in vars(args):
value = getattr(args, arg)
if value is not None:
if value in ['?', 'help', 'example']:
with open(os.path.join(path, 'example', options[arg])) as f:
print(f.read())
if arg == 'ksfigure' and not os.path.exists('ks_fit_result.csv'):
shutil.copy2(os.path.join(wgdi.__path__[0], 'example/ks_fit_result.csv'), os.getcwd())
elif not os.path.exists(value):
print(f'{value} does not exist')
sys.exit(1)
else:
module_to_run(arg, value)
if __name__ == "__main__":
main()
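# --- Illustrative sketch (not part of the original module) ---
# module_to_run() above looks up an argparse destination in a table and then
# builds and runs the matching class. A minimal version of that table-driven
# dispatch with an explicit error for unknown keys could look like this. The
# function name and the (callable, section) table layout are hypothetical and
# simpler than the (class, conf, name) tuples used above.
def _example_dispatch(table, argument, conf):
    """Call the handler registered for `argument` with (conf, section)."""
    entry = table.get(argument)
    if entry is None:
        raise KeyError(f"no module registered for '{argument}'")
    program, section = entry
    return program(conf, section)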
================================================
FILE: build/lib/wgdi/run_colliearity.py
================================================
import gc
import re
import sys
from multiprocessing import Pool
import numpy as np
import pandas as pd
import wgdi.base as base
import wgdi.collinearity as improvedcollinearity
class mycollinearity():
def __init__(self, options):
# Initialize parameters with default values
self.repeat_number = 10
self.multiple = 1
self.score = 100
self.evalue = 1e-5
self.blast_reverse = False
self.over_gap = 5
self.comparison = 'genomes'
self.options = options
for k, v in options:
setattr(self, str(k), v)
print(f"{str(k)} = {v}")
self.position = 'order'
# Parse grading values
if hasattr(self, 'grading'):
self.grading = [int(k) for k in self.grading.split(',')]
else:
self.grading = [50, 40, 25]
# Ensure process is an integer
if hasattr(self, 'process'):
self.process = int(self.process)
else:
self.process = 4
self.over_gap = int(self.over_gap)
self.blast_reverse = base.str_to_bool(self.blast_reverse)
def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):
bluenum = rednum
blast = blast.sort_values(by=[0, 11], ascending=[True, False])
def assign_grading(group):
group['cumcount'] = group.groupby(1).cumcount()
group = group[group['cumcount'] <= repeat_number]
group['grading'] = pd.cut(
group['cumcount'],
bins=[-1, 0, bluenum, repeat_number],
labels=self.grading,
right=True
)
return group
newblast = blast.groupby(['chr1', 'chr2']).apply(assign_grading).reset_index(drop=True)
newblast['grading'] = newblast['grading'].astype(int)
return newblast[newblast['grading'] > 0]
def deal_blast_for_genomes(self, blast, rednum, repeat_number):
# Initialize the grading column
blast['grading'] = 0
# Define the blue number as the sum of rednum and the predefined constant
bluenum = 4 + rednum
# Get the indices for each group by sorting the 11th column in descending order
index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()
for name, group in blast.groupby([0])]
# Split the indices into red, blue, and gray groups
reddata = np.array([k[:rednum] for k in index], dtype=object)
bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)
graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)
# Concatenate the results into flat lists
redindex = np.concatenate(reddata) if reddata.size else []
blueindex = np.concatenate(bluedata) if bluedata.size else []
grayindex = np.concatenate(graydata) if graydata.size else []
# Update the grading column based on the group indices
blast.loc[redindex, 'grading'] = self.grading[0]
blast.loc[blueindex, 'grading'] = self.grading[1]
blast.loc[grayindex, 'grading'] = self.grading[2]
# Return only the rows with non-zero grading
return blast[blast['grading'] > 0]
def run(self):
# Read and process lens files
lens1 = base.newlens(self.lens1, 'order')
lens2 = base.newlens(self.lens2, 'order')
# Read and process gff files
gff1 = base.newgff(self.gff1)
gff2 = base.newgff(self.gff2)
# Filter gff data based on lens indices
gff1 = gff1[gff1['chr'].isin(lens1.index)]
gff2 = gff2[gff2['chr'].isin(lens2.index)]
# Process blast data
blast = base.newblast(self.blast, int(self.score), float(self.evalue),gff1, gff2, self.blast_reverse)
# Map positions and chromosome information
blast['loc1'] = blast[0].map(gff1[self.position])
blast['loc2'] = blast[1].map(gff2[self.position])
blast['chr1'] = blast[0].map(gff1['chr'])
blast['chr2'] = blast[1].map(gff2['chr'])
# Apply blast filtering and grading
if self.comparison.lower() == 'genomes':
blast = self.deal_blast_for_genomes(blast, int(self.multiple), int(self.repeat_number))
if self.comparison.lower() == 'chromosomes':
blast = self.deal_blast_for_chromosomes(blast, int(self.multiple), int(self.repeat_number))
print(f"The filtered homologous gene pairs are {len(blast)}.\n")
if len(blast) < 1:
print("Stopped!\n\nIt may be that the id1 and id2 in the BLAST file do not match with (gff1, lens1) and (gff2, lens2).")
sys.exit(1)
# Group blast data by 'chr1' and 'chr2'
total = []
for (chr1, chr2), group in blast.groupby(['chr1', 'chr2']):
total.append([chr1, chr2, group])
del blast, group
gc.collect()
# Determine chunk size for multiprocessing
n = int(np.ceil(len(total) / float(self.process)))
result, data = '', []
try:
# Initialize multiprocessing Pool
pool = Pool(self.process)
for i in range(0, len(total), n):
# Apply single_pool function asynchronously
data.append(pool.apply_async(
self.single_pool, args=(total[i:i + n], gff1, gff2, lens1, lens2)
))
pool.close()
pool.join()
except:
pool.terminate()
for k in data:
# Collect results from async tasks
text = k.get()
if text:
result += text
# Write final output to file
result = re.split('\n', result)
fout = open(self.savefile, 'w')
num = 1
for line in result:
if re.match(r"# Alignment", line):
# Replace alignment number
s = f'# Alignment {num}:'
fout.write(s + line.split(':', 1)[1] + '\n')
num += 1
continue
if len(line) > 0:
fout.write(line + '\n')
fout.close()
sys.exit(0)
def single_pool(self, group, gff1, gff2, lens1, lens2):
text = ''
for bk in group:
chr1, chr2 = str(bk[0]), str(bk[1])
print(f'Running {chr1} vs {chr2}')
# Extract and sort points
points = bk[2][['loc1', 'loc2', 'grading']].sort_values(
by=['loc1', 'loc2'], ascending=[True, True]
)
# Initialize collinearity analysis
collinearity = improvedcollinearity.collinearity(
self.options, points)
data = collinearity.run()
if not data:
continue
# Extract gene information
gf1 = gff1[gff1['chr'] == chr1].reset_index().set_index('order')[[1, 'strand']]
gf2 = gff2[gff2['chr'] == chr2].reset_index().set_index('order')[[1, 'strand']]
n = 1
for block, evalue, score in data:
if len(block) < self.over_gap:
continue
# Map gene names and strands
block['name1'] = block['loc1'].map(gf1[1])
block['name2'] = block['loc2'].map(gf2[1])
block['strand1'] = block['loc1'].map(gf1['strand'])
block['strand2'] = block['loc2'].map(gf2['strand'])
block['strand'] = np.where(
block['strand1'] == block['strand2'], '1', '-1'
)
# Prepare text output
block['text'] = block.apply(
lambda x: f"{x['name1']} {x['loc1']} {x['name2']} {x['loc2']} {x['strand']}\n",
axis=1
)
# Determine alignment mark
a, b = block['loc2'].head(2).values
mark = 'plus' if a < b else 'minus'
# Append alignment information
text += f'# Alignment {n}: score={score} pvalue={evalue} N={len(block)} {chr1}&{chr2} {mark}\n'
text += ''.join(block['text'].values)
n += 1
return text
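# Illustrative sketch (not part of wgdi): the header renumbering done at the end
# of run() above, written as a standalone function. The name renumber_alignments
# and the sample text are hypothetical. Unlike the original, this version splits
# on the first colon only, so anything after a second colon in a header is kept.

```python
import re


def renumber_alignments(text):
    """Renumber '# Alignment N:' headers consecutively and drop blank lines."""
    out, num = [], 1
    for line in text.split('\n'):
        if re.match(r"# Alignment", line):
            # Keep everything after the first colon, replace only the counter
            rest = line.split(':', 1)[1]
            out.append(f'# Alignment {num}:{rest}')
            num += 1
        elif line:
            out.append(line)
    return '\n'.join(out) + '\n'


# Two worker results that both start counting at 1, as single_pool does
sample = (
    "# Alignment 1: score=100 pvalue=0.01 N=1 1&1 plus\n"
    "g1 1 g2 5 1\n"
    "\n"
    "# Alignment 1: score=80 pvalue=0.02 N=1 1&2 minus\n"
    "g3 7 g4 9 -1\n"
)
print(renumber_alignments(sample))
```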
================================================
FILE: build/lib/wgdi/shared_fusion.py
================================================
import pandas as pd
import wgdi.base as base
class shared_fusion:
def __init__(self, options):
for k, v in options:
setattr(self, str(k), v)
print(f"{k} = {v}")
# Handle classid and limit_length options
self.classid = [str(k) for k in self.classid.split(',')] if hasattr(self, 'classid') else ['class1', 'class2']
self.limit_length = int(self.limit_length) if hasattr(self, 'limit_length') else 20
# Clean and split lens files
self.lens1 = self.lens1.replace(' ', '').split(',')
self.lens2 = self.lens2.replace(' ', '').split(',')
def run(self):
# Read classification files and block information
ancestor_left = base.read_classification(self.ancestor_left)
ancestor_top = base.read_classification(self.ancestor_top)
bkinfo = pd.read_csv(self.blockinfo)
# Preprocess blockinfo columns
bkinfo['chr1'] = bkinfo['chr1'].astype(str)
bkinfo['chr2'] = bkinfo['chr2'].astype(str)
bkinfo['start1'] = bkinfo['start1'].astype(int)
bkinfo['end1'] = bkinfo['end1'].astype(int)
bkinfo['start2'] = bkinfo['start2'].astype(int)
bkinfo['end2'] = bkinfo['end2'].astype(int)
# Filter based on ancestor chromosomes
bkinfo = bkinfo[(bkinfo['chr1'].isin(ancestor_left[0].values)) &
(bkinfo['chr2'].isin(ancestor_top[0].values))]
# Read lens files
lens1 = pd.read_csv(self.lens1[0], sep='\t', header=None)
lens2 = pd.read_csv(self.lens2[0], sep='\t', header=None)
lens1[0] = lens1[0].astype(str)
lens2[0] = lens2[0].astype(str)
# Perform block fusion analysis
blockinfoout = self.block_fusions(bkinfo, ancestor_left, ancestor_top)
# Apply filters based on breakpoints and length
blockinfoout = blockinfoout[(blockinfoout['breakpoints1'] == 1) &
(blockinfoout['breakpoints2'] == 1)]
blockinfoout = blockinfoout[(blockinfoout['break_length1'] >= self.limit_length) &
(blockinfoout['break_length2'] >= self.limit_length)]
# Save the filtered block info
blockinfoout.to_csv(self.filtered_blockinfo, index=False)
# Filter lens data based on the blockinfoout
lens1 = lens1[lens1[0].isin(blockinfoout['chr1'].values)]
lens2 = lens2[lens2[0].isin(blockinfoout['chr2'].values)]
# Save filtered lens data
lens1.to_csv(self.lens1[1], sep='\t', index=False, header=False)
lens2.to_csv(self.lens2[1], sep='\t', index=False, header=False)
def block_fusions(self, bkinfo, ancestor_left, ancestor_top):
# Initialize new columns in the bkinfo dataframe
bkinfo['breakpoints1'] = 0
bkinfo['breakpoints2'] = 0
bkinfo['break_length1'] = 0
bkinfo['break_length2'] = 0
for index, row in bkinfo.iterrows():
# Process species 1 (chr1)
a, b = sorted([row['start1'], row['end1']])
d1 = ancestor_left[(ancestor_left[0] == row['chr1']) &
(ancestor_left[2] >= a) & (ancestor_left[1] <= b)]
if len(d1) > 1:
bkinfo.loc[index, 'breakpoints1'] = 1
breaklength_max = 0
for _, row2 in d1.iterrows():
                    # Overlap of the half-open ranges [a, b) and [row2[1], row2[2])
                    length_in = max(0, min(b, row2[2]) - max(a, row2[1]))
length_out = (b - a) - length_in
breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)
bkinfo.loc[index, 'break_length1'] = breaklength_max
# Process species 2 (chr2)
c, d = sorted([row['start2'], row['end2']])
d2 = ancestor_top[(ancestor_top[0] == row['chr2']) &
(ancestor_top[2] >= c) & (ancestor_top[1] <= d)]
if len(d2) > 1:
bkinfo.loc[index, 'breakpoints2'] = 1
breaklength_max = 0
for _, row2 in d2.iterrows():
                    # Overlap of the half-open ranges [c, d) and [row2[1], row2[2])
                    length_in = max(0, min(d, row2[2]) - max(c, row2[1]))
length_out = (d - c) - length_in
breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)
bkinfo.loc[index, 'break_length2'] = breaklength_max
return bkinfo
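# Illustrative sketch (not part of wgdi): the break-length arithmetic used in
# block_fusions, isolated as two small functions. The names overlap_length and
# break_length are hypothetical. The closed-form overlap is cross-checked against
# explicit enumeration of the shared integers.

```python
def overlap_length(a, b, start, end):
    """Count the integers shared by the half-open ranges [a, b) and [start, end)."""
    return max(0, min(b, end) - max(a, start))


def break_length(a, b, start, end):
    """min(inside, outside) + 1 for one ancestral segment against block [a, b)."""
    inside = overlap_length(a, b, start, end)
    outside = (b - a) - inside
    return min(inside, outside) + 1


# Cross-check the closed form against brute-force enumeration
for a, b, s, e in [(10, 50, 1, 30), (10, 50, 30, 80), (10, 50, 60, 80), (5, 5, 1, 9)]:
    brute = len([k for k in range(a, b) if k in range(s, e)])
    assert overlap_length(a, b, s, e) == brute
```

# A block passes the shared-fusion filter only when the largest break_length over
# its overlapping ancestral segments reaches limit_length on both genomes.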
================================================
FILE: build/lib/wgdi/trees.py
================================================
import os
import shutil
from io import StringIO
import numpy as np
import pandas as pd
from Bio import AlignIO, Seq, SeqIO, SeqRecord
import subprocess
import wgdi.base as base
class trees():
def __init__(self, options):
base_conf = base.config()
self.position = 'order'
self.alignfile = ''
self.align_trimming = ''
self.trimming = 'trimal'
self.threads = '1'
self.minimum = 4
self.tree_software = 'iqtree'
self.delete_detail = True
for k, v in base_conf:
setattr(self, str(k), v)
for k, v in options:
setattr(self, str(k), v)
print(str(k), ' = ', v)
if hasattr(self, 'codon_position'):
self.codon_position = [
int(k)-1 for k in self.codon_position.split(',')]
else:
self.codon_position = [0, 1, 2]
self.delete_detail = base.str_to_bool(self.delete_detail)
def grouping(self, alignment):
data = []
indexs = []
if not os.path.exists(self.dir):
os.makedirs(self.dir)
sequence = SeqIO.to_dict(SeqIO.parse(self.sequence_file, "fasta"))
if hasattr(self, 'cds_file'):
seq_cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta"))
for index, row in alignment.iterrows():
file = base.gen_md5_id(str(row.values))
self.sequencefile = os.path.join(self.dir, file+'.fasta')
self.alignfile = os.path.join(self.dir, file+'.aln')
self.align_trimming = self.alignfile+'.trimming'
self.treefile = os.path.join(self.dir, file+'.aln.treefile')
if os.path.isfile(self.treefile) and os.path.isfile(self.alignfile):
data.append(self.treefile)
indexs.append(index)
continue
ids = []
ids_cds = []
for i in range(len(row)):
if type(row[i]) == float and np.isnan(row[i]):
continue
gene_sequence = sequence[row[i]]
gene_sequence.id = str(int(i)+1)
gene_sequence.description = ''
ids.append(gene_sequence)
SeqIO.write(ids, self.sequencefile, "fasta")
self.align()
if hasattr(self, 'cds_file'):
self.seqcdsfile = os.path.join(self.dir, file+'.cds.fasta')
for i in range(len(row)):
if type(row[i]) == float and np.isnan(row[i]):
continue
gene_cds = seq_cds[row[i]]
gene_cds.id = str(int(i)+1)
ids_cds.append(gene_cds)
SeqIO.write(ids_cds, self.seqcdsfile, "fasta")
self.pal2nal()
self.codon()
if self.trimming.upper() == 'TRIMAL':
self.trimal()
if self.trimming.upper() == 'DIVVIER':
self.divvier()
self.buildtrees()
if os.path.isfile(self.treefile):
data.append(self.treefile)
return data
def codon(self):
if self.codon_position == [0, 1, 2]:
shutil.move(self.alignfile+'.mrtrans', self.alignfile)
return True
records = list(SeqIO.parse(self.alignfile+'.mrtrans', 'fasta'))
if len(records) == 0:
return False
newrecords = []
def final_list(test_list, x, y): return [
test_list[i+j] for i in range(0, len(test_list), x) for j in y]
for k in records:
if len(k.seq) % 3 > 0:
return False
seq = final_list(k.seq, 3, self.codon_position)
k.seq = ''.join(seq)
newrecords.append(SeqRecord.SeqRecord(
Seq.Seq(k.seq), id=k.id, description=''))
SeqIO.write(newrecords, self.alignfile, 'fasta')
return True
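    def pal2nal(self):
        # Thread the CDS onto the protein alignment with pal2nal.
        # os.system reports failure through its exit status rather than an
        # exception, so return that status as a bool.
        args = ['perl', self.pal2nal_path, self.alignfile,
                self.seqcdsfile, '-output fasta', '>'+self.alignfile+'.mrtrans']
        command = ' '.join(args)
        return os.system(command) == 0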
def align(self):
if self.align_software == 'mafft':
try:
command = [self.mafft_path,'--quiet', self.sequencefile, '>', self.alignfile]
subprocess.run(" ".join(command), shell=True, check=True)
except subprocess.CalledProcessError as e:
print(f"Error while running MAFFT: {e}")
if self.align_software == 'muscle':
try:
command = [self.muscle_path,'-align', self.sequencefile, '-output', self.alignfile, '-quiet']
subprocess.run(" ".join(command), shell=True, check=True)
except subprocess.CalledProcessError as e:
print(f"Error while running Muscle: {e}")
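    def trimal(self):
        # Trim the alignment with trimAl.
        # os.system reports failure through its exit status rather than an
        # exception, so return that status as a bool.
        args = [self.trimal_path, '-in', self.alignfile,
                '-out', self.align_trimming, '-automated1']
        command = ' '.join(args)
        return os.system(command) == 0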
def divvier(self):
args = [self.divvier_path, '-mincol', '4', '-divvygap', self.alignfile]
command = ' '.join(args)
try:
os.system(command)
os.rename(self.alignfile+'.divvy.fas', self.align_trimming)
except:
return False
return True
def buildtrees(self):
try:
if self.tree_software.upper() == 'IQTREE':
args = [self.iqtree_path, '-s', self.align_trimming,
'-m', self.model, '-T', self.threads, '--quiet']
command = ' '.join(args)
os.system(command)
os.rename(self.align_trimming+'.treefile', self.treefile)
elif self.tree_software.upper() == 'FASTTREE':
args = [self.fasttree_path,
self.align_trimming, '>', self.treefile]
command = ' '.join(args)
os.system(command)
except:
return False
if self.delete_detail == True:
for file in (self.sequencefile, self.align_trimming+'.bionj', self.align_trimming+'.iqtree', self.align_trimming+'.ckp.gz',
self.align_trimming+'.log', self.align_trimming+'.mldist', self.align_trimming+'.model.gz'):
try:
os.remove(file)
except OSError:
pass
return True
def run(self):
alignment = pd.read_csv(self.alignment, header=None)
alignment.replace('.', np.nan, inplace=True)
alignment.dropna(thresh=int(self.minimum), inplace=True)
if hasattr(self, 'gff') and hasattr(self, 'lens'):
gff = base.newgff(self.gff)
lens = base.newlens(self.lens, self.position)
alignment = pd.merge(
alignment, gff[['chr', self.position]], left_on=0, right_on=gff.index, how='left')
alignment.dropna(subset=['chr', 'order'], inplace=True)
alignment['order'] = alignment['order'].astype(int)
alignment = alignment[alignment['chr'].isin(lens.index)]
alignment.drop(alignment.columns[-2:], axis=1, inplace=True)
data = self.grouping(alignment)
fout = open(self.trees_file, 'w')
fout.close()
for i in range(0, len(data), 100):
trees = ' '.join([str(k) for k in data[i:i+100]])
args = ['cat', trees, '>>', self.trees_file]
command = ' '.join([str(k) for k in args])
os.system(command)
df = pd.read_csv(self.trees_file, header=None, sep='\t')
df[0].to_csv(self.trees_file, index=None, sep='\t', header=False)
print("done")
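# Illustrative sketch (not part of wgdi): the codon-position selection performed
# by trees.codon(), on a plain string. The name select_codon_positions is
# hypothetical. Where codon() returns False for a sequence whose length is not a
# multiple of 3, this sketch raises ValueError instead.

```python
def select_codon_positions(seq, positions):
    """Keep only the given 0-based positions (0, 1, 2) from every codon of seq."""
    if len(seq) % 3:
        raise ValueError("sequence length is not a multiple of 3")
    return ''.join(seq[i + j] for i in range(0, len(seq), 3) for j in positions)


# First and second codon positions of ATG GCC TAA
print(select_codon_positions("ATGGCCTAA", [0, 1]))
```

# The 0-based positions correspond to the codon_position option after 1 has been
# subtracted from each value in __init__ (for example "1,2" becomes [0, 1]).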
================================================
FILE: command.txt
================================================
python setup.py sdist bdist_wheel
twine upload dist/*
================================================
FILE: setup.py
================================================
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from setuptools import find_packages, setup
with open("README.md", "r", encoding='utf-8') as fh:
long_description = fh.read()
required = ['pandas>=1.1.0', 'numpy', 'biopython', 'matplotlib', 'scipy', 'tabulate']
setup(
name="wgdi",
version="0.75",
author="Pengchuan Sun",
author_email="sunpengchuan@gmail.com",
description="A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes",
license="BSD License",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/SunPengChuan/wgdi",
packages=find_packages(),
package_data={'': ['*.conf','*.ini', '*.csv']},
classifiers=[
"Intended Audience :: Science/Research",
"Programming Language :: Python :: 3",
"License :: OSI Approved :: BSD License",
"Operating System :: OS Independent",
],
entry_points={
'console_scripts': [
'wgdi = wgdi.run:main',
]
},
zip_safe=True,
install_requires=required
)
================================================
FILE: wgdi/__init__.py
================================================
================================================
FILE: wgdi/align_dotplot.py
================================================
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base
class align_dotplot:
def __init__(self, options):
# Default values
self.position = 'order'
self.figsize = 'default'
self.classid = 'class1'
# Initialize from options
for k, v in options:
setattr(self, str(k), v)
print(f'{k} = {v}')
self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
self.colors = [str(k) for k in getattr(self, 'colors', 'red,blue,green,black,orange').split(',')]
self.ancestor_top = None if getattr(self, 'ancestor_top', 'none') == 'none' else self.ancestor_top
self.ancestor_left = None if getattr(self, 'ancestor_left', 'none') == 'none' else self.ancestor_left
        # Default to False when blockinfo_reverse is missing from the options
        self.blockinfo_reverse = base.str_to_bool(getattr(self, 'blockinfo_reverse', False))
def pair_position(self, alignment, loc1, loc2, colors):
alignment.index = alignment.index.map(loc1)
data = []
for i, k in enumerate(alignment.columns):
df = alignment[k].map(loc2).dropna()
for idx, row in df.items():
data.append([idx, row, colors[i]])
return pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])
def run(self):
axis = [0, 1, 1, 0]
# Lens generation and figure size
lens1 = base.newlens(self.lens1, self.position)
lens2 = base.newlens(self.lens2, self.position)
if re.search(r'\d', self.figsize):
self.figsize = [float(k) for k in self.figsize.split(',')]
else:
self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10
plt.rcParams['ytick.major.pad'] = 0
# Create plot
fig, ax = plt.subplots(figsize=self.figsize)
ax.xaxis.set_ticks_position('top')
step1, step2 = 1 / float(lens1.sum()), 1 / float(lens2.sum())
# Process Ancestor Data
if self.ancestor_left:
axis[0] = -0.02
lens_ancestor_left = self.process_ancestor(self.ancestor_left, lens1.index)
if self.ancestor_top:
axis[3] = -0.02
lens_ancestor_top = self.process_ancestor(self.ancestor_top, lens2.index)
base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,
self.genome1_name, self.genome2_name, [0, 1])
# Process GFF files
gff1, gff2 = base.newgff(self.gff1), base.newgff(self.gff2)
gff1 = base.gene_location(gff1, lens1, step1, self.position)
gff2 = base.gene_location(gff2, lens2, step2, self.position)
if self.ancestor_top:
self.ancestor_position(ax, gff2, lens_ancestor_top, 'top')
if self.ancestor_left:
self.ancestor_position(ax, gff1, lens_ancestor_left, 'left')
# Process block info and alignment
bkinfo = self.process_blockinfo(lens1,lens2)
align = self.alignment(gff1, gff2, bkinfo)
alignment = align[gff1.columns[-len(bkinfo[self.classid].drop_duplicates()):]]
alignment.to_csv(self.savefile, header=False)
# Create scatter plot
df = self.pair_position(alignment, gff1['loc'], gff2['loc'], self.colors)
plt.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'],
alpha=0.5, edgecolors=None, linewidths=0, marker='o')
ax.axis(axis)
plt.subplots_adjust(left=0.07, right=0.97, top=0.93, bottom=0.03)
plt.savefig(self.savefig, dpi=500)
plt.show()
def process_ancestor(self, ancestor_file, lens_index):
df = pd.read_csv(ancestor_file, sep="\t", header=None)
df[0] = df[0].astype(str)
df[3] = df[3].astype(str)
df[4] = df[4].astype(int)
df[4] = df[4] / df[4].max()
return df[df[0].isin(lens_index)]
def process_blockinfo(self, lens1, lens2):
bkinfo = pd.read_csv(self.blockinfo, index_col='id')
if self.blockinfo_reverse == True:
bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
bkinfo['chr1'] = bkinfo['chr1'].astype(str)
bkinfo['chr2'] = bkinfo['chr2'].astype(str)
bkinfo[self.classid] = bkinfo[self.classid].astype(str)
return bkinfo[bkinfo['chr1'].isin(lens1.index) & (bkinfo['chr2'].isin(lens2.index))]
def alignment(self, gff1, gff2, bkinfo):
gff1['uid'] = gff1['chr'] + 'g' + gff1['order'].astype(str)
gff2['uid'] = gff2['chr'] + 'g' + gff2['order'].astype(str)
gff1['id'] = gff1.index
gff2['id'] = gff2.index
for cl, group in bkinfo.groupby(self.classid):
name = f'l{cl}'
gff1[name] = ''
group = group.sort_values(by=['length'], ascending=True)
for _, row in group.iterrows():
block = self.create_block_dataframe(row)
if block.empty:
continue
block1_min, block1_max = block['block1'].agg(['min', 'max'])
area = gff1[(gff1['chr'] == row['chr1']) &
(gff1['order'] >= block1_min) &
(gff1['order'] <= block1_max)].index
block['id1'] = (row['chr1'] + 'g' + block['block1'].astype(str)).map(
dict(zip(gff1['uid'], gff1.index)))
block['id2'] = (row['chr2'] + 'g' + block['block2'].astype(str)).map(
dict(zip(gff2['uid'], gff2.index)))
gff1.loc[block['id1'].values, name] = block['id2'].values
gff1.loc[gff1.index.isin(area) & gff1[name].eq(''), name] = '.'
return gff1
def create_block_dataframe(self, row):
b1, b2, ks = row['block1'].split('_'), row['block2'].split('_'), row['ks'].split('_')
ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))
block = pd.DataFrame(np.array([b1, b2, ks]).T, columns=['block1', 'block2', 'ks'])
block['block1'] = block['block1'].astype(int)
block['block2'] = block['block2'].astype(int)
block['ks'] = block['ks'].astype(float)
return block[(block['ks'] <= self.ks_area[1]) &
(block['ks'] >= self.ks_area[0])].drop_duplicates(subset=['block1'], keep='first')
def ancestor_position(self, ax, gff, lens, mark):
for _, row in lens.iterrows():
loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index
loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2]))].index
loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']
if mark == 'top':
width = abs(loc1-loc2)
loc = [min(loc1, loc2), 0]
height = -0.02
if mark == 'left':
height = abs(loc1-loc2)
loc = [-0.02, min(loc1, loc2), ]
width = 0.02
base.Rectangle(ax, loc, height, width, row[3], row[4])
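# Illustrative sketch (not part of wgdi): what create_block_dataframe does with
# one blockinfo row, without pandas. The name parse_block is hypothetical. The
# original strips only a leading empty Ks field; this sketch drops every empty
# field, which gives the same result for well-formed rows.

```python
def parse_block(block1, block2, ks, ks_area=(-1, 3)):
    """Split underscore-joined blockinfo fields and keep pairs whose Ks is in range."""
    b1 = [int(k) for k in block1.split('_')]
    b2 = [int(k) for k in block2.split('_')]
    ks_values = [float(k) for k in ks.split('_') if k != '']
    seen, kept = set(), []
    for x, y, k in zip(b1, b2, ks_values):
        # Filter on Ks first, then keep the first surviving pair per block1 position
        if ks_area[0] <= k <= ks_area[1] and x not in seen:
            seen.add(x)
            kept.append((x, y, k))
    return kept


pairs = parse_block('1_2_2_4', '10_11_12_13', '_0.5_5.0_0.7_1.2', ks_area=(0, 3))
print(pairs)
```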
================================================
FILE: wgdi/ancestral_karyotype.py
================================================
import pandas as pd
from Bio import SeqIO
import wgdi.base as base
class ancestral_karyotype:
def __init__(self, options):
self.mark = 'aak'
# Set attributes from options
for k, v in options:
setattr(self, str(k), v)
print(f"{k} = {v}")
def run(self):
# Load and filter data
gff = base.newgff(self.gff)
ancestor = base.read_classification(self.ancestor)
gff = gff[gff['chr'].isin(ancestor[0].values.tolist())]
# Create new gff copy and initialize required variables
newgff = gff.copy()
data, num = [], 1
# Create dictionary mapping chromosome to order
chr_arr = ancestor[3].drop_duplicates().to_list()
chr_dict = {chr: idx + 1 for idx, chr in enumerate(chr_arr)}
ancestor['order'] = ancestor[3].map(chr_dict)
dict1, dict2 = {}, {}
# Process ancestor and gff information
for (cla, order), group in ancestor.groupby([4, 'order'], sort=[False, False]):
for index, row in group.iterrows():
index1 = gff[(gff['chr'] == row[0]) & (gff['order'] >= row[1]) & (gff['order'] <= row[2])].index
newgff.loc[index1, 'chr'] = str(num)
# Store results in data
for k in index1:
data.append(newgff.loc[k, :].values.tolist() + [k])
dict1[str(num)] = cla
dict2[str(num)] = group[3].values[0]
num += 1
# Create dataframe from the data collected
df = pd.DataFrame(data)
# Filter based on peptide file
pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta"))
df = df[df[6].isin(pep.keys())]
# Assign new names and order
for name, group in df.groupby(0):
df.loc[group.index, 'order'] = range(1, len(group) + 1)
df.loc[group.index, 'newname'] = [f"{self.mark}{name}g{i:05d}" for i in range(1, len(group) + 1)]
# Set data types and sort
df['order'] = df['order'].astype(int)
df = df[[0, 'newname', 1, 2, 3, 'order', 6]].sort_values(by=[0, 'order'])
# Save output files
df.to_csv(self.ancestor_gff, sep="\t", index=False, header=None)
lens = df.groupby(0).max()[[2, 'order']]
lens.to_csv(self.ancestor_lens, sep="\t", header=None)
# Add extra columns and save final results
lens[1] = 1
lens['color'] = lens.index.map(dict2)
lens['class'] = lens.index.map(dict1)
lens[[1, 'order', 'color', 'class']].to_csv(self.ancestor_file, sep="\t", header=None)
# Update peptide sequences with new IDs and save
id_dict = df.set_index(6).to_dict()['newname']
seqs = []
for seq_record in SeqIO.parse(self.pep_file, "fasta"):
if seq_record.id in id_dict:
seq_record.id = id_dict[seq_record.id]
seqs.append(seq_record)
SeqIO.write(seqs, self.ancestor_pep, "fasta")
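# Illustrative sketch (not part of wgdi): the gene naming scheme applied in run()
# above, "<mark><chromosome>g<5-digit order>". The name rename_genes and the
# input dictionary layout are hypothetical.

```python
def rename_genes(genes_by_chr, mark='aak'):
    """Map old gene IDs to '<mark><chr>g<5-digit order>' names, per chromosome."""
    new_names = {}
    for chrom, genes in genes_by_chr.items():
        # Order restarts at 1 on every chromosome
        for order, old_id in enumerate(genes, start=1):
            new_names[old_id] = f"{mark}{chrom}g{order:05d}"
    return new_names


print(rename_genes({'1': ['geneA', 'geneB'], '2': ['geneC']}))
```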
================================================
FILE: wgdi/ancestral_karyotype_repertoire.py
================================================
import numpy as np
import pandas as pd
from Bio import SeqIO
import wgdi.base as base
class ancestral_karyotype_repertoire():
def __init__(self, options):
self.gap = 5
self.direction = 0.01
self.mark = 'aak1s'
self.blockinfo_reverse = False
for k, v in options:
setattr(self, str(k), v)
print(k, ' = ', v)
self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
def run(self):
gff1 = base.newgff(self.gff1)
gff2 = base.newgff(self.gff2)
bkinfo = pd.read_csv(self.blockinfo, index_col='id')
if self.blockinfo_reverse == True:
bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
for index, row in bkinfo.iterrows():
block1, block2 = row['block1'].split('_'), row['block2'].split('_')
block1, block2 = [int(k) for k in block1], [int(k) for k in block2]
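            # Pick the insertion direction for this block; without the reset, one
            # reversed block would leave every later block descending as well
            self.direction = -0.01 if block1[1] < block1[0] else 0.01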
for i in range(1, len(block2)):
if abs(block1[i]-block1[i-1]) == 1 and abs(block2[i]-block2[i-1]) < int(self.gap):
gff1_id = gff1[(gff1['chr'] == str(row['chr1'])) & (
gff1['order'] == block1[i])].index[0]
order = gff1.loc[gff1_id, 'order']
gff1_row = gff1.loc[gff1_id, :].copy()
for num in range(block2[i-1], block2[i]):
order = order + self.direction
id = gff2[(gff2['chr'] == str(row['chr2']))
& (gff2['order'] == num)].index[0]
gff1_row['order'] = order
gff1.loc[id, :] = gff1_row
df = gff1.copy()
df = df.sort_values(by=['chr', 'order'])
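        # Group by the column name, not a one-element list: pandas >= 2.0 yields
        # tuple keys for list groupers, which would corrupt the new gene names
        for name, group in df.groupby('chr'):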
df.loc[group.index, 'order'] = list(range(1, len(group)+1))
df.loc[group.index, 'newname'] = list(
[str(self.mark)+str(name)+'g'+str(i).zfill(5) for i in range(1, len(group)+1)])
df['order'] = df['order'].astype(int)
df['oldname'] = df.index
columns = ['chr', 'newname', 'start',
'end', 'strand', 'order', 'oldname']
df[columns].to_csv(self.ancestor_gff, sep="\t",
index=False, header=None)
lens = df.groupby('chr').max()[['end', 'order']]
lens['end'] = lens['end'].astype(np.int64)
lens.to_csv(self.ancestor_lens, sep="\t", header=None)
ancestor = base.read_classification(self.ancestor)
for index, row in ancestor.iterrows():
ancestor.at[index, 1] = 1
ancestor.at[index, 2] = lens.at[str(row[0]),'order']
ancestor.to_csv(self.ancestor_new, sep="\t", index=False, header=None)
id_dict = df['newname'].to_dict()
seqs = []
for seq_record in SeqIO.parse(self.ancestor_pep, "fasta"):
if seq_record.id in id_dict:
seq_record.id = id_dict[seq_record.id]
else:
continue
seq_record.description = ''
seqs.append(seq_record)
SeqIO.write(seqs, self.ancestor_pep_new, "fasta")
================================================
FILE: wgdi/base.py
================================================
import configparser
import hashlib
import os
import re
import matplotlib
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
from Bio import SeqIO
import wgdi
def gen_md5_id(item):
"""Generate MD5 hash for the given item."""
return hashlib.md5(item.encode('utf-8')).hexdigest()
def config():
"""Read configuration from the example conf.ini file."""
conf = configparser.ConfigParser()
conf.read(os.path.join(wgdi.__path__[0], 'example/conf.ini'))
return conf.items('ini')
def load_conf(file, section):
"""Load configuration items from the specified section."""
conf = configparser.ConfigParser()
conf.read(file)
return conf.items(section)
def rewrite(file, section):
"""Rewrite the configuration file to keep only the specified section."""
conf = configparser.ConfigParser()
conf.read(file)
if conf.has_section(section):
for k in conf.sections():
if k != section:
conf.remove_section(k)
        # Use a context manager so the file handle is flushed and closed
        with open(os.path.join(wgdi.__path__[0], 'example/conf.ini'), 'w') as fout:
            conf.write(fout)
print('Option ini has been modified')
else:
print('Option ini no change')
def read_colinearscan(file):
"""Read colinearscan output and parse into data structure."""
data, b, flag, num = [], [], 0, 1
with open(file) as f:
for line in f:
line = line.strip()
if re.match(r"the", line):
num = re.search(r'\d+', line).group()
b = []
flag = 1
continue
if re.match(r"\>LOCALE", line):
flag = 0
p = re.split(':', line)
if b:
data.append([num, b, p[1]])
b = []
continue
if flag == 1:
a = re.split(r"\s", line)
b.append(a)
if b:
data.append([num, b, p[1]])
return data
def read_mcscanx(fn):
"""Read mcscanx output and parse into data structure."""
with open(fn) as f1:
data, b = [], []
flag, num = 0, 0
for line in f1:
line = line.strip()
if re.match(r"## Alignment", line):
flag = 1
if not b:
arr = re.findall(r"[\d+\.]+", line)[0]
continue
data.append([num, b, 0])
b = []
num = re.findall(r"\d+", line)[0]
continue
if flag == 0:
continue
a = re.split(r"\:", line)
c = re.split(r"\s+", a[1])
b.append([c[1], c[1], c[2], c[2]])
if b:
data.append([num, b, 0])
return data
def read_jcvi(fn):
"""Read jcvi output and parse into data structure."""
with open(fn) as f1:
data, b = [], []
num = 1
for line in f1:
line = line.strip()
if re.match(r"###", line):
if b:
data.append([num, b, 0])
b = []
num += 1
continue
a = re.split(r"\t", line)
b.append([a[0], a[0], a[1], a[1]])
if b:
data.append([num, b, 0])
return data
def read_collinearity(fn):
"""Read collinearity output and parse into data structure."""
with open(fn) as f1:
data, b = [], []
flag, arr = 0, []
for line in f1:
line = line.strip()
if re.match(r"# Alignment", line):
flag = 1
if not b:
arr = re.findall(r'[\.\d+]+', line)
continue
data.append([arr[0], b, arr[2]])
b = []
arr = re.findall(r'[\.\d+]+', line)
continue
if flag == 0:
continue
b.append(re.split(r"\s", line))
if b:
data.append([arr[0], b, arr[2]])
return data
def read_ks(file, col):
"""Read KS values from file and select specified column."""
ks = pd.read_csv(file, sep='\t')
ks.drop_duplicates(subset=['id1', 'id2'], keep='first', inplace=True)
ks[col] = ks[col].astype(float)
ks = ks[ks[col] >= 0]
ks.index = ks['id1'] + ',' + ks['id2']
return ks[col]
def get_median(data):
"""Calculate the median of the data list."""
if not data:
return 0
data_sorted = sorted(data)
half = len(data_sorted) // 2
return (data_sorted[half] + data_sorted[-(half + 1)]) / 2
def cds_to_pep(cds_file, pep_file, fmt='fasta'):
"""Translate CDS sequences to peptide sequences and write to file."""
records = list(SeqIO.parse(cds_file, fmt))
for rec in records:
rec.seq = rec.seq.translate()
SeqIO.write(records, pep_file, 'fasta')
return True
def newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):
"""Filter BLAST results based on score, evalue, and gene locations."""
blast = pd.read_csv(file, sep="\t", header=None)
if reverse == 'true':
blast[[0, 1]] = blast[[1, 0]]
blast = blast[(blast[11] >= score) & (blast[10] < evalue) & (blast[1] != blast[0])]
blast = blast[(blast[0].isin(gene_loc1.index)) & (blast[1].isin(gene_loc2.index))]
blast.drop_duplicates(subset=[0, 1], keep='first', inplace=True)
blast[0] = blast[0].astype(str)
blast[1] = blast[1].astype(str)
return blast
def newgff(file):
"""Read GFF file and rename columns with appropriate data types."""
gff = pd.read_csv(file, sep="\t", header=None, index_col=1)
gff.rename(columns={0: 'chr', 2: 'start', 3: 'end', 4: 'strand', 5: 'order'}, inplace=True)
gff['chr'] = gff['chr'].astype(str)
gff['start'] = gff['start'].astype(np.int64)
gff['end'] = gff['end'].astype(np.int64)
gff['strand'] = gff['strand'].astype(str)
gff['order'] = gff['order'].astype(int)
return gff
def newlens(file, position):
"""Read lens file and select position based on 'order' or 'end'."""
lens = pd.read_csv(file, sep="\t", header=None, index_col=0)
lens.index = lens.index.astype(str)
if position == 'order':
lens = lens[2]
elif position == 'end':
lens = lens[1]
return lens
def read_classification(file):
"""Read classification data and convert columns to appropriate types."""
classification = pd.read_csv(file, sep="\t", header=None)
classification[0] = classification[0].astype(str)
classification[1] = classification[1].astype(int)
classification[2] = classification[2].astype(int)
classification[3] = classification[3].astype(str)
classification[4] = classification[4].astype(int)
return classification
def gene_location(gff, lens, step, position):
"""Calculate gene locations based on lens and step."""
gff = gff[gff['chr'].isin(lens.index)].copy()
if gff.empty:
        print('Stopped! \n\nChromosomes in gff file and lens file do not correspond.')
exit(0)
dict_chr = dict(zip(lens.index, np.append(np.array([0]), lens.cumsum()[:-1].values)))
gff['loc'] = ''
for name, group in gff.groupby('chr'):
gff.loc[group.index, 'loc'] = (dict_chr[name] + group[position]) * step
return gff
def dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, genome2_name, arr, pad = 0):
"""Set up the dotplot frame with grid lines and labels."""
for k in lens1.cumsum()[:-1] * step1:
ax.axhline(y=k, alpha=0.8, color='black', lw=0.5)
for k in lens2.cumsum()[:-1] * step2:
ax.axvline(x=k, alpha=0.8, color='black', lw=0.5)
align = dict(family='DejaVu Sans', style='italic', horizontalalignment="center", verticalalignment="center")
yticks = lens1.cumsum() * step1 - 0.5 * lens1 * step1
ax.set_yticks(yticks)
ax.set_yticklabels(lens1.index, fontsize = 13, family='DejaVu Sans', style='normal')
ax.tick_params(axis='y', which='major', pad = pad)
ax.tick_params(axis='x', which='major', pad = pad)
xticks = lens2.cumsum() * step2 - 0.5 * lens2 * step2
ax.set_xticks(xticks)
ax.set_xticklabels(lens2.index, fontsize = 13, family='DejaVu Sans', style='normal')
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
    # Both branches of the former arr checks drew identical labels, so draw them
    # once; arr stays in the signature for existing callers
    ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)
    ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)
def Bezier3(plist, t):
    """Calculate a quadratic Bezier curve (degree 2, three control points)."""
p0, p1, p2 = plist
return p0 * (1 - t) ** 2 + 2 * p1 * t * (1 - t) + p2 * t ** 2
def Bezier4(plist, t):
"""Calculate Bezier curve of degree 4."""
p0, p1, p2, p3, p4 = plist
return p0 * (1 - t) ** 4 + 4 * p1 * t * (1 - t) ** 3 + 6 * p2 * t ** 2 * (1 - t) ** 2 + 4 * p3 * (1 - t) * t ** 3 + p4 * t ** 4
def Rectangle(ax, loc, height, width, color, alpha):
"""Draw a rectangle on the axes with specified properties."""
p = mpatches.Rectangle(loc, width, height, edgecolor=None, facecolor=color, alpha=alpha)
ax.add_patch(p)
def str_to_bool(s):
    """Convert a config value to bool; only 'true' (any case) is truthy."""
if isinstance(s, bool):
return s
return str(s).strip().lower() == 'true'
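# Illustrative sketch (not part of wgdi): the coordinate scaling behind
# gene_location, in plain Python. Each chromosome gets a cumulative start offset,
# and a gene's position is scaled to a 0..1 coordinate across the whole genome.
# The names chromosome_offsets and gene_fraction, and the use of a dict in place
# of the pandas Series returned by newlens, are assumptions made for illustration.

```python
from itertools import accumulate


def chromosome_offsets(lens):
    """Return the cumulative start offset of each chromosome, in input order."""
    names = list(lens)
    starts = [0] + list(accumulate(lens.values()))[:-1]
    return dict(zip(names, starts))


def gene_fraction(chrom, position, lens):
    """Scale a within-chromosome position to a 0..1 coordinate across all chromosomes."""
    step = 1 / sum(lens.values())
    return (chromosome_offsets(lens)[chrom] + position) * step


lens = {'1': 100, '2': 300}
print(chromosome_offsets(lens))
print(gene_fraction('2', 100, lens))
```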
================================================
FILE: wgdi/block_correspondence.py
================================================
import re
import numpy as np
import pandas as pd
import wgdi.base as base


class block_correspondence():
    def __init__(self, options):
        # Default values
        self.tandem = True
        self.pvalue = 0.2
        self.position = 'order'
        self.block_length = 5
        self.tandem_length = 200
        self.tandem_ratio = 1
        self.ks_hit = 0.5
        # Set user-defined options
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        # Parse ks_area (optional, defaults to '-1,3') and homo (required)
        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
        self.homo = [float(k) for k in self.homo.split(',')]
        self.tandem_ratio = float(self.tandem_ratio)
        # User-supplied options may be strings, so cast before comparing
        self.ks_hit = float(self.ks_hit)
        self.tandem = base.str_to_bool(self.tandem)

    def run(self):
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        # Load block information from CSV
        bkinfo = pd.read_csv(self.blockinfo)
        bkinfo = self.preprocess_blockinfo(bkinfo, lens1, lens2)
        # Initialize correspondence DataFrame
        cor = self.initialize_correspondence(lens1, lens2)
        # If no tandem allowed, remove tandem regions
        if not self.tandem:
            bkinfo = self.remove_tandem(bkinfo)
        # Remove blocks with too few Ks values inside ks_area
        bkinfo = self.remove_ks_hit(bkinfo)
        # Find collinearity regions and save results
        collinear_indices = self.collinearity_region(cor, bkinfo, lens1)
        bkinfo.loc[bkinfo.index.isin(collinear_indices), :].to_csv(self.savefile, index=False)

    def preprocess_blockinfo(self, bkinfo, lens1, lens2):
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        # Filter by length, chromosome indices, and p-value
        bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) &
                        (bkinfo['chr1'].isin(lens1.index)) &
                        (bkinfo['chr2'].isin(lens2.index)) &
                        (bkinfo['pvalue'] <= float(self.pvalue))]
        # Filter by tandem ratio if the column exists
        if 'tandem_ratio' in bkinfo.columns:
            bkinfo = bkinfo[bkinfo['tandem_ratio'] <= self.tandem_ratio]
        return bkinfo

    def initialize_correspondence(self, lens1, lens2):
        # Create correspondence DataFrame with initial values
        cor = [[k, i, 0, lens1[i], j, 0, lens2[j], float(self.homo[0]), float(self.homo[1])]
               for k in range(1, int(self.multiple) + 1)
               for i in lens1.index
               for j in lens2.index]
        cor = pd.DataFrame(cor, columns=['sub', 'chr1', 'start1', 'end1', 'chr2', 'start2', 'end2', 'homo1', 'homo2'])
        cor['chr1'] = cor['chr1'].astype(str)
        cor['chr2'] = cor['chr2'].astype(str)
        return cor

    def remove_tandem(self, bkinfo):
        # Remove same-chromosome blocks whose start or end offsets fall within tandem_length
        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
        group['start'] = group['start1'] - group['start2']
        group['end'] = group['end1'] - group['end2']
        tandem_condition = (group['start'].abs() <= int(self.tandem_length)) | (group['end'].abs() <= int(self.tandem_length))
        index_to_remove = group[tandem_condition].index
        return bkinfo.drop(index_to_remove)

    def remove_ks_hit(self, bkinfo):
        # Keep blocks whose share of Ks values inside ks_area reaches ks_hit;
        # blocks without any Ks values are dropped
        keep = []
        for ks_str in bkinfo['ks']:
            ks = self.get_ks_value(ks_str)
            hits = sum(self.ks_area[0] <= k <= self.ks_area[1] for k in ks)
            keep.append(bool(ks) and hits / len(ks) >= self.ks_hit)
        # Build the mask on the frame's own index so an empty frame keeps its columns
        return bkinfo[pd.Series(keep, index=bkinfo.index, dtype=bool)]

    def get_ks_value(self, ks_str):
        # Split the '_'-joined Ks string into floats, skipping a leading empty field
        ks = ks_str.split('_')
        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))
        return ks

    def collinearity_region(self, cor, bkinfo, lens):
        collinear_indices = []
        for (chr1, chr2), group in bkinfo.groupby(['chr1', 'chr2']):
            # Longest blocks are considered first
            group = group.sort_values(by=['length'], ascending=False)
            # Coverage counter for every gene position on chr1
            df = pd.Series(0, index=range(1, int(lens[str(chr1)]) + 1))
            for index, row in group.iterrows():
                # Check homology conditions
                if not self.is_valid_homo(row):
                    continue
                # Fraction of this block's positions that are not yet covered
                b1 = [int(k) for k in row['block1'].split('_')]
                df1 = df.copy()
                df1.loc[b1] += 1
                ratio = (len(df1[df1 > 0]) - len(df[df > 0])) / len(b1)
                # Skip blocks that mostly overlap already selected ones
                if ratio < 0.5:
                    continue
                df.loc[b1] += 1
                collinear_indices.append(index)
        return collinear_indices

    def is_valid_homo(self, row):
        # Check if the homology value for the chosen multiple is within the specified range
        return self.homo[0] <= row['homo' + str(self.multiple)] <= self.homo[1]
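
The two filters in block_correspondence.py can be sketched outside the class as follows. This is a minimal illustration and not the module's own code: the function names `parse_ks`, `ks_hit_ratio`, and `select_blocks` are hypothetical, and a plain `set` stands in for the pandas Series that `collinearity_region` uses to track covered positions. It assumes each block lists its gene positions without repeats.

```python
def parse_ks(ks_str):
    """Split an underscore-joined Ks string into floats, skipping a leading '_'."""
    parts = ks_str.split('_')
    if parts[0] == '':
        parts = parts[1:]
    return [float(p) for p in parts]


def ks_hit_ratio(ks_values, low, high):
    """Fraction of Ks values inside [low, high]; 0.0 when the list is empty."""
    if not ks_values:
        return 0.0
    return sum(low <= k <= high for k in ks_values) / len(ks_values)


def select_blocks(blocks, min_new_fraction=0.5):
    """Greedy selection (hypothetical helper): visit blocks from longest to
    shortest and keep a block only if at least min_new_fraction of its
    positions are not covered by blocks kept earlier."""
    covered = set()
    kept = []
    ordered = sorted(blocks.items(), key=lambda kv: len(kv[1]), reverse=True)
    for block_id, positions in ordered:
        new_positions = set(positions) - covered
        if len(new_positions) / len(positions) < min_new_fraction:
            continue
        covered.update(positions)
        kept.append(block_id)
    return kept


# Illustrative data only: block ids mapped to gene positions on one chromosome
example_blocks = {
    'a': [1, 2, 3, 4, 5, 6],
    'b': [5, 6, 7],
    'c': [6, 7, 8, 9],
}
kept_blocks = select_blocks(example_blocks)
```

In this toy input, block `b` is skipped because every position it adds is already covered once `a` and `c` have been kept. The 0.5 threshold mirrors the hard-coded `ratio < 0.5` check in `collinearity_region`.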
================================================
FILE: wgdi/block_info.py
================================================
import numpy as np
import pandas as pd
import wgdi.base as base


class block_info:
    def __init__(self, options):
        self.repeat_number = 20
        self.ks_col = 'ks_NG86'
        self.blast_reverse = False
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        self.repeat_number = int(self.repeat_number)
        self.blast_reverse = base.str_to_bool(self.blast_reverse)

    def block_position(self, collinearity, blast, gff1, gff2, ks):
        data = []
        for block in collinearity:
            blk_homo, blk_ks = [], []
            # Skip blocks with missing gene coordinates in GFF files
            if block[1][0][0] not in gff1.index or block[1][0][2] not in gff2.index:
                continue
            # Extract chromosome info
            chr1, chr2 = gff1.at[block[1][0][0], 'chr'], gff2.at[block[1][0][2], 'chr']
            # Extract start and end positions
            array1, array2 = [float(i[1]) for i in block[1]], [float(i[3]) for i in block[1]]
            start1, end1 = array1[0], array1[-1]
            start2, end2 = array2[0], array2[-1]
            block1, block2 = [], []
            for k in block[1]:
                block1.append(int(float(k[1])))
                block2.append(int(float(k[3])))
                # Check for KS values
                pair_ks = self.get_ks_value(ks, k)
                blk_ks.append(pair_ks)
                # Retrieve blast homo data
                if k[0]+","+k[2] in blast.index:
                    blk_homo.append(blast.loc[k[0]+","+k[2], [f'homo{i}' for i in range(1, 6)]].values.tolist())
            ks_median, ks_average = self.calculate_ks_statistics(blk_ks)
            homo = self.calculate_homo_statistics(blk_homo)
            blkks = '_'.join([str(k) for k in blk_ks])
            block1 = '_'.join([str(k) for k in block1])
            block2 = '_'.join([str(k) for k in block2])
            # Calculate tandem ratio
            tandem_ratio = self.tandem_ratio(blast, gff2, block[1])
            # Store the results
            data.append([
                block[0], chr1, chr2, start1, end1, start2, end2, block[2], len(block[1]),
                ks_median, ks_average, *homo, block1, block2, blkks, tandem_ratio
            ])
        # Create a DataFrame with the results
        data_df = pd.DataFrame(data, columns=[
            'id', 'chr1', 'chr2', 'start1', 'end1', 'start2', 'end2', 'pvalue', 'length',
            'ks_median', 'ks_average', 'homo1', 'homo2', 'homo3', 'homo4', 'homo5',
            'block1', 'block2', 'ks', 'tandem_ratio'
        ])
        # Calculate density
        data_df['density1'] = data_df['length'] / ((data_df['end1'] - data_df['start1']).abs() + 1)
        data_df['density2'] = data_df['length'] / ((data_df['end2'] - data_df['start2']).abs() + 1)
        return data_df

    def get_ks_value(self, ks, k):
        """Return KS value for the given pair of genes."""
        pair = f"{k[0]},{k[2]}"
        if pair in ks.index:
            return ks[pair]
        pair_rev = f"{k[2]},{k[0]}"
        if pair_rev in ks.index:
            return ks[pair_rev]
        return -1

    def calculate_ks_statistics(self, blk_ks):
        """Calculate KS statistics: median and average."""
        ks_arr = [k for k in blk_ks if k >= 0]
        if len(ks_arr) == 0:
            return -1, -1
        ks_median = base.get_median(ks_arr)
        ks_average = sum(ks_arr) / len(ks_arr)
        return ks_median, ks_average

    def calculate_homo_statistics(self, blk_homo):
        """Calc
SYMBOL INDEX (336 symbols across 50 files)
FILE: build/lib/wgdi/align_dotplot.py
class align_dotplot (line 7) | class align_dotplot:
method __init__ (line 8) | def __init__(self, options):
method pair_position (line 26) | def pair_position(self, alignment, loc1, loc2, colors):
method run (line 35) | def run(self):
method process_ancestor (line 93) | def process_ancestor(self, ancestor_file, lens_index):
method process_blockinfo (line 101) | def process_blockinfo(self, lens1, lens2):
method alignment (line 111) | def alignment(self, gff1, gff2, bkinfo):
method create_block_dataframe (line 140) | def create_block_dataframe(self, row):
method ancestor_position (line 150) | def ancestor_position(self, ax, gff, lens, mark):
FILE: build/lib/wgdi/ancestral_karyotype.py
class ancestral_karyotype (line 6) | class ancestral_karyotype:
method __init__ (line 7) | def __init__(self, options):
method run (line 15) | def run(self):
FILE: build/lib/wgdi/ancestral_karyotype_repertoire.py
class ancestral_karyotype_repertoire (line 8) | class ancestral_karyotype_repertoire():
method __init__ (line 9) | def __init__(self, options):
method run (line 19) | def run(self):
FILE: build/lib/wgdi/base.py
function gen_md5_id (line 15) | def gen_md5_id(item):
function config (line 20) | def config():
function load_conf (line 27) | def load_conf(file, section):
function rewrite (line 34) | def rewrite(file, section):
function read_colinearscan (line 48) | def read_colinearscan(file):
function read_mcscanx (line 74) | def read_mcscanx(fn):
function read_jcvi (line 100) | def read_jcvi(fn):
function read_collinearity (line 120) | def read_collinearity(fn):
function read_ks (line 144) | def read_ks(file, col):
function get_median (line 154) | def get_median(data):
function cds_to_pep (line 163) | def cds_to_pep(cds_file, pep_file, fmt='fasta'):
function newblast (line 172) | def newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):
function newgff (line 186) | def newgff(file):
function newlens (line 198) | def newlens(file, position):
function read_classification (line 209) | def read_classification(file):
function gene_location (line 220) | def gene_location(gff, lens, step, position):
function dotplot_frame (line 233) | def dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, gen...
function Bezier3 (line 259) | def Bezier3(plist, t):
function Bezier4 (line 265) | def Bezier4(plist, t):
function Rectangle (line 271) | def Rectangle(ax, loc, height, width, color, alpha):
function str_to_bool (line 276) | def str_to_bool(s):
FILE: build/lib/wgdi/block_correspondence.py
class block_correspondence (line 6) | class block_correspondence():
method __init__ (line 7) | def __init__(self, options):
method run (line 28) | def run(self):
method preprocess_blockinfo (line 50) | def preprocess_blockinfo(self, bkinfo, lens1, lens2):
method initialize_correspondence (line 66) | def initialize_correspondence(self, lens1, lens2):
method remove_tandem (line 79) | def remove_tandem(self, bkinfo):
method remove_ks_hit (line 88) | def remove_ks_hit(self, bkinfo):
method get_ks_value (line 97) | def get_ks_value(self, ks_str):
method collinearity_region (line 103) | def collinearity_region(self, cor, bkinfo, lens):
method is_valid_homo (line 124) | def is_valid_homo(self, row):
FILE: build/lib/wgdi/block_info.py
class block_info (line 6) | class block_info:
method __init__ (line 7) | def __init__(self, options):
method block_position (line 18) | def block_position(self, collinearity, blast, gff1, gff2, ks):
method get_ks_value (line 77) | def get_ks_value(self, ks, k):
method calculate_ks_statistics (line 87) | def calculate_ks_statistics(self, blk_ks):
method calculate_homo_statistics (line 96) | def calculate_homo_statistics(self, blk_homo):
method blast_homo (line 102) | def blast_homo(self, blast, gff1, gff2, repeat_number):
method tandem_ratio (line 122) | def tandem_ratio(self, blast, gff2, block):
method run (line 136) | def run(self):
method auto_file (line 167) | def auto_file(self, gff1, gff2):
method process_mcscanx (line 182) | def process_mcscanx(self, gff1, gff2):
method process_jcvi (line 194) | def process_jcvi(self, gff1, gff2):
FILE: build/lib/wgdi/block_ks.py
class block_ks (line 8) | class block_ks:
method __init__ (line 9) | def __init__(self, options):
method block_position (line 34) | def block_position(self, bkinfo, lens1, lens2, step1, step2):
method remove_tandem (line 62) | def remove_tandem(self, bkinfo):
method run (line 75) | def run(self):
FILE: build/lib/wgdi/circos.py
class circos (line 13) | class circos():
method __init__ (line 14) | def __init__(self, options):
method plot_circle (line 31) | def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, l...
method plot_labels (line 39) | def plot_labels(self, root, labels, loc_chr, radius, horizontalalignme...
method Wedge (line 51) | def Wedge(self, ax, loc, radius, start, end, width, color, alpha):
method plot_bar (line 56) | def plot_bar(self, df, radius, length, lw, color, alpha):
method chr_location (line 75) | def chr_location(self, lens, angle_gap, angle):
method deal_alignment (line 83) | def deal_alignment(self, alignment, gff, lens, loc_chr, angle):
method deal_ancestor (line 105) | def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al):
method plot_collinearity (line 133) | def plot_collinearity(self, data, radius, lw=0.02, alpha=1):
method plot_legend (line 153) | def plot_legend(self, ax, chr_color, width, height):
method run (line 169) | def run(self):
FILE: build/lib/wgdi/collinearity.py
class collinearity (line 5) | class collinearity:
method __init__ (line 6) | def __init__(self, options, points):
method get_matrix (line 30) | def get_matrix(self):
method run (line 42) | def run(self):
method score_matrix (line 72) | def score_matrix(self):
method max_path (line 117) | def max_path(self, points):
method p_value_estimated (line 146) | def p_value_estimated(self, gap, L1, L2):
FILE: build/lib/wgdi/dotplot.py
class dotplot (line 10) | class dotplot():
method __init__ (line 11) | def __init__(self, options):
method pair_positon (line 31) | def pair_positon(self, blast, gff1, gff2, rednum, repeat_number):
method run (line 58) | def run(self):
method ancestor_posion (line 122) | def ancestor_posion(self, ax, gff, lens, mark):
FILE: build/lib/wgdi/fusion_positions_database.py
class fusion_positions_database (line 5) | class fusion_positions_database:
method __init__ (line 6) | def __init__(self, options):
method run (line 11) | def run(self):
FILE: build/lib/wgdi/fusions_detection.py
class fusions_detection (line 4) | class fusions_detection:
method __init__ (line 5) | def __init__(self, options):
method run (line 14) | def run(self):
method count_non_overlapping (line 45) | def count_non_overlapping(self, group):
FILE: build/lib/wgdi/karyotype.py
class karyotype (line 7) | class karyotype():
method __init__ (line 8) | def __init__(self, options):
method run (line 22) | def run(self):
FILE: build/lib/wgdi/karyotype_mapping.py
class karyotype_mapping (line 7) | class karyotype_mapping:
method __init__ (line 8) | def __init__(self, options):
method karyotype_left (line 28) | def karyotype_left(self, pairs, ancestor, gff1, gff2):
method karyotype_top (line 44) | def karyotype_top(self, pairs, ancestor, gff1, gff2):
method karyotype_map (line 60) | def karyotype_map(self, gff, lens):
method colinear_gene_pairs (line 91) | def colinear_gene_pairs(self, bkinfo, gff1, gff2):
method new_ancestor (line 109) | def new_ancestor(self, ancestor, gff1, gff2, blast):
method run (line 169) | def run(self):
FILE: build/lib/wgdi/ks.py
class ks (line 11) | class ks:
method __init__ (line 12) | def __init__(self, options):
method auto_file (line 26) | def auto_file(self):
method run (line 57) | def run(self):
method pair_kaks (line 127) | def pair_kaks(self, k):
method align (line 143) | def align(self):
method pal2nal (line 158) | def pal2nal(self):
method run_yn00 (line 167) | def run_yn00(self):
FILE: build/lib/wgdi/ks_peaks.py
class kspeaks (line 8) | class kspeaks:
method __init__ (line 9) | def __init__(self, options):
method remove_tandem (line 32) | def remove_tandem(self, bkinfo):
method ks_kde (line 46) | def ks_kde(self, df):
method run (line 77) | def run(self):
FILE: build/lib/wgdi/ksfigure.py
class ksfigure (line 11) | class ksfigure():
method __init__ (line 12) | def __init__(self, options):
method Gaussian_distribution (line 32) | def Gaussian_distribution(self, t, k):
method run (line 45) | def run(self):
FILE: build/lib/wgdi/peaksfit.py
class peaksfit (line 13) | class peaksfit():
method __init__ (line 14) | def __init__(self, options):
method ks_values (line 29) | def ks_values(self, df):
method gaussian_fuc (line 40) | def gaussian_fuc(self, x, *params):
method kde_fit (line 49) | def kde_fit(self, data, x):
method run (line 67) | def run(self):
FILE: build/lib/wgdi/pindex.py
class pindex (line 9) | class pindex():
method __init__ (line 10) | def __init__(self, options):
method Pindex (line 23) | def Pindex(self, sub1, sub2):
method retain (line 42) | def retain(self, arr):
method run (line 56) | def run(self):
method cal_pindex (line 70) | def cal_pindex(self, alignment):
method turn_percentage (line 100) | def turn_percentage(self, x):
method infomation (line 103) | def infomation(self, df):
FILE: build/lib/wgdi/polyploidy_classification.py
class polyploidy_classification (line 5) | class polyploidy_classification:
method __init__ (line 6) | def __init__(self, options):
method run (line 19) | def run(self):
FILE: build/lib/wgdi/retain.py
class retain (line 5) | class retain:
method __init__ (line 6) | def __init__(self, options):
method run (line 21) | def run(self):
method align_chr (line 91) | def align_chr(self, alignment):
method moving_average (line 108) | def moving_average(self, arr):
FILE: build/lib/wgdi/run.py
function run_subprogram (line 73) | def run_subprogram(program, conf, name):
function run_configure (line 79) | def run_configure():
function module_to_run (line 83) | def module_to_run(argument, conf):
function main (line 119) | def main():
FILE: build/lib/wgdi/run_colliearity.py
class mycollinearity (line 13) | class mycollinearity():
method __init__ (line 14) | def __init__(self, options):
method deal_blast_for_chromosomes (line 42) | def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):
method deal_blast_for_genomes (line 59) | def deal_blast_for_genomes(self, blast, rednum, repeat_number):
method run (line 88) | def run(self):
method single_pool (line 158) | def single_pool(self, group, gff1, gff2, lens1, lens2):
FILE: build/lib/wgdi/shared_fusion.py
class shared_fusion (line 4) | class shared_fusion:
method __init__ (line 5) | def __init__(self, options):
method run (line 18) | def run(self):
method block_fusions (line 62) | def block_fusions(self, bkinfo, ancestor_left, ancestor_top):
FILE: build/lib/wgdi/trees.py
class trees (line 13) | class trees():
method __init__ (line 14) | def __init__(self, options):
method grouping (line 36) | def grouping(self, alignment):
method codon (line 85) | def codon(self):
method pal2nal (line 105) | def pal2nal(self):
method align (line 115) | def align(self):
method trimal (line 130) | def trimal(self):
method divvier (line 140) | def divvier(self):
method buildtrees (line 150) | def buildtrees(self):
method run (line 174) | def run(self):
FILE: wgdi/align_dotplot.py
class align_dotplot (line 7) | class align_dotplot:
method __init__ (line 8) | def __init__(self, options):
method pair_position (line 26) | def pair_position(self, alignment, loc1, loc2, colors):
method run (line 35) | def run(self):
method process_ancestor (line 93) | def process_ancestor(self, ancestor_file, lens_index):
method process_blockinfo (line 101) | def process_blockinfo(self, lens1, lens2):
method alignment (line 111) | def alignment(self, gff1, gff2, bkinfo):
method create_block_dataframe (line 140) | def create_block_dataframe(self, row):
method ancestor_position (line 150) | def ancestor_position(self, ax, gff, lens, mark):
FILE: wgdi/ancestral_karyotype.py
class ancestral_karyotype (line 6) | class ancestral_karyotype:
method __init__ (line 7) | def __init__(self, options):
method run (line 15) | def run(self):
FILE: wgdi/ancestral_karyotype_repertoire.py
class ancestral_karyotype_repertoire (line 8) | class ancestral_karyotype_repertoire():
method __init__ (line 9) | def __init__(self, options):
method run (line 19) | def run(self):
FILE: wgdi/base.py
function gen_md5_id (line 15) | def gen_md5_id(item):
function config (line 20) | def config():
function load_conf (line 27) | def load_conf(file, section):
function rewrite (line 34) | def rewrite(file, section):
function read_colinearscan (line 48) | def read_colinearscan(file):
function read_mcscanx (line 74) | def read_mcscanx(fn):
function read_jcvi (line 100) | def read_jcvi(fn):
function read_collinearity (line 120) | def read_collinearity(fn):
function read_ks (line 144) | def read_ks(file, col):
function get_median (line 154) | def get_median(data):
function cds_to_pep (line 163) | def cds_to_pep(cds_file, pep_file, fmt='fasta'):
function newblast (line 172) | def newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):
function newgff (line 186) | def newgff(file):
function newlens (line 198) | def newlens(file, position):
function read_classification (line 209) | def read_classification(file):
function gene_location (line 220) | def gene_location(gff, lens, step, position):
function dotplot_frame (line 233) | def dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, gen...
function Bezier3 (line 259) | def Bezier3(plist, t):
function Bezier4 (line 265) | def Bezier4(plist, t):
function Rectangle (line 271) | def Rectangle(ax, loc, height, width, color, alpha):
function str_to_bool (line 276) | def str_to_bool(s):
FILE: wgdi/block_correspondence.py
class block_correspondence (line 6) | class block_correspondence():
method __init__ (line 7) | def __init__(self, options):
method run (line 28) | def run(self):
method preprocess_blockinfo (line 50) | def preprocess_blockinfo(self, bkinfo, lens1, lens2):
method initialize_correspondence (line 66) | def initialize_correspondence(self, lens1, lens2):
method remove_tandem (line 79) | def remove_tandem(self, bkinfo):
method remove_ks_hit (line 88) | def remove_ks_hit(self, bkinfo):
method get_ks_value (line 97) | def get_ks_value(self, ks_str):
method collinearity_region (line 103) | def collinearity_region(self, cor, bkinfo, lens):
method is_valid_homo (line 124) | def is_valid_homo(self, row):
FILE: wgdi/block_info.py
class block_info (line 6) | class block_info:
method __init__ (line 7) | def __init__(self, options):
method block_position (line 18) | def block_position(self, collinearity, blast, gff1, gff2, ks):
method get_ks_value (line 77) | def get_ks_value(self, ks, k):
method calculate_ks_statistics (line 87) | def calculate_ks_statistics(self, blk_ks):
method calculate_homo_statistics (line 96) | def calculate_homo_statistics(self, blk_homo):
method blast_homo (line 102) | def blast_homo(self, blast, gff1, gff2, repeat_number):
method tandem_ratio (line 122) | def tandem_ratio(self, blast, gff2, block):
method run (line 136) | def run(self):
method auto_file (line 167) | def auto_file(self, gff1, gff2):
method process_mcscanx (line 182) | def process_mcscanx(self, gff1, gff2):
method process_jcvi (line 194) | def process_jcvi(self, gff1, gff2):
FILE: wgdi/block_ks.py
class block_ks (line 8) | class block_ks:
method __init__ (line 9) | def __init__(self, options):
method block_position (line 34) | def block_position(self, bkinfo, lens1, lens2, step1, step2):
method remove_tandem (line 62) | def remove_tandem(self, bkinfo):
method run (line 75) | def run(self):
FILE: wgdi/circos.py
class circos (line 13) | class circos():
method __init__ (line 14) | def __init__(self, options):
method plot_circle (line 31) | def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, l...
method plot_labels (line 39) | def plot_labels(self, root, labels, loc_chr, radius, horizontalalignme...
method Wedge (line 51) | def Wedge(self, ax, loc, radius, start, end, width, color, alpha):
method plot_bar (line 56) | def plot_bar(self, df, radius, length, lw, color, alpha):
method chr_location (line 75) | def chr_location(self, lens, angle_gap, angle):
method deal_alignment (line 83) | def deal_alignment(self, alignment, gff, lens, loc_chr, angle):
method deal_ancestor (line 105) | def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al):
method plot_collinearity (line 133) | def plot_collinearity(self, data, radius, lw=0.02, alpha=1):
method plot_legend (line 153) | def plot_legend(self, ax, chr_color, width, height):
method run (line 169) | def run(self):
FILE: wgdi/collinearity.py
class collinearity (line 5) | class collinearity:
method __init__ (line 6) | def __init__(self, options, points):
method get_matrix (line 30) | def get_matrix(self):
method run (line 42) | def run(self):
method score_matrix (line 72) | def score_matrix(self):
method max_path (line 117) | def max_path(self, points):
method p_value_estimated (line 146) | def p_value_estimated(self, gap, L1, L2):
FILE: wgdi/dotplot.py
class dotplot (line 10) | class dotplot():
method __init__ (line 11) | def __init__(self, options):
method pair_positon (line 31) | def pair_positon(self, blast, gff1, gff2, rednum, repeat_number):
method run (line 58) | def run(self):
method ancestor_posion (line 122) | def ancestor_posion(self, ax, gff, lens, mark):
FILE: wgdi/fusion_positions_database.py
class fusion_positions_database (line 5) | class fusion_positions_database:
method __init__ (line 6) | def __init__(self, options):
method run (line 11) | def run(self):
FILE: wgdi/fusions_detection.py
class fusions_detection (line 4) | class fusions_detection:
method __init__ (line 5) | def __init__(self, options):
method run (line 14) | def run(self):
method count_non_overlapping (line 45) | def count_non_overlapping(self, group):
FILE: wgdi/karyotype.py
class karyotype (line 7) | class karyotype():
method __init__ (line 8) | def __init__(self, options):
method run (line 22) | def run(self):
FILE: wgdi/karyotype_mapping.py
class karyotype_mapping (line 7) | class karyotype_mapping:
method __init__ (line 8) | def __init__(self, options):
method karyotype_left (line 28) | def karyotype_left(self, pairs, ancestor, gff1, gff2):
method karyotype_top (line 44) | def karyotype_top(self, pairs, ancestor, gff1, gff2):
method karyotype_map (line 60) | def karyotype_map(self, gff, lens):
method colinear_gene_pairs (line 91) | def colinear_gene_pairs(self, bkinfo, gff1, gff2):
method new_ancestor (line 109) | def new_ancestor(self, ancestor, gff1, gff2, blast):
method run (line 169) | def run(self):
FILE: wgdi/ks.py
class ks (line 11) | class ks:
method __init__ (line 12) | def __init__(self, options):
method auto_file (line 26) | def auto_file(self):
method run (line 57) | def run(self):
method pair_kaks (line 127) | def pair_kaks(self, k):
method align (line 143) | def align(self):
method pal2nal (line 158) | def pal2nal(self):
method run_yn00 (line 167) | def run_yn00(self):
FILE: wgdi/ks_peaks.py
class kspeaks (line 8) | class kspeaks:
method __init__ (line 9) | def __init__(self, options):
method remove_tandem (line 32) | def remove_tandem(self, bkinfo):
method ks_kde (line 46) | def ks_kde(self, df):
method run (line 77) | def run(self):
FILE: wgdi/ksfigure.py
class ksfigure (line 11) | class ksfigure():
method __init__ (line 12) | def __init__(self, options):
method Gaussian_distribution (line 32) | def Gaussian_distribution(self, t, k):
method run (line 45) | def run(self):
FILE: wgdi/peaksfit.py
class peaksfit (line 13) | class peaksfit():
method __init__ (line 14) | def __init__(self, options):
method ks_values (line 29) | def ks_values(self, df):
method gaussian_fuc (line 40) | def gaussian_fuc(self, x, *params):
method kde_fit (line 49) | def kde_fit(self, data, x):
method run (line 67) | def run(self):
FILE: wgdi/pindex.py
class pindex (line 9) | class pindex():
method __init__ (line 10) | def __init__(self, options):
method Pindex (line 23) | def Pindex(self, sub1, sub2):
method retain (line 42) | def retain(self, arr):
method run (line 56) | def run(self):
method cal_pindex (line 70) | def cal_pindex(self, alignment):
method turn_percentage (line 100) | def turn_percentage(self, x):
method infomation (line 103) | def infomation(self, df):
FILE: wgdi/polyploidy_classification.py
class polyploidy_classification (line 5) | class polyploidy_classification:
method __init__ (line 6) | def __init__(self, options):
method run (line 19) | def run(self):
FILE: wgdi/retain.py
class retain (line 5) | class retain:
method __init__ (line 6) | def __init__(self, options):
method run (line 21) | def run(self):
method align_chr (line 91) | def align_chr(self, alignment):
method moving_average (line 108) | def moving_average(self, arr):
FILE: wgdi/run.py
function run_subprogram (line 73) | def run_subprogram(program, conf, name):
function run_configure (line 79) | def run_configure():
function module_to_run (line 83) | def module_to_run(argument, conf):
function main (line 119) | def main():
FILE: wgdi/run_colliearity.py
class mycollinearity (line 13) | class mycollinearity():
method __init__ (line 14) | def __init__(self, options):
method deal_blast_for_chromosomes (line 42) | def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):
method deal_blast_for_genomes (line 59) | def deal_blast_for_genomes(self, blast, rednum, repeat_number):
method run (line 88) | def run(self):
method single_pool (line 158) | def single_pool(self, group, gff1, gff2, lens1, lens2):
FILE: wgdi/shared_fusion.py
class shared_fusion (line 4) | class shared_fusion:
method __init__ (line 5) | def __init__(self, options):
method run (line 18) | def run(self):
method block_fusions (line 62) | def block_fusions(self, bkinfo, ancestor_left, ancestor_top):
FILE: wgdi/trees.py
class trees (line 13) | class trees():
method __init__ (line 14) | def __init__(self, options):
method grouping (line 36) | def grouping(self, alignment):
method codon (line 85) | def codon(self):
method pal2nal (line 105) | def pal2nal(self):
method align (line 115) | def align(self):
method trimal (line 130) | def trimal(self):
method divvier (line 140) | def divvier(self):
method buildtrees (line 150) | def buildtrees(self):
method run (line 174) | def run(self):
Condensed preview — 115 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (335K chars).
[
{
"path": "LICENSE",
"chars": 1291,
"preview": "Copyright (c) 2018-2018, Pengchuan Sun\n\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or"
},
{
"path": "README.md",
"chars": 4670,
"preview": "# WGDI\n\n ["
},
{
"path": "__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "build/lib/wgdi/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "build/lib/wgdi/align_dotplot.py",
"chars": 7097,
"preview": "import re\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\nclass align_d"
},
{
"path": "build/lib/wgdi/ancestral_karyotype.py",
"chars": 3024,
"preview": "import pandas as pd\nfrom Bio import SeqIO\nimport wgdi.base as base\n\n\nclass ancestral_karyotype:\n def __init__(self, o"
},
{
"path": "build/lib/wgdi/ancestral_karyotype_repertoire.py",
"chars": 3319,
"preview": "\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\n\nimport wgdi.base as base\n\nclass ancestral_karyotype_reper"
},
{
"path": "build/lib/wgdi/base.py",
"chars": 9431,
"preview": "import configparser\nimport hashlib\nimport os\nimport re\n\nimport matplotlib\nimport matplotlib.patches as mpatches\nimport n"
},
{
"path": "build/lib/wgdi/block_correspondence.py",
"chars": 5121,
"preview": "import re\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\nclass block_correspondence():\n def __init_"
},
{
"path": "build/lib/wgdi/block_info.py",
"chars": 8681,
"preview": "import numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass block_info:\n def __init__(self, options):\n "
},
{
"path": "build/lib/wgdi/block_ks.py",
"chars": 5767,
"preview": "import re\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass block_"
},
{
"path": "build/lib/wgdi/circos.py",
"chars": 11217,
"preview": "import re\nimport sys\n\nimport matplotlib as mpl\nimport matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\nimp"
},
{
"path": "build/lib/wgdi/collinearity.py",
"chars": 7394,
"preview": "import numpy as np\nimport pandas as pd\n\n\nclass collinearity:\n def __init__(self, options, points):\n # Default "
},
{
"path": "build/lib/wgdi/dotplot.py",
"chars": 6144,
"preview": "import re\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass dotp"
},
{
"path": "build/lib/wgdi/example/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "build/lib/wgdi/example/align.conf",
"chars": 382,
"preview": "[alignment]\nblockinfo = block information file (.csv)\nblockinfo_reverse = false\nclassid = class1\ngff1 = gff1 file\ngff2"
},
{
"path": "build/lib/wgdi/example/alignmenttrees.conf",
"chars": 551,
"preview": "[alignmenttrees]\nalignment = alignment file (.csv)\ngff = gff file (reference genome, If alignment has no reference speci"
},
{
"path": "build/lib/wgdi/example/ancestral_karyotype.conf",
"chars": 333,
"preview": "[ancestral_karyotype]\ngff = gff file (cat the relevant 'gff' files into a file)\npep_file = pep file (cat the relevant 'p"
},
{
"path": "build/lib/wgdi/example/ancestral_karyotype_repertoire.conf",
"chars": 457,
"preview": "[ancestral_karyotype_repertoire]\nblockinfo = block information (*.csv)\n# blockinfo: processed *.csv\nblockinfo_reverse ="
},
{
"path": "build/lib/wgdi/example/blockinfo.conf",
"chars": 267,
"preview": "[blockinfo]\nblast = blast file\ngff1 = gff1 file\ngff2 = gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ncollinearity = "
},
{
"path": "build/lib/wgdi/example/blockks.conf",
"chars": 301,
"preview": "[blockks]\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name = Genome1 name\ngenome2_name = Genome2 name\nblockinfo = bl"
},
{
"path": "build/lib/wgdi/example/circos.conf",
"chars": 426,
"preview": "[circos]\ngff = gff file\nlens = lens file\nradius = 0.2\nangle_gap = 0.05\nring_width = 0.015\ncolors = 1:c,2:m,3:blue,4:g"
},
{
"path": "build/lib/wgdi/example/collinearity.conf",
"chars": 306,
"preview": "[collinearity]\ngff1 = gff1 file\ngff2 = gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\nblast = blast file\nblast_reverse "
},
{
"path": "build/lib/wgdi/example/conf.ini",
"chars": 476,
"preview": "[ini]\nmafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft\npal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2na"
},
{
"path": "build/lib/wgdi/example/corr.conf",
"chars": 225,
"preview": "[correspondence]\nblockinfo = blockinfo file(.csv) \nlens1 = lens1 file\nlens2 = lens2 file\ntandem = true\ntandem_length = "
},
{
"path": "build/lib/wgdi/example/dotplot.conf",
"chars": 404,
"preview": "[dotplot]\nblast = blast file\ngff1 = gff1 file\ngff2 = gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name = G"
},
{
"path": "build/lib/wgdi/example/fusion_positions_database.conf",
"chars": 266,
"preview": "[fusion_positions_database]\npep = pep file\ngff = gff file\nfusion_positions = fusion_positions file\n# Number of gene sets"
},
{
"path": "build/lib/wgdi/example/fusions_detection.conf",
"chars": 244,
"preview": "[fusions_detection]\nblockinfo = block information (*.csv)\nancestor = ancestor file\n#The number of genes spanned by a syn"
},
{
"path": "build/lib/wgdi/example/karyotype.conf",
"chars": 116,
"preview": "[karyotype]\nancestor = ancestor chromosome file\nwidth = 0.5\nfigsize = 10,6.18\nsavefig = save image(.png, .pdf, .svg)"
},
{
"path": "build/lib/wgdi/example/karyotype_mapping.conf",
"chars": 420,
"preview": "[karyotype_mapping]\nblast = blast file\nblast_reverse = false\ngff1 = gff1 file\ngff2 = gff2 file \nscore = 100\nevalue = 1e-"
},
{
"path": "build/lib/wgdi/example/ks.conf",
"chars": 176,
"preview": "[ks]\ncds_file = \tcds file \n#cat all cds files together\npep_file = \tpep file\n#cat all pep files together\nalign_software ="
},
{
"path": "build/lib/wgdi/example/ks_fit_result.csv",
"chars": 377,
"preview": ",color,linewidth,linestyle,,,,,,\ncsa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862"
},
{
"path": "build/lib/wgdi/example/ksfigure.conf",
"chars": 239,
"preview": "[ksfigure]\nksfit = ksfit result(*.csv)\nlabelfontsize = 15\nlegendfontsize = 15\nxlabel = none \nylabel = none "
},
{
"path": "build/lib/wgdi/example/kspeaks.conf",
"chars": 247,
"preview": "[kspeaks]\nblockinfo = block information (*.csv)\npvalue = 0.2\ntandem = true\nblock_length = int number\nks_area = 0,10\nmult"
},
{
"path": "build/lib/wgdi/example/peaksfit.conf",
"chars": 191,
"preview": "[peaksfit]\nblockinfo = block information (*.csv)\nmode = median\nbins_number = 200\nks_area = 0,10\nfontsize = 9\narea = 0,3\n"
},
{
"path": "build/lib/wgdi/example/pindex.conf",
"chars": 169,
"preview": "[pindex]\nalignment = alignment file (.csv)\ngff = gff file\nlens =lens file\ngap = 50\nretention = 0.05\ndiff = 0.05\nremove_d"
},
{
"path": "build/lib/wgdi/example/polyploidy_classification.conf",
"chars": 231,
"preview": "[polyploidy classification]\nblockinfo = block information (*.csv)\nancestor_left = ancestor file\nancestor_top = ancestor "
},
{
"path": "build/lib/wgdi/example/retain.conf",
"chars": 224,
"preview": "[retain]\nalignment = alignment file\ngff = gff file\nlens = lens file\ncolors = red,blue,green\nrefgenome = shorthand\nfigsiz"
},
{
"path": "build/lib/wgdi/example/shared_fusion.conf",
"chars": 323,
"preview": "[shared_fusion]\nblockinfo = block information (*.csv)\n# The new lens file is the output filtered by lens file.\nlens1 = l"
},
{
"path": "build/lib/wgdi/fusion_positions_database.py",
"chars": 3037,
"preview": "import pandas as pd\nimport os\nfrom Bio import SeqIO\n\nclass fusion_positions_database:\n def __init__(self, options):\n "
},
{
"path": "build/lib/wgdi/fusions_detection.py",
"chars": 2903,
"preview": "import pandas as pd\nfrom tabulate import tabulate\n\nclass fusions_detection:\n def __init__(self, options):\n sel"
},
{
"path": "build/lib/wgdi/karyotype.py",
"chars": 1580,
"preview": "import matplotlib.pyplot as plt\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass karyotype():\n def __init__(self"
},
{
"path": "build/lib/wgdi/karyotype_mapping.py",
"chars": 9669,
"preview": "import numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass karyotype_mapping:\n def __init__(self, optio"
},
{
"path": "build/lib/wgdi/ks.py",
"chars": 6469,
"preview": "import os\nimport sys\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\nimport subprocess\nfrom Bio.Phylo.PAML "
},
{
"path": "build/lib/wgdi/ks_peaks.py",
"chars": 4936,
"preview": "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.stats.kde import gaussian_kde\n\nimport "
},
{
"path": "build/lib/wgdi/ksfigure.py",
"chars": 3528,
"preview": "import re\nimport sys\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\nfr"
},
{
"path": "build/lib/wgdi/peaksfit.py",
"chars": 3468,
"preview": "import re\nimport sys\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.optimize import "
},
{
"path": "build/lib/wgdi/pindex.py",
"chars": 5176,
"preview": "import os\nimport sys\n\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass pindex():\n def __init__"
},
{
"path": "build/lib/wgdi/polyploidy_classification.py",
"chars": 3467,
"preview": "import pandas as pd\nimport wgdi.base as base\n\n\nclass polyploidy_classification:\n def __init__(self, options):\n "
},
{
"path": "build/lib/wgdi/retain.py",
"chars": 5294,
"preview": "import matplotlib.pyplot as plt\nimport pandas as pd\nimport wgdi.base as base\n\nclass retain:\n def __init__(self, optio"
},
{
"path": "build/lib/wgdi/run.py",
"chars": 7840,
"preview": "import argparse\nimport os\nimport shutil\nimport sys\n\nimport wgdi\nimport wgdi.base as base\nfrom wgdi.align_dotplot import "
},
{
"path": "build/lib/wgdi/run_colliearity.py",
"chars": 8233,
"preview": "import gc\nimport re\nimport sys\nfrom multiprocessing import Pool\n\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.bas"
},
{
"path": "build/lib/wgdi/shared_fusion.py",
"chars": 4438,
"preview": "import pandas as pd\nimport wgdi.base as base\n\nclass shared_fusion:\n def __init__(self, options):\n for k, v in "
},
{
"path": "build/lib/wgdi/trees.py",
"chars": 7904,
"preview": "import os\nimport shutil\nfrom io import StringIO\n\nimport numpy as np\nimport pandas as pd\nfrom Bio import AlignIO, Seq, Se"
},
{
"path": "command.txt",
"chars": 53,
"preview": "python setup.py sdist bdist_wheel\ntwine upload dist/*"
},
{
"path": "setup.py",
"chars": 1120,
"preview": "#!/usr/bin/env python\n# -*- coding: UTF-8 -*-\n\nfrom setuptools import find_packages, setup\n\nwith open(\"README.md\", \"r\", "
},
{
"path": "wgdi/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "wgdi/align_dotplot.py",
"chars": 7097,
"preview": "import re\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\nclass align_d"
},
{
"path": "wgdi/ancestral_karyotype.py",
"chars": 3024,
"preview": "import pandas as pd\nfrom Bio import SeqIO\nimport wgdi.base as base\n\n\nclass ancestral_karyotype:\n def __init__(self, o"
},
{
"path": "wgdi/ancestral_karyotype_repertoire.py",
"chars": 3319,
"preview": "\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\n\nimport wgdi.base as base\n\nclass ancestral_karyotype_reper"
},
{
"path": "wgdi/base.py",
"chars": 9431,
"preview": "import configparser\nimport hashlib\nimport os\nimport re\n\nimport matplotlib\nimport matplotlib.patches as mpatches\nimport n"
},
{
"path": "wgdi/block_correspondence.py",
"chars": 5121,
"preview": "import re\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\nclass block_correspondence():\n def __init_"
},
{
"path": "wgdi/block_info.py",
"chars": 8681,
"preview": "import numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass block_info:\n def __init__(self, options):\n "
},
{
"path": "wgdi/block_ks.py",
"chars": 5767,
"preview": "import re\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass block_"
},
{
"path": "wgdi/circos.py",
"chars": 11217,
"preview": "import re\nimport sys\n\nimport matplotlib as mpl\nimport matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\nimp"
},
{
"path": "wgdi/collinearity.py",
"chars": 7394,
"preview": "import numpy as np\nimport pandas as pd\n\n\nclass collinearity:\n def __init__(self, options, points):\n # Default "
},
{
"path": "wgdi/dotplot.py",
"chars": 6144,
"preview": "import re\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass dotp"
},
{
"path": "wgdi/example/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "wgdi/example/align.conf",
"chars": 382,
"preview": "[alignment]\nblockinfo = block information file (.csv)\nblockinfo_reverse = false\nclassid = class1\ngff1 = gff1 file\ngff2"
},
{
"path": "wgdi/example/alignmenttrees.conf",
"chars": 551,
"preview": "[alignmenttrees]\nalignment = alignment file (.csv)\ngff = gff file (reference genome, If alignment has no reference speci"
},
{
"path": "wgdi/example/ancestral_karyotype.conf",
"chars": 333,
"preview": "[ancestral_karyotype]\ngff = gff file (cat the relevant 'gff' files into a file)\npep_file = pep file (cat the relevant 'p"
},
{
"path": "wgdi/example/ancestral_karyotype_repertoire.conf",
"chars": 457,
"preview": "[ancestral_karyotype_repertoire]\nblockinfo = block information (*.csv)\n# blockinfo: processed *.csv\nblockinfo_reverse ="
},
{
"path": "wgdi/example/blockinfo.conf",
"chars": 267,
"preview": "[blockinfo]\nblast = blast file\ngff1 = gff1 file\ngff2 = gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ncollinearity = "
},
{
"path": "wgdi/example/blockks.conf",
"chars": 301,
"preview": "[blockks]\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name = Genome1 name\ngenome2_name = Genome2 name\nblockinfo = bl"
},
{
"path": "wgdi/example/circos.conf",
"chars": 426,
"preview": "[circos]\ngff = gff file\nlens = lens file\nradius = 0.2\nangle_gap = 0.05\nring_width = 0.015\ncolors = 1:c,2:m,3:blue,4:g"
},
{
"path": "wgdi/example/collinearity.conf",
"chars": 306,
"preview": "[collinearity]\ngff1 = gff1 file\ngff2 = gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\nblast = blast file\nblast_reverse "
},
{
"path": "wgdi/example/conf.ini",
"chars": 476,
"preview": "[ini]\nmafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft\npal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2na"
},
{
"path": "wgdi/example/corr.conf",
"chars": 225,
"preview": "[correspondence]\nblockinfo = blockinfo file(.csv) \nlens1 = lens1 file\nlens2 = lens2 file\ntandem = true\ntandem_length = "
},
{
"path": "wgdi/example/dotplot.conf",
"chars": 404,
"preview": "[dotplot]\nblast = blast file\ngff1 = gff1 file\ngff2 = gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name = G"
},
{
"path": "wgdi/example/fusion_positions_database.conf",
"chars": 266,
"preview": "[fusion_positions_database]\npep = pep file\ngff = gff file\nfusion_positions = fusion_positions file\n# Number of gene sets"
},
{
"path": "wgdi/example/fusions_detection.conf",
"chars": 244,
"preview": "[fusions_detection]\nblockinfo = block information (*.csv)\nancestor = ancestor file\n#The number of genes spanned by a syn"
},
{
"path": "wgdi/example/karyotype.conf",
"chars": 116,
"preview": "[karyotype]\nancestor = ancestor chromosome file\nwidth = 0.5\nfigsize = 10,6.18\nsavefig = save image(.png, .pdf, .svg)"
},
{
"path": "wgdi/example/karyotype_mapping.conf",
"chars": 420,
"preview": "[karyotype_mapping]\nblast = blast file\nblast_reverse = false\ngff1 = gff1 file\ngff2 = gff2 file \nscore = 100\nevalue = 1e-"
},
{
"path": "wgdi/example/ks.conf",
"chars": 176,
"preview": "[ks]\ncds_file = \tcds file \n#cat all cds files together\npep_file = \tpep file\n#cat all pep files together\nalign_software ="
},
{
"path": "wgdi/example/ks_fit_result.csv",
"chars": 377,
"preview": ",color,linewidth,linestyle,,,,,,\ncsa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862"
},
{
"path": "wgdi/example/ksfigure.conf",
"chars": 239,
"preview": "[ksfigure]\nksfit = ksfit result(*.csv)\nlabelfontsize = 15\nlegendfontsize = 15\nxlabel = none \nylabel = none "
},
{
"path": "wgdi/example/kspeaks.conf",
"chars": 247,
"preview": "[kspeaks]\nblockinfo = block information (*.csv)\npvalue = 0.2\ntandem = true\nblock_length = int number\nks_area = 0,10\nmult"
},
{
"path": "wgdi/example/peaksfit.conf",
"chars": 191,
"preview": "[peaksfit]\nblockinfo = block information (*.csv)\nmode = median\nbins_number = 200\nks_area = 0,10\nfontsize = 9\narea = 0,3\n"
},
{
"path": "wgdi/example/pindex.conf",
"chars": 169,
"preview": "[pindex]\nalignment = alignment file (.csv)\ngff = gff file\nlens =lens file\ngap = 50\nretention = 0.05\ndiff = 0.05\nremove_d"
},
{
"path": "wgdi/example/polyploidy_classification.conf",
"chars": 231,
"preview": "[polyploidy classification]\nblockinfo = block information (*.csv)\nancestor_left = ancestor file\nancestor_top = ancestor "
},
{
"path": "wgdi/example/retain.conf",
"chars": 224,
"preview": "[retain]\nalignment = alignment file\ngff = gff file\nlens = lens file\ncolors = red,blue,green\nrefgenome = shorthand\nfigsiz"
},
{
"path": "wgdi/example/shared_fusion.conf",
"chars": 323,
"preview": "[shared_fusion]\nblockinfo = block information (*.csv)\n# The new lens file is the output filtered by lens file.\nlens1 = l"
},
{
"path": "wgdi/fusion_positions_database.py",
"chars": 3037,
"preview": "import pandas as pd\nimport os\nfrom Bio import SeqIO\n\nclass fusion_positions_database:\n def __init__(self, options):\n "
},
{
"path": "wgdi/fusions_detection.py",
"chars": 2903,
"preview": "import pandas as pd\nfrom tabulate import tabulate\n\nclass fusions_detection:\n def __init__(self, options):\n sel"
},
{
"path": "wgdi/karyotype.py",
"chars": 1580,
"preview": "import matplotlib.pyplot as plt\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass karyotype():\n def __init__(self"
},
{
"path": "wgdi/karyotype_mapping.py",
"chars": 9669,
"preview": "import numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass karyotype_mapping:\n def __init__(self, optio"
},
{
"path": "wgdi/ks.py",
"chars": 6469,
"preview": "import os\nimport sys\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\nimport subprocess\nfrom Bio.Phylo.PAML "
},
{
"path": "wgdi/ks_peaks.py",
"chars": 4936,
"preview": "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.stats.kde import gaussian_kde\n\nimport "
},
{
"path": "wgdi/ksfigure.py",
"chars": 3528,
"preview": "import re\nimport sys\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\nfr"
},
{
"path": "wgdi/peaksfit.py",
"chars": 3468,
"preview": "import re\nimport sys\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.optimize import "
},
{
"path": "wgdi/pindex.py",
"chars": 5176,
"preview": "import os\nimport sys\n\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass pindex():\n def __init__"
},
{
"path": "wgdi/polyploidy_classification.py",
"chars": 3467,
"preview": "import pandas as pd\nimport wgdi.base as base\n\n\nclass polyploidy_classification:\n def __init__(self, options):\n "
},
{
"path": "wgdi/retain.py",
"chars": 5294,
"preview": "import matplotlib.pyplot as plt\nimport pandas as pd\nimport wgdi.base as base\n\nclass retain:\n def __init__(self, optio"
},
{
"path": "wgdi/run.py",
"chars": 7840,
"preview": "import argparse\nimport os\nimport shutil\nimport sys\n\nimport wgdi\nimport wgdi.base as base\nfrom wgdi.align_dotplot import "
},
{
"path": "wgdi/run_colliearity.py",
"chars": 8233,
"preview": "import gc\nimport re\nimport sys\nfrom multiprocessing import Pool\n\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.bas"
},
{
"path": "wgdi/shared_fusion.py",
"chars": 4438,
"preview": "import pandas as pd\nimport wgdi.base as base\n\nclass shared_fusion:\n def __init__(self, options):\n for k, v in "
},
{
"path": "wgdi/trees.py",
"chars": 7904,
"preview": "import os\nimport shutil\nfrom io import StringIO\n\nimport numpy as np\nimport pandas as pd\nfrom Bio import AlignIO, Seq, Se"
},
{
"path": "wgdi.egg-info/PKG-INFO",
"chars": 5358,
"preview": "Metadata-Version: 2.1\nName: wgdi\nVersion: 0.75\nSummary: A user-friendly toolkit for evolutionary analyses of whole-genom"
},
{
"path": "wgdi.egg-info/SOURCES.txt",
"chars": 1509,
"preview": "LICENSE\nREADME.md\nsetup.py\nwgdi/__init__.py\nwgdi/align_dotplot.py\nwgdi/ancestral_karyotype.py\nwgdi/ancestral_karyotype_r"
},
{
"path": "wgdi.egg-info/dependency_links.txt",
"chars": 1,
"preview": "\n"
},
{
"path": "wgdi.egg-info/entry_points.txt",
"chars": 39,
"preview": "[console_scripts]\nwgdi = wgdi.run:main\n"
},
{
"path": "wgdi.egg-info/requires.txt",
"chars": 56,
"preview": "pandas>=1.1.0\nnumpy\nbiopython\nmatplotlib\nscipy\ntabulate\n"
},
{
"path": "wgdi.egg-info/top_level.txt",
"chars": 5,
"preview": "wgdi\n"
},
{
"path": "wgdi.egg-info/zip-safe",
"chars": 1,
"preview": "\n"
}
]
// ... and 1 more file not shown in this preview