Repository: SunPengChuan/wgdi Branch: master Commit: 00375818da64 Files: 115 Total size: 311.6 KB Directory structure: gitextract_p42u6yxa/ ├── LICENSE ├── README.md ├── __init__.py ├── build/ │ └── lib/ │ └── wgdi/ │ ├── __init__.py │ ├── align_dotplot.py │ ├── ancestral_karyotype.py │ ├── ancestral_karyotype_repertoire.py │ ├── base.py │ ├── block_correspondence.py │ ├── block_info.py │ ├── block_ks.py │ ├── circos.py │ ├── collinearity.py │ ├── dotplot.py │ ├── example/ │ │ ├── __init__.py │ │ ├── align.conf │ │ ├── alignmenttrees.conf │ │ ├── ancestral_karyotype.conf │ │ ├── ancestral_karyotype_repertoire.conf │ │ ├── blockinfo.conf │ │ ├── blockks.conf │ │ ├── circos.conf │ │ ├── collinearity.conf │ │ ├── conf.ini │ │ ├── corr.conf │ │ ├── dotplot.conf │ │ ├── fusion_positions_database.conf │ │ ├── fusions_detection.conf │ │ ├── karyotype.conf │ │ ├── karyotype_mapping.conf │ │ ├── ks.conf │ │ ├── ks_fit_result.csv │ │ ├── ksfigure.conf │ │ ├── kspeaks.conf │ │ ├── peaksfit.conf │ │ ├── pindex.conf │ │ ├── polyploidy_classification.conf │ │ ├── retain.conf │ │ └── shared_fusion.conf │ ├── fusion_positions_database.py │ ├── fusions_detection.py │ ├── karyotype.py │ ├── karyotype_mapping.py │ ├── ks.py │ ├── ks_peaks.py │ ├── ksfigure.py │ ├── peaksfit.py │ ├── pindex.py │ ├── polyploidy_classification.py │ ├── retain.py │ ├── run.py │ ├── run_colliearity.py │ ├── shared_fusion.py │ └── trees.py ├── command.txt ├── dist/ │ └── wgdi-0.75-py3-none-any.whl ├── setup.py ├── wgdi/ │ ├── __init__.py │ ├── align_dotplot.py │ ├── ancestral_karyotype.py │ ├── ancestral_karyotype_repertoire.py │ ├── base.py │ ├── block_correspondence.py │ ├── block_info.py │ ├── block_ks.py │ ├── circos.py │ ├── collinearity.py │ ├── dotplot.py │ ├── example/ │ │ ├── __init__.py │ │ ├── align.conf │ │ ├── alignmenttrees.conf │ │ ├── ancestral_karyotype.conf │ │ ├── ancestral_karyotype_repertoire.conf │ │ ├── blockinfo.conf │ │ ├── blockks.conf │ │ ├── circos.conf │ │ ├── collinearity.conf 
│ │ ├── conf.ini │ │ ├── corr.conf │ │ ├── dotplot.conf │ │ ├── fusion_positions_database.conf │ │ ├── fusions_detection.conf │ │ ├── karyotype.conf │ │ ├── karyotype_mapping.conf │ │ ├── ks.conf │ │ ├── ks_fit_result.csv │ │ ├── ksfigure.conf │ │ ├── kspeaks.conf │ │ ├── peaksfit.conf │ │ ├── pindex.conf │ │ ├── polyploidy_classification.conf │ │ ├── retain.conf │ │ └── shared_fusion.conf │ ├── fusion_positions_database.py │ ├── fusions_detection.py │ ├── karyotype.py │ ├── karyotype_mapping.py │ ├── ks.py │ ├── ks_peaks.py │ ├── ksfigure.py │ ├── peaksfit.py │ ├── pindex.py │ ├── polyploidy_classification.py │ ├── retain.py │ ├── run.py │ ├── run_colliearity.py │ ├── shared_fusion.py │ └── trees.py └── wgdi.egg-info/ ├── PKG-INFO ├── SOURCES.txt ├── dependency_links.txt ├── entry_points.txt ├── requires.txt ├── top_level.txt └── zip-safe ================================================ FILE CONTENTS ================================================ ================================================ FILE: LICENSE ================================================ Copyright (c) 2018-2018, Pengchuan Sun All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

================================================
FILE: README.md
================================================
# WGDI

![Latest PyPI version](https://img.shields.io/pypi/v/wgdi.svg) [![Downloads](https://pepy.tech/badge/wgdi/month)](https://pepy.tech/project/wgdi) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/wgdi/README.html)

| | |
| --- | --- |
| Author | Pengchuan Sun ([sunpengchuan](https://github.com/sunpengchuan)) |
| Email | |
| License | [BSD](http://creativecommons.org/licenses/BSD/) |

## Description

**WGDI (Whole-Genome Duplication Integrated analysis)** is a Python-based command-line tool designed to simplify the analysis of whole-genome duplications (WGD) and cross-species genome alignments. It offers three main workflows that enhance the detection and study of WGD events:

## Key Features

### 1. Polyploid Inference

- Identifies and confirms polyploid events with high accuracy.

### 2. Genomic Homology Inference

- Traces the evolutionary history of duplicated regions across species, with a focus on distinguishing subgenomes.

### 3. Ancestral Karyotyping

- Reconstructs protochromosomes and traces common chromosomal rearrangements to understand chromosome evolution.

## Installation

WGDI is a Python package and command-line interface (CLI) for the analysis of whole-genome duplications.
WGDI can be deployed on Windows, Linux, and macOS, and can be installed via pip or conda.

#### Bioconda

```
conda install -c bioconda wgdi
```

#### PyPI

```
pip3 install wgdi
```

Documentation for installation, along with a user tutorial, a default parameter file, and test data, is provided. Please consult the docs at .

## Tips

Here are some videos with simple examples of WGDI.

###### [Simple usage of WGDI (Part 1)](https://www.bilibili.com/video/BV1qK4y1U7eK) or https://youtu.be/k-S6FVcBIQw

###### [Simple usage of WGDI (Part 2)](https://www.bilibili.com/video/BV195411P7L1) or https://youtu.be/QiZYFYGclyE

QQ chat group: 966612552

## Citing WGDI

If you use wgdi in your work, please cite:

> Sun P., Jiao B., Yang Y., Shan L., Li T., Li X., Xi Z., Wang X., and Liu J. (2022). WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant. doi: https://doi.org/10.1016/j.molp.2022.10.018.

## News

## 0.75

* Fixed some issues (-fpd, -km).
* Introduced a threads parameter for the iqtree command within alignmenttrees (-at).

## 0.74

* Improved the fusion positions dataset (-fpd).
* Fixed some issues (-pc).

## 0.7.1

* Added extraction of the fusion positions dataset (-fpd).
* Added detection of whether these fusion events occur in other genomes (-fd).
* Improved the karyotype_mapping (-km) effect.
* Fixed a problem caused by the Python version; WGDI is now compatible with version 3.12.

## 0.6.5

* Fixed some issues (-sf).
* Added new tips to avoid some errors.

## 0.6.4

* Fixed a problem caused by the Python version; WGDI is now compatible with version 3.11.3.

## 0.6.3

* Fixed some issues (-ks, -sf).

## 0.6.2

* Added detection of shared fusions between species (-sf).

## 0.6.1

* Fixed an issue with alignment (-a). Only version 0.6.0 has this bug.

## 0.6.0

* Fixed an issue with improved collinearity (-icl).
* Added a parameter 'tandem_ratio' to blockinfo (-bi).

## 0.5.9

* Updated the improved collinearity (-icl).
Faster than before, but slower than MCScanX and JCVI.
* Fixed an issue with ancestral karyotype repertoire (-akr).

## 0.5.8

* Fixed an issue with gene names (-ks).

## 0.5.7

* Fixed an issue with chromosome order (-ak).
* Fixed an issue with gene names (-ks). The issue is not fully fixed in this version; please install the latest version.

## 0.5.5 and 0.5.6

* Added ancestral karyotype (-ak).
* Added ancestral karyotype repertoire (-akr).

## 0.5.4

* Improved the karyotype_mapping (-km) effect.
* Minor changes (-at).

## 0.5.3

* Fixed a legend issue (-kf).
* Fixed a Ks calculation issue (-ks).
* Improved the karyotype_mapping (-km) effect.
* Improved the alignmenttrees (-at) effect.

## 0.5.2

* Fixed some bugs.

## 0.5.1

* Fixed an error in the command (-conf).
* Improved the karyotype_mapping (-km) effect.
* Added a new accepted data set for alignmenttrees (-at): low-copy data sets (for example, single-copy_groups.tsv from the sonicparanoid2 software).

## 0.4.9

* Added karyotype_mapping (-km) and karyotype (-k) display.
* Changed how the pvalue is extracted from collinearity (-icl), making this parameter more sensitive. It is therefore recommended to set it to 0.2 instead of 0.05.
* Changed the drawing display of ksfigure (-kf) to improve its appearance.
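
The workflows listed above are driven by section-based configuration files (examples ship under `wgdi/example/`, and `wgdi/base.py` reads them with Python's `configparser`). The following is a minimal sketch of that pattern, not WGDI's own code. The `[dotplot]` section, its option values, and the helper names `load_section` and `Options` are assumptions made for illustration; the real option names are defined in the shipped `.conf` files.

```python
import configparser

# Hypothetical configuration text: the section name and the values are
# illustrative only and are not copied from a shipped WGDI .conf file.
EXAMPLE_CONF = """
[dotplot]
genome1_name = species A
genome2_name = species B
markersize = 0.5
"""


def load_section(text, section):
    """Return the (key, value) pairs of one section; every value is a string."""
    conf = configparser.ConfigParser()
    conf.read_string(text)
    return conf.items(section)


class Options:
    """Copy each (key, value) pair onto the instance as an attribute."""

    def __init__(self, options):
        for key, value in options:
            setattr(self, str(key), value)


opts = Options(load_section(EXAMPLE_CONF, "dotplot"))
# Values arrive as strings, so numeric options must be cast before use.
markersize = float(opts.markersize)
print(opts.genome1_name, markersize)
```

Because every value is read as a string, numeric and boolean options have to be converted explicitly before they are used.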
================================================ FILE: __init__.py ================================================ ================================================ FILE: build/lib/wgdi/__init__.py ================================================ ================================================ FILE: build/lib/wgdi/align_dotplot.py ================================================ import re import matplotlib.pyplot as plt import numpy as np import pandas as pd import wgdi.base as base class align_dotplot: def __init__(self, options): # Default values self.position = 'order' self.figsize = 'default' self.classid = 'class1' # Initialize from options for k, v in options: setattr(self, str(k), v) print(f'{k} = {v}') self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')] self.colors = [str(k) for k in getattr(self, 'colors', 'red,blue,green,black,orange').split(',')] self.ancestor_top = None if getattr(self, 'ancestor_top', 'none') == 'none' else self.ancestor_top self.ancestor_left = None if getattr(self, 'ancestor_left', 'none') == 'none' else self.ancestor_left self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse) def pair_position(self, alignment, loc1, loc2, colors): alignment.index = alignment.index.map(loc1) data = [] for i, k in enumerate(alignment.columns): df = alignment[k].map(loc2).dropna() for idx, row in df.items(): data.append([idx, row, colors[i]]) return pd.DataFrame(data, columns=['loc1', 'loc2', 'color']) def run(self): axis = [0, 1, 1, 0] # Lens generation and figure size lens1 = base.newlens(self.lens1, self.position) lens2 = base.newlens(self.lens2, self.position) if re.search(r'\d', self.figsize): self.figsize = [float(k) for k in self.figsize.split(',')] else: self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10 plt.rcParams['ytick.major.pad'] = 0 # Create plot fig, ax = plt.subplots(figsize=self.figsize) ax.xaxis.set_ticks_position('top') step1, step2 = 1 / float(lens1.sum()), 1 / 
float(lens2.sum()) # Process Ancestor Data if self.ancestor_left: axis[0] = -0.02 lens_ancestor_left = self.process_ancestor(self.ancestor_left, lens1.index) if self.ancestor_top: axis[3] = -0.02 lens_ancestor_top = self.process_ancestor(self.ancestor_top, lens2.index) base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, self.genome1_name, self.genome2_name, [0, 1]) # Process GFF files gff1, gff2 = base.newgff(self.gff1), base.newgff(self.gff2) gff1 = base.gene_location(gff1, lens1, step1, self.position) gff2 = base.gene_location(gff2, lens2, step2, self.position) if self.ancestor_top: self.ancestor_position(ax, gff2, lens_ancestor_top, 'top') if self.ancestor_left: self.ancestor_position(ax, gff1, lens_ancestor_left, 'left') # Process block info and alignment bkinfo = self.process_blockinfo(lens1,lens2) align = self.alignment(gff1, gff2, bkinfo) alignment = align[gff1.columns[-len(bkinfo[self.classid].drop_duplicates()):]] alignment.to_csv(self.savefile, header=False) # Create scatter plot df = self.pair_position(alignment, gff1['loc'], gff2['loc'], self.colors) plt.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'], alpha=0.5, edgecolors=None, linewidths=0, marker='o') ax.axis(axis) plt.subplots_adjust(left=0.07, right=0.97, top=0.93, bottom=0.03) plt.savefig(self.savefig, dpi=500) plt.show() def process_ancestor(self, ancestor_file, lens_index): df = pd.read_csv(ancestor_file, sep="\t", header=None) df[0] = df[0].astype(str) df[3] = df[3].astype(str) df[4] = df[4].astype(int) df[4] = df[4] / df[4].max() return df[df[0].isin(lens_index)] def process_blockinfo(self, lens1, lens2): bkinfo = pd.read_csv(self.blockinfo, index_col='id') if self.blockinfo_reverse == True: bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']] bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']] bkinfo['chr1'] = bkinfo['chr1'].astype(str) bkinfo['chr2'] = bkinfo['chr2'].astype(str) bkinfo[self.classid] = bkinfo[self.classid].astype(str) return 
bkinfo[bkinfo['chr1'].isin(lens1.index) & (bkinfo['chr2'].isin(lens2.index))] def alignment(self, gff1, gff2, bkinfo): gff1['uid'] = gff1['chr'] + 'g' + gff1['order'].astype(str) gff2['uid'] = gff2['chr'] + 'g' + gff2['order'].astype(str) gff1['id'] = gff1.index gff2['id'] = gff2.index for cl, group in bkinfo.groupby(self.classid): name = f'l{cl}' gff1[name] = '' group = group.sort_values(by=['length'], ascending=True) for _, row in group.iterrows(): block = self.create_block_dataframe(row) if block.empty: continue block1_min, block1_max = block['block1'].agg(['min', 'max']) area = gff1[(gff1['chr'] == row['chr1']) & (gff1['order'] >= block1_min) & (gff1['order'] <= block1_max)].index block['id1'] = (row['chr1'] + 'g' + block['block1'].astype(str)).map( dict(zip(gff1['uid'], gff1.index))) block['id2'] = (row['chr2'] + 'g' + block['block2'].astype(str)).map( dict(zip(gff2['uid'], gff2.index))) gff1.loc[block['id1'].values, name] = block['id2'].values gff1.loc[gff1.index.isin(area) & gff1[name].eq(''), name] = '.' 
return gff1 def create_block_dataframe(self, row): b1, b2, ks = row['block1'].split('_'), row['block2'].split('_'), row['ks'].split('_') ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks)) block = pd.DataFrame(np.array([b1, b2, ks]).T, columns=['block1', 'block2', 'ks']) block['block1'] = block['block1'].astype(int) block['block2'] = block['block2'].astype(int) block['ks'] = block['ks'].astype(float) return block[(block['ks'] <= self.ks_area[1]) & (block['ks'] >= self.ks_area[0])].drop_duplicates(subset=['block1'], keep='first') def ancestor_position(self, ax, gff, lens, mark): for _, row in lens.iterrows(): loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2]))].index loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc'] if mark == 'top': width = abs(loc1-loc2) loc = [min(loc1, loc2), 0] height = -0.02 if mark == 'left': height = abs(loc1-loc2) loc = [-0.02, min(loc1, loc2), ] width = 0.02 base.Rectangle(ax, loc, height, width, row[3], row[4]) ================================================ FILE: build/lib/wgdi/ancestral_karyotype.py ================================================ import pandas as pd from Bio import SeqIO import wgdi.base as base class ancestral_karyotype: def __init__(self, options): self.mark = 'aak' # Set attributes from options for k, v in options: setattr(self, str(k), v) print(f"{k} = {v}") def run(self): # Load and filter data gff = base.newgff(self.gff) ancestor = base.read_classification(self.ancestor) gff = gff[gff['chr'].isin(ancestor[0].values.tolist())] # Create new gff copy and initialize required variables newgff = gff.copy() data, num = [], 1 # Create dictionary mapping chromosome to order chr_arr = ancestor[3].drop_duplicates().to_list() chr_dict = {chr: idx + 1 for idx, chr in enumerate(chr_arr)} ancestor['order'] = ancestor[3].map(chr_dict) dict1, dict2 = {}, {} # Process ancestor and gff information for (cla, order), group in 
ancestor.groupby([4, 'order'], sort=[False, False]): for index, row in group.iterrows(): index1 = gff[(gff['chr'] == row[0]) & (gff['order'] >= row[1]) & (gff['order'] <= row[2])].index newgff.loc[index1, 'chr'] = str(num) # Store results in data for k in index1: data.append(newgff.loc[k, :].values.tolist() + [k]) dict1[str(num)] = cla dict2[str(num)] = group[3].values[0] num += 1 # Create dataframe from the data collected df = pd.DataFrame(data) # Filter based on peptide file pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta")) df = df[df[6].isin(pep.keys())] # Assign new names and order for name, group in df.groupby(0): df.loc[group.index, 'order'] = range(1, len(group) + 1) df.loc[group.index, 'newname'] = [f"{self.mark}{name}g{i:05d}" for i in range(1, len(group) + 1)] # Set data types and sort df['order'] = df['order'].astype(int) df = df[[0, 'newname', 1, 2, 3, 'order', 6]].sort_values(by=[0, 'order']) # Save output files df.to_csv(self.ancestor_gff, sep="\t", index=False, header=None) lens = df.groupby(0).max()[[2, 'order']] lens.to_csv(self.ancestor_lens, sep="\t", header=None) # Add extra columns and save final results lens[1] = 1 lens['color'] = lens.index.map(dict2) lens['class'] = lens.index.map(dict1) lens[[1, 'order', 'color', 'class']].to_csv(self.ancestor_file, sep="\t", header=None) # Update peptide sequences with new IDs and save id_dict = df.set_index(6).to_dict()['newname'] seqs = [] for seq_record in SeqIO.parse(self.pep_file, "fasta"): if seq_record.id in id_dict: seq_record.id = id_dict[seq_record.id] seqs.append(seq_record) SeqIO.write(seqs, self.ancestor_pep, "fasta") ================================================ FILE: build/lib/wgdi/ancestral_karyotype_repertoire.py ================================================ import numpy as np import pandas as pd from Bio import SeqIO import wgdi.base as base class ancestral_karyotype_repertoire(): def __init__(self, options): self.gap = 5 self.direction = 0.01 self.mark = 'aak1s' 
self.blockinfo_reverse = False for k, v in options: setattr(self, str(k), v) print(k, ' = ', v) self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse) def run(self): gff1 = base.newgff(self.gff1) gff2 = base.newgff(self.gff2) bkinfo = pd.read_csv(self.blockinfo, index_col='id') if self.blockinfo_reverse == True: bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']] bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']] for index, row in bkinfo.iterrows(): block1, block2 = row['block1'].split('_'), row['block2'].split('_') block1, block2 = [int(k) for k in block1], [int(k) for k in block2] if int(block1[1])-int(block1[0]) < 0: self.direction = -0.01 for i in range(1, len(block2)): if abs(block1[i]-block1[i-1]) == 1 and abs(block2[i]-block2[i-1]) < int(self.gap): gff1_id = gff1[(gff1['chr'] == str(row['chr1'])) & ( gff1['order'] == block1[i])].index[0] order = gff1.loc[gff1_id, 'order'] gff1_row = gff1.loc[gff1_id, :].copy() for num in range(block2[i-1], block2[i]): order = order + self.direction id = gff2[(gff2['chr'] == str(row['chr2'])) & (gff2['order'] == num)].index[0] gff1_row['order'] = order gff1.loc[id, :] = gff1_row df = gff1.copy() df = df.sort_values(by=['chr', 'order']) for name, group in df.groupby('chr'): df.loc[group.index, 'order'] = list(range(1, len(group)+1)) df.loc[group.index, 'newname'] = list( [str(self.mark)+str(name)+'g'+str(i).zfill(5) for i in range(1, len(group)+1)]) df['order'] = df['order'].astype(int) df['oldname'] = df.index columns = ['chr', 'newname', 'start', 'end', 'strand', 'order', 'oldname'] df[columns].to_csv(self.ancestor_gff, sep="\t", index=False, header=None) lens = df.groupby('chr').max()[['end', 'order']] lens['end'] = lens['end'].astype(np.int64) lens.to_csv(self.ancestor_lens, sep="\t", header=None) ancestor = base.read_classification(self.ancestor) for index, row in ancestor.iterrows(): ancestor.at[index, 1] = 1 ancestor.at[index, 2] = lens.at[str(row[0]),'order'] ancestor.to_csv(self.ancestor_new, 
sep="\t", index=False, header=None) id_dict = df['newname'].to_dict() seqs = [] for seq_record in SeqIO.parse(self.ancestor_pep, "fasta"): if seq_record.id in id_dict: seq_record.id = id_dict[seq_record.id] else: continue seq_record.description = '' seqs.append(seq_record) SeqIO.write(seqs, self.ancestor_pep_new, "fasta") ================================================ FILE: build/lib/wgdi/base.py ================================================ import configparser import hashlib import os import re import matplotlib import matplotlib.patches as mpatches import numpy as np import pandas as pd from Bio import SeqIO import wgdi def gen_md5_id(item): """Generate MD5 hash for the given item.""" return hashlib.md5(item.encode('utf-8')).hexdigest() def config(): """Read configuration from the example conf.ini file.""" conf = configparser.ConfigParser() conf.read(os.path.join(wgdi.__path__[0], 'example/conf.ini')) return conf.items('ini') def load_conf(file, section): """Load configuration items from the specified section.""" conf = configparser.ConfigParser() conf.read(file) return conf.items(section) def rewrite(file, section): """Rewrite the configuration file to keep only the specified section.""" conf = configparser.ConfigParser() conf.read(file) if conf.has_section(section): for k in conf.sections(): if k != section: conf.remove_section(k) conf.write(open(os.path.join(wgdi.__path__[0], 'example/conf.ini'), 'w')) print('Option ini has been modified') else: print('Option ini no change') def read_colinearscan(file): """Read colinearscan output and parse into data structure.""" data, b, flag, num = [], [], 0, 1 with open(file) as f: for line in f: line = line.strip() if re.match(r"the", line): num = re.search(r'\d+', line).group() b = [] flag = 1 continue if re.match(r"\>LOCALE", line): flag = 0 p = re.split(':', line) if b: data.append([num, b, p[1]]) b = [] continue if flag == 1: a = re.split(r"\s", line) b.append(a) if b: data.append([num, b, p[1]]) return data def 
read_mcscanx(fn): """Read mcscanx output and parse into data structure.""" with open(fn) as f1: data, b = [], [] flag, num = 0, 0 for line in f1: line = line.strip() if re.match(r"## Alignment", line): flag = 1 if not b: arr = re.findall(r"[\d+\.]+", line)[0] continue data.append([num, b, 0]) b = [] num = re.findall(r"\d+", line)[0] continue if flag == 0: continue a = re.split(r"\:", line) c = re.split(r"\s+", a[1]) b.append([c[1], c[1], c[2], c[2]]) if b: data.append([num, b, 0]) return data def read_jcvi(fn): """Read jcvi output and parse into data structure.""" with open(fn) as f1: data, b = [], [] num = 1 for line in f1: line = line.strip() if re.match(r"###", line): if b: data.append([num, b, 0]) b = [] num += 1 continue a = re.split(r"\t", line) b.append([a[0], a[0], a[1], a[1]]) if b: data.append([num, b, 0]) return data def read_collinearity(fn): """Read collinearity output and parse into data structure.""" with open(fn) as f1: data, b = [], [] flag, arr = 0, [] for line in f1: line = line.strip() if re.match(r"# Alignment", line): flag = 1 if not b: arr = re.findall(r'[\.\d+]+', line) continue data.append([arr[0], b, arr[2]]) b = [] arr = re.findall(r'[\.\d+]+', line) continue if flag == 0: continue b.append(re.split(r"\s", line)) if b: data.append([arr[0], b, arr[2]]) return data def read_ks(file, col): """Read KS values from file and select specified column.""" ks = pd.read_csv(file, sep='\t') ks.drop_duplicates(subset=['id1', 'id2'], keep='first', inplace=True) ks[col] = ks[col].astype(float) ks = ks[ks[col] >= 0] ks.index = ks['id1'] + ',' + ks['id2'] return ks[col] def get_median(data): """Calculate the median of the data list.""" if not data: return 0 data_sorted = sorted(data) half = len(data_sorted) // 2 return (data_sorted[half] + data_sorted[-(half + 1)]) / 2 def cds_to_pep(cds_file, pep_file, fmt='fasta'): """Translate CDS sequences to peptide sequences and write to file.""" records = list(SeqIO.parse(cds_file, fmt)) for rec in records: rec.seq 
= rec.seq.translate() SeqIO.write(records, pep_file, 'fasta') return True def newblast(file, score, evalue, gene_loc1, gene_loc2, reverse): """Filter BLAST results based on score, evalue, and gene locations.""" blast = pd.read_csv(file, sep="\t", header=None) if str_to_bool(reverse): blast[[0, 1]] = blast[[1, 0]] blast = blast[(blast[11] >= score) & (blast[10] < evalue) & (blast[1] != blast[0])] blast = blast[(blast[0].isin(gene_loc1.index)) & (blast[1].isin(gene_loc2.index))] blast.drop_duplicates(subset=[0, 1], keep='first', inplace=True) blast[0] = blast[0].astype(str) blast[1] = blast[1].astype(str) return blast def newgff(file): """Read GFF file and rename columns with appropriate data types.""" gff = pd.read_csv(file, sep="\t", header=None, index_col=1) gff.rename(columns={0: 'chr', 2: 'start', 3: 'end', 4: 'strand', 5: 'order'}, inplace=True) gff['chr'] = gff['chr'].astype(str) gff['start'] = gff['start'].astype(np.int64) gff['end'] = gff['end'].astype(np.int64) gff['strand'] = gff['strand'].astype(str) gff['order'] = gff['order'].astype(int) return gff def newlens(file, position): """Read lens file and select position based on 'order' or 'end'.""" lens = pd.read_csv(file, sep="\t", header=None, index_col=0) lens.index = lens.index.astype(str) if position == 'order': lens = lens[2] elif position == 'end': lens = lens[1] return lens def read_classification(file): """Read classification data and convert columns to appropriate types.""" classification = pd.read_csv(file, sep="\t", header=None) classification[0] = classification[0].astype(str) classification[1] = classification[1].astype(int) classification[2] = classification[2].astype(int) classification[3] = classification[3].astype(str) classification[4] = classification[4].astype(int) return classification def gene_location(gff, lens, step, position): """Calculate gene locations based on lens and step.""" gff = gff[gff['chr'].isin(lens.index)].copy() if gff.empty: print('Stopped! 
\n\nChromosomes in gff file and lens file do not correspond.') exit(0) dict_chr = dict(zip(lens.index, np.append(np.array([0]), lens.cumsum()[:-1].values))) gff['loc'] = '' for name, group in gff.groupby('chr'): gff.loc[group.index, 'loc'] = (dict_chr[name] + group[position]) * step return gff def dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, genome2_name, arr, pad = 0): """Set up the dotplot frame with grid lines and labels.""" for k in lens1.cumsum()[:-1] * step1: ax.axhline(y=k, alpha=0.8, color='black', lw=0.5) for k in lens2.cumsum()[:-1] * step2: ax.axvline(x=k, alpha=0.8, color='black', lw=0.5) align = dict(family='DejaVu Sans', style='italic', horizontalalignment="center", verticalalignment="center") yticks = lens1.cumsum() * step1 - 0.5 * lens1 * step1 ax.set_yticks(yticks) ax.set_yticklabels(lens1.index, fontsize = 13, family='DejaVu Sans', style='normal') ax.tick_params(axis='y', which='major', pad = pad) ax.tick_params(axis='x', which='major', pad = pad) xticks = lens2.cumsum() * step2 - 0.5 * lens2 * step2 ax.set_xticks(xticks) ax.set_xticklabels(lens2.index, fontsize = 13, family='DejaVu Sans', style='normal') ax.xaxis.set_ticks_position('none') ax.yaxis.set_ticks_position('none') if arr[0] <= 0: ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align) else: ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align) if arr[1] < 0: ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align) else: ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align) def Bezier3(plist, t): """Calculate Bezier curve of degree 3.""" p0, p1, p2 = plist return p0 * (1 - t) ** 2 + 2 * p1 * t * (1 - t) + p2 * t ** 2 def Bezier4(plist, t): """Calculate Bezier curve of degree 4.""" p0, p1, p2, p3, p4 = plist return p0 * (1 - t) ** 4 + 4 * p1 * t * (1 - t) ** 3 + 6 * p2 * t ** 2 * (1 - t) ** 2 + 4 * p3 * (1 - t) * t ** 3 + p4 * t ** 4 def Rectangle(ax, loc, height, 
width, color, alpha): """Draw a rectangle on the axes with specified properties.""" p = mpatches.Rectangle(loc, width, height, edgecolor=None, facecolor=color, alpha=alpha) ax.add_patch(p) def str_to_bool(s): if isinstance(s, bool): return s return str(s).strip().lower() == 'true' ================================================ FILE: build/lib/wgdi/block_correspondence.py ================================================ import re import numpy as np import pandas as pd import wgdi.base as base class block_correspondence(): def __init__(self, options): # Default values self.tandem = True self.pvalue = 0.2 self.position = 'order' self.block_length = 5 self.tandem_length = 200 self.tandem_ratio = 1 self.ks_hit = 0.5 # Set user-defined options for k, v in options: setattr(self, str(k), v) print(k, ' = ', v) # Parse ks_area and homo if present self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')] self.homo = [float(k) for k in self.homo.split(',')] self.tandem_ratio = float(self.tandem_ratio) self.tandem = base.str_to_bool(self.tandem) def run(self): lens1 = base.newlens(self.lens1, self.position) lens2 = base.newlens(self.lens2, self.position) # Load block information from CSV bkinfo = pd.read_csv(self.blockinfo) bkinfo = self.preprocess_blockinfo(bkinfo, lens1, lens2) # Initialize correspondence DataFrame cor = self.initialize_correspondence(lens1, lens2) # If no tandem allowed, remove tandem regions if not self.tandem: bkinfo = self.remove_tandem(bkinfo) # Remove low KS hits bkinfo = self.remove_ks_hit(bkinfo) # Find collinearity regions and save results collinear_indices = self.collinearity_region(cor, bkinfo, lens1) bkinfo.loc[bkinfo.index.isin(collinear_indices), :].to_csv(self.savefile, index=False) def preprocess_blockinfo(self, bkinfo, lens1, lens2): bkinfo['chr1'] = bkinfo['chr1'].astype(str) bkinfo['chr2'] = bkinfo['chr2'].astype(str) # Filter by length, chromosome indices, and p-value bkinfo = bkinfo[(bkinfo['length'] >= 
int(self.block_length)) & (bkinfo['chr1'].isin(lens1.index)) & (bkinfo['chr2'].isin(lens2.index)) & (bkinfo['pvalue'] <= float(self.pvalue))] # Filter by tandem ratio if the column exists if 'tandem_ratio' in bkinfo.columns: bkinfo = bkinfo[bkinfo['tandem_ratio'] <= self.tandem_ratio] return bkinfo def initialize_correspondence(self, lens1, lens2): # Create correspondence DataFrame with initial values cor = [[k, i, 0, lens1[i], j, 0, lens2[j], float(self.homo[0]), float(self.homo[1])] for k in range(1, int(self.multiple) + 1) for i in lens1.index for j in lens2.index] cor = pd.DataFrame(cor, columns=['sub', 'chr1', 'start1', 'end1', 'chr2', 'start2', 'end2', 'homo1', 'homo2']) cor['chr1'] = cor['chr1'].astype(str) cor['chr2'] = cor['chr2'].astype(str) return cor def remove_tandem(self, bkinfo): # Remove tandem regions from the DataFrame group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy() group['start'] = group['start1'] - group['start2'] group['end'] = group['end1'] - group['end2'] tandem_condition = (group['start'].abs() <= int(self.tandem_length)) | (group['end'].abs() <= int(self.tandem_length)) index_to_remove = group[tandem_condition].index return bkinfo.drop(index_to_remove) def remove_ks_hit(self, bkinfo): # Remove records with insufficient KS hits for index, row in bkinfo.iterrows(): ks = self.get_ks_value(row['ks']) ks_ratio = len([k for k in ks if self.ks_area[0] <= k <= self.ks_area[1]]) / len(ks) if ks_ratio < self.ks_hit: bkinfo.drop(index, inplace=True) return bkinfo def get_ks_value(self, ks_str): # Extract and return KS values as floats ks = ks_str.split('_') ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks)) return ks def collinearity_region(self, cor, bkinfo, lens): collinear_indices = [] for (chr1, chr2), group in bkinfo.groupby(['chr1', 'chr2']): group = group.sort_values(by=['length'], ascending=False) df = pd.Series(0, index=range(1, int(lens[str(chr1)]) + 1)) for index, row in group.iterrows(): # Check homology 
conditions if not self.is_valid_homo(row): continue # Update the block series and compute ratio b1 = [int(k) for k in row['block1'].split('_')] df1 = df.copy() df1[b1] += 1 ratio = (len(df1[df1 > 0]) - len(df[df > 0])) / len(b1) if ratio < 0.5: continue df[b1] += 1 collinear_indices.append(index) return collinear_indices def is_valid_homo(self, row): # Check if the homology values are within the specified range return self.homo[0] <= row['homo' + self.multiple] <= self.homo[1] ================================================ FILE: build/lib/wgdi/block_info.py ================================================ import numpy as np import pandas as pd import wgdi.base as base class block_info: def __init__(self, options): self.repeat_number = 20 self.ks_col = 'ks_NG86' self.blast_reverse = False for k, v in options: setattr(self, str(k), v) print(f"{k} = {v}") self.repeat_number = int(self.repeat_number) self.blast_reverse = base.str_to_bool(self.blast_reverse) def block_position(self, collinearity, blast, gff1, gff2, ks): data = [] for block in collinearity: blk_homo, blk_ks = [], [] # Skip blocks with missing gene coordinates in GFF files if block[1][0][0] not in gff1.index or block[1][0][2] not in gff2.index: continue # Extract chromosome info chr1, chr2 = gff1.at[block[1][0][0], 'chr'], gff2.at[block[1][0][2], 'chr'] # Extract start and end positions array1, array2 = [float(i[1]) for i in block[1]], [float(i[3]) for i in block[1]] start1, end1 = array1[0], array1[-1] start2, end2 = array2[0], array2[-1] block1, block2 = [], [] for k in block[1]: block1.append(int(float(k[1]))) block2.append(int(float(k[3]))) # Check for KS values pair_ks = self.get_ks_value(ks, k) blk_ks.append(pair_ks) # Retrieve blast homo data if k[0]+","+k[2] in blast.index: blk_homo.append(blast.loc[k[0]+","+k[2], [f'homo{i}' for i in range(1, 6)]].values.tolist()) ks_median, ks_average = self.calculate_ks_statistics(blk_ks) homo = self.calculate_homo_statistics(blk_homo) blkks = 
'_'.join([str(k) for k in blk_ks]) block1 = '_'.join([str(k) for k in block1]) block2 = '_'.join([str(k) for k in block2]) # Calculate tandem ratio tandem_ratio = self.tandem_ratio(blast, gff2, block[1]) # Store the results data.append([ block[0], chr1, chr2, start1, end1, start2, end2, block[2], len(block[1]), ks_median, ks_average, *homo, block1, block2, blkks, tandem_ratio ]) # Create a DataFrame with the results data_df = pd.DataFrame(data, columns=[ 'id', 'chr1', 'chr2', 'start1', 'end1', 'start2', 'end2', 'pvalue', 'length', 'ks_median', 'ks_average', 'homo1', 'homo2', 'homo3', 'homo4', 'homo5', 'block1', 'block2', 'ks', 'tandem_ratio' ]) # Calculate density data_df['density1'] = data_df['length'] / ((data_df['end1'] - data_df['start1']).abs() + 1) data_df['density2'] = data_df['length'] / ((data_df['end2'] - data_df['start2']).abs() + 1) return data_df def get_ks_value(self, ks, k): """Return KS value for the given pair of genes.""" pair = f"{k[0]},{k[2]}" if pair in ks.index: return ks[pair] pair_rev = f"{k[2]},{k[0]}" if pair_rev in ks.index: return ks[pair_rev] return -1 def calculate_ks_statistics(self, blk_ks): """Calculate KS statistics: median and average.""" ks_arr = [k for k in blk_ks if k >= 0] if len(ks_arr) == 0: return -1, -1 ks_median = base.get_median(ks_arr) ks_average = sum(ks_arr) / len(ks_arr) return ks_median, ks_average def calculate_homo_statistics(self, blk_homo): """Calculate homo statistics by averaging across all blocks.""" df = pd.DataFrame(blk_homo) homo = df.mean().values if len(df) > 0 else [-1, -1, -1, -1, -1] return homo def blast_homo(self, blast, gff1, gff2, repeat_number): """Assign homo values based on blast data.""" index = [group.sort_values(by=11, ascending=False)[:repeat_number].index.tolist() for name, group in blast.groupby([0])] blast = blast.loc[np.concatenate([k[:repeat_number] for k in index], dtype=object), [0, 1]] blast = blast.assign(homo1=np.nan, homo2=np.nan, homo3=np.nan, homo4=np.nan, homo5=np.nan) # 
Assign homo values for i in range(1, 6): bluenum = i + 5 redindex = np.concatenate([k[:i] for k in index], dtype=object) blueindex = np.concatenate([k[i:bluenum] for k in index], dtype=object) grayindex = np.concatenate([k[bluenum:repeat_number] for k in index], dtype=object) blast.loc[redindex, f'homo{i}'] = 1 blast.loc[blueindex, f'homo{i}'] = 0 blast.loc[grayindex, f'homo{i}'] = -1 blast['chr1_order'] = blast[0].map(gff1['order']) blast['chr2_order'] = blast[1].map(gff2['order']) return blast def tandem_ratio(self, blast, gff2, block): """Calculate tandem ratio for a block.""" block = pd.DataFrame(block)[[0, 2]].rename(columns={0: 'id1', 2: 'id2'}) block['order2'] = block['id2'].map(gff2['order']) # Filter block_blast data block_blast = blast[(blast[0].isin(block['id1'].values)) & (blast[1].isin(block['id2'].values))].copy() block_blast = pd.merge(block_blast, block, left_on=0, right_on='id1', how='left') block_blast['difference'] = (block_blast['chr2_order'] - block_blast['order2']).abs() # Filter based on difference and calculate ratio block_blast = block_blast[(block_blast['difference'] <= self.repeat_number) & (block_blast['difference'] > 0)] return len(block_blast[0].unique()) / len(block) * len(block_blast) / (len(block) + len(block_blast)) def run(self): """Main function to run the analysis.""" # Initialize required datasets lens1 = base.newlens(self.lens1, self.position) lens2 = base.newlens(self.lens2, self.position) gff1 = base.newgff(self.gff1) gff2 = base.newgff(self.gff2) # Filter GFF files based on chromosome indices gff1 = gff1[gff1['chr'].isin(lens1.index)] gff2 = gff2[gff2['chr'].isin(lens2.index)] # Load blast data blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse) blast = self.blast_homo(blast, gff1, gff2, self.repeat_number) blast.index = blast[0] + ',' + blast[1] # Get collinearity data collinearity = self.auto_file(gff1, gff2) # Load ks data if necessary (check hasattr first so a missing 'ks' option does not raise AttributeError) ks = pd.Series([], dtype=float) if not hasattr(self, 'ks') or
self.ks in ('none', '') else base.read_ks(self.ks, self.ks_col) # Get the block position data data = self.block_position(collinearity, blast, gff1, gff2, ks) data['class1'] = 0 data['class2'] = 0 # Save results data.to_csv(self.savefile, index=False) def auto_file(self, gff1, gff2): """Auto-detect and read collinearity file.""" with open(self.collinearity) as f: p = ' '.join(f.readlines()[0:30]) # Handle different file formats if 'path length' in p or 'MAXIMUM GAP' in p: return base.read_colinearscan(self.collinearity) elif 'MATCH_SIZE' in p or '## Alignment' in p: return self.process_mcscanx(gff1, gff2) elif '# Alignment' in p: return base.read_collinearity(self.collinearity) elif '###' in p: return self.process_jcvi(gff1, gff2) def process_mcscanx(self, gff1, gff2): """Process MCScanX format collinearity data.""" col = base.read_mcscanx(self.collinearity) collinearity = [] for block in col: newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index] if newblock: for k in newblock: k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order'] collinearity.append([block[0], newblock, block[2]]) return collinearity def process_jcvi(self, gff1, gff2): """Process JCVI format collinearity data.""" col = base.read_jcvi(self.collinearity) collinearity = [] for block in col: newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index] if newblock: for k in newblock: k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order'] collinearity.append([block[0], newblock, block[2]]) return collinearity ================================================ FILE: build/lib/wgdi/block_ks.py ================================================ import re import matplotlib.pyplot as plt import numpy as np import pandas as pd import wgdi.base as base class block_ks: def __init__(self, options): # Default parameters self.markersize = 0.8 self.figsize = 'default' self.tandem_length = 200 self.blockinfo_reverse = False self.tandem = False
self.area = '0,3' self.position = 'order' self.ks_col = 'ks_NG86' self.pvalue = 0.01 # Overriding default parameters with options for k, v in options: setattr(self, str(k), v) print(f"{k} = {v}") # Parsing area as a float list self.area = [float(k) for k in str(self.area).split(',')] self.markersize = float(self.markersize) self.tandem_length = int(self.tandem_length) self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse) # Convert the 'tandem' option; assigning to self.remove_tandem would shadow the method of that name self.tandem = base.str_to_bool(self.tandem) def block_position(self, bkinfo, lens1, lens2, step1, step2): pos, pairs = [], [] # Create mappings for chromosome positions dict_y_chr = dict(zip(lens1.index, np.append([0], lens1.cumsum()[:-1].values))) dict_x_chr = dict(zip(lens2.index, np.append([0], lens2.cumsum()[:-1].values))) # Iterate through block information for _, row in bkinfo.iterrows(): block1 = row['block1'].split('_') block2 = row['block2'].split('_') ks = row['ks'].split('_') locy_median = (dict_y_chr[row['chr1']] + 0.5 * (row['end1'] + row['start1'])) * step1 locx_median = (dict_x_chr[row['chr2']] + 0.5 * (row['end2'] + row['start2'])) * step2 pos.append([locx_median, locy_median, row['ks_median']]) # Ensure ks length matches block length if len(block1) != len(ks): ks = ks[1:] for i in range(len(block1)): locy = (dict_y_chr[row['chr1']] + float(block1[i])) * step1 locx = (dict_x_chr[row['chr2']] + float(block2[i])) * step2 pairs.append([locx, locy, float(ks[i])]) return pos, pairs def remove_tandem(self, bkinfo): # Filter for same-chromosome blocks group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy() # Calculate block start and end differences group['start'] = group['start1'] - group['start2'] group['end'] = group['end1'] - group['end2'] # Remove tandems based on threshold index = group[(group['start'].abs() <= self.tandem_length) | (group['end'].abs() <= self.tandem_length)].index return bkinfo.drop(index) def run(self): # Initialize axis and chromosome lens axis = [0, 1, 1, 0] lens1 =
base.newlens(self.lens1, self.position) lens2 = base.newlens(self.lens2, self.position) # Parse figsize if re.search(r'\d', self.figsize): self.figsize = [float(k) for k in self.figsize.split(',')] else: self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10 # Calculate step sizes step1 = 1 / float(lens1.sum()) step2 = 1 / float(lens2.sum()) # Create figure and axes fig, ax = plt.subplots(figsize=self.figsize) plt.rcParams['ytick.major.pad'] = 0 ax.xaxis.set_ticks_position('top') # Plot dotplot frame base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, self.genome1_name, self.genome2_name, [0, 1]) # Load block information bkinfo = pd.read_csv(self.blockinfo) # Handle reverse block information if self.blockinfo_reverse == True: bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']] bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']] # Filter block information bkinfo['chr1'] = bkinfo['chr1'].astype(str) bkinfo['chr2'] = bkinfo['chr2'].astype(str) bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & (bkinfo['chr1'].isin(lens1.index)) & (bkinfo['chr2'].isin(lens2.index)) & (bkinfo['pvalue'] < float(self.pvalue))] # Remove tandem duplicates if required if self.tandem == False: bkinfo = self.remove_tandem(bkinfo) # Calculate positions and pairs pos, pairs = self.block_position(bkinfo, lens1, lens2, step1, step2) # Filter pairs by ks value df = pd.DataFrame(pairs, columns=['loc1', 'loc2', 'ks']) df = df[(df['ks'] >= self.area[0]) & (df['ks'] <= self.area[1])] df.drop_duplicates(inplace=True) # Plot scatter (plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib releases no longer provide) cm = plt.get_cmap('gist_rainbow') sc = plt.scatter(df['loc1'], df['loc2'], s=self.markersize, c=df['ks'], alpha=0.9, edgecolors=None, linewidths=0, marker='o', vmin=self.area[0], vmax=self.area[1], cmap=cm) # Add colorbar cbar = fig.colorbar(sc, shrink=0.5, pad=0.03, fraction=0.1) align = dict(family='DejaVu Sans', style='normal', horizontalalignment="center", verticalalignment="center") cbar.set_label('Ks', labelpad=12.5,
fontsize=16, **align) # Set axis and save figure ax.axis(axis) plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.03) plt.savefig(self.savefig, dpi=500) plt.show() ================================================ FILE: build/lib/wgdi/circos.py ================================================ import re import sys import matplotlib as mpl import matplotlib.patches as mpatches import matplotlib.pyplot as plt import numpy as np import pandas as pd import wgdi.base as base class circos(): def __init__(self, options): self.figsize = '10,10' self.position = 'order' self.label_size = 9 self.label_radius = 0.015 self.column_names = [None]*100 for k, v in options: setattr(self, str(k), v) print(k, ' = ', v) self.figsize = [float(k) for k in self.figsize.split(',')] self.ring_width = float(self.ring_width) if hasattr(self, 'legend_square'): self.legend_square = [float(k) for k in self.legend_square.split(',')] else: self.legend_square = 0.04, 0.04 def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, linestyle='-'): for k in loc_chr: start, end = loc_chr[k] t = np.arange(start, end, 0.005) x, y = (radius) * np.cos(t), (radius) * np.sin(t) plt.plot(x, y, linestyle=linestyle, color=color, lw=lw, alpha=alpha) def plot_labels(self, root, labels, loc_chr, radius, horizontalalignment="center", verticalalignment="center", fontsize=6, color='black'): for k in loc_chr: loc = sum(loc_chr[k]) * 0.5 x, y = radius * np.cos(loc), radius * np.sin(loc) self.Wedge(root, (x, y), self.label_radius, 0, 360, self.label_radius, 'white', 1) if 1 * np.pi < loc < 2 * np.pi: loc += np.pi plt.text(x, y, labels[k], horizontalalignment=horizontalalignment, verticalalignment=verticalalignment, fontsize=fontsize, color=color, rotation=0) def Wedge(self, ax, loc, radius, start, end, width, color, alpha): p = mpatches.Wedge(loc, radius, start, end, width=width, edgecolor=None, facecolor=color, alpha=alpha) ax.add_patch(p) def plot_bar(self, df, radius, length, lw, color, alpha): for 
k in df[df.columns[0]].drop_duplicates().values: if str(k) not in color.keys(): color[str(k)] = 'black' if k in ['', np.nan]: continue df_chr = df.groupby(df.columns[0]).get_group(k) x1, y1 = radius * \ np.cos(df_chr['rad']), radius * np.sin(df_chr['rad']) x2, y2 = (radius + length) * \ np.cos(df_chr['rad']), (radius + length) * \ np.sin(df_chr['rad']) x = np.array( [x1.values, x2.values, [np.nan] * x1.size]).flatten('F') y = np.array( [y1.values, y2.values, [np.nan] * x1.size]).flatten('F') plt.plot(x, y, linestyle='-', color=color[str(k)], lw=lw, alpha=alpha) def chr_location(self, lens, angle_gap, angle): start, end, loc_chr = 0, 0.2*angle_gap, {} for k in lens.index: end += angle_gap + angle * (float(lens[k])) start = end - angle * (float(lens[k])) loc_chr[k] = [float(start), float(end)] return loc_chr def deal_alignment(self, alignment, gff, lens, loc_chr, angle): alignment.replace(r'\s+', '', inplace=True) alignment.replace('.', '', inplace=True) newalignment = alignment.copy() for i in range(len(alignment.columns)): alignment[i] = alignment[i].astype(str) newalignment[i] = alignment[i].map(gff['chr'].to_dict()) newalignment['loc'] = alignment[0].map(gff[self.position].to_dict()) newalignment[0] = newalignment[0].astype('str') newalignment['loc'] = newalignment['loc'].astype('float') newalignment = newalignment[newalignment[0].isin(lens.index) == True] newalignment['rad'] = np.nan for name, group in newalignment.groupby(0): if str(name) not in loc_chr: continue newalignment.loc[group.index, 'rad'] = loc_chr[str( name)][0]+angle * group['loc'] return newalignment def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al): alignment.replace(r'\s+', '', inplace=True) alignment.replace('.', np.nan, inplace=True) newalignment = pd.merge(alignment, gff, left_on=0, right_on=gff.index) newalignment['rad'] = np.nan for name, group in
newalignment.groupby('chr'): if str(name) not in loc_chr: continue newalignment.loc[group.index, 'rad'] = loc_chr[str( name)][0]+angle * group[self.position] newalignment.index = newalignment[0] newalignment[0] = newalignment[0].map(newalignment['rad'].to_dict()) data = [] for index_al, row_al in al.iterrows(): for k in alignment.columns[1:]: alignment[k] = alignment[k].astype(str) group = newalignment[(newalignment['chr'] == row_al['chr']) & ( newalignment['order'] >= row_al['start']) & (newalignment['order'] <= row_al['end'])].copy() group.loc[:, k] = group.loc[:, k].map( newalignment['rad']).values group.dropna(subset=[k], inplace=True) group.index = group.index.map(newalignment['rad'].to_dict()) group['color'] = row_al['color'] group = group[group[k].notnull()] data += group[[0, k, 'color']].values.tolist() df = pd.DataFrame(data, columns=['loc1', 'loc2', 'color']) return df def plot_collinearity(self, data, radius, lw=0.02, alpha=1): for name, group in data.groupby('color'): x, y = np.array([]), np.array([]) for index, row in group.iterrows(): ex1x, ex1y = radius * \ np.cos(row['loc1']), radius*np.sin(row['loc1']) ex2x, ex2y = radius * \ np.cos(row['loc2']), radius*np.sin(row['loc2']) ex3x, ex3y = radius * (1-abs(row['loc1']-row['loc2'])/np.pi) * np.cos((row['loc1']+row['loc2'])*0.5), radius * ( 1-abs(row['loc1']-row['loc2'])/np.pi) * np.sin((row['loc1']+row['loc2'])*0.5) x1 = [ex1x, 0.5*ex3x, ex2x] y1 = [ex1y, 0.5*ex3y, ex2y] step = .002 t = np.arange(0, 1+step, step) xt = base.Bezier3(x1, t) yt = base.Bezier3(y1, t) x = np.hstack((x, xt, np.nan)) y = np.hstack((y, yt, np.nan)) plt.plot(x, y, color=name, lw=lw, alpha=alpha) def plot_legend(self, ax, chr_color, width, height): (x1, x2) = ax.get_xlim() (y1, y2) = ax.get_ylim() a = 1000 for k, v in enumerate(chr_color.keys(), 0): h = y1-k//a*height*2 k = k % a if x1 + width * k > x2-width: a = k h = y1-k//a*height*2 k = k % a loc = [x1 + width * k, h] base.Rectangle(ax, loc, height, width, chr_color[v], 1) 
plt.text(loc[0] + width*0.382, h-0.618*height, v, fontsize=12) ax.set_ylim(h-2*height, y2) def run(self): fig, ax = plt.subplots(figsize=self.figsize) mpl.rcParams['agg.path.chunksize'] = 100000000 lens = base.newlens(self.lens, self.position) radius, angle_gap = float(self.radius), float(self.angle_gap) angle = (2 * np.pi - (int(len(lens))+1.5) * angle_gap) / (int(lens.sum())) loc_chr = self.chr_location(lens, angle_gap, angle) list_colors = [str(k).strip() for k in re.split(',|:', self.colors)] chr_color = dict(zip(list_colors[::2], list_colors[1::2])) gff = base.newgff(self.gff) if hasattr(self, 'ancestor'): ancestor = pd.read_csv(self.ancestor, header=None) al = pd.read_csv(self.ancestor_location, sep='\t', header=None) al.rename(columns={0: 'chr', 1: 'start', 2: 'end', 3: 'color'}, inplace=True) al['chr'] = al['chr'].astype(str) data = self.deal_ancestor(ancestor, gff, lens, loc_chr, angle, al) self.plot_collinearity(data, radius, lw=0.1, alpha=0.8) if hasattr(self, 'alignment'): alignment = pd.read_csv(self.alignment, header=None) print(alignment) newalignment = self.deal_alignment( alignment, gff, lens, loc_chr, angle) if ',' in self.column_names: names = [str(k) for k in self.column_names.split(',')] else: names = [None]*len(newalignment.columns) n = 0 align = dict(family='Arial', verticalalignment="center", horizontalalignment="center") print(newalignment) for k, v in enumerate(newalignment.columns[1:-2]): r = radius + self.ring_width*(k+1) print(k,v,r) self.plot_circle(loc_chr, r, lw=0.5, alpha=1, color='grey') self.plot_bar(newalignment[[v, 'rad']], r + self.ring_width * 0.15, self.ring_width*0.7, 0.15, chr_color, 1) if n % 2 == 0: loc = 0.05 x, y = (r+self.ring_width*0.5) * \ np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc) plt.text(x, y, names[n], rotation=loc * 180 / np.pi, fontsize=self.label_size, **align) else: loc = -0.08 x, y = (r+self.ring_width*0.5) * \ np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc) plt.text(x, y, names[n], 
fontsize=self.label_size, rotation=loc * 180 / np.pi, **align) n += 1 if hasattr(self, 'ancestor'): colors = al['color'].drop_duplicates().values.tolist() ancestor_chr_color = dict(zip(range(1, len(colors)+1), colors)) self.plot_legend(ax, ancestor_chr_color, self.legend_square[0], self.legend_square[1]) if hasattr(self, 'alignment'): chr_color.pop('nan', None) self.plot_legend( ax, chr_color, self.legend_square[0], self.legend_square[1]) labels = self.chr_label + lens.index labels = dict(zip(lens.index, labels)) self.plot_labels(ax, labels, loc_chr, radius + self.ring_width*0.3, fontsize=self.label_size) plt.axis('off') a = (ax.get_ylim()[1]-ax.get_ylim()[0]) / \ (ax.get_xlim()[1]-ax.get_xlim()[0]) fig.set_size_inches(self.figsize[0], self.figsize[0]*a, forward=True) plt.savefig(self.savefig, dpi=500) plt.show() sys.exit(0) ================================================ FILE: build/lib/wgdi/collinearity.py ================================================ import numpy as np import pandas as pd class collinearity: def __init__(self, options, points): # Default values self.gap_penalty = -1 self.over_length = 0 self.mg1 = 40 self.mg2 = 40 self.pvalue = 1 self.over_gap = 3 self.points = points self.p_value = 0 self.coverage_ratio = 0.8 # Set user-defined options for k, v in options: setattr(self, str(k), v) # Initialize grading and mg values self.grading = [50, 40, 25] if not hasattr(self, 'grading') else [int(k) for k in self.grading.split(',')] self.mg1, self.mg2 = [40, 40] if not hasattr(self, 'mg') else [int(k) for k in self.mg.split(',')] # Convert string values to floats self.pvalue = float(self.pvalue) self.coverage_ratio = float(self.coverage_ratio) def get_matrix(self): """Initialize the matrix for the collinearity points.""" self.points['usedtimes1'] = 0 self.points['usedtimes2'] = 0 self.points['times'] = 1 self.points['score1'] = self.points['grading'] self.points['score2'] = self.points['grading'] self.points['path1'] =
self.points.index.to_numpy().reshape(len(self.points), 1).tolist() self.points['path2'] = self.points['path1'] self.points_init = self.points.copy() self.mat_points = self.points def run(self): """Run the main collinearity processing.""" self.get_matrix() self.score_matrix() data = [] # Process points for maxPath in the positive direction points1 = self.points[['loc1', 'loc2', 'score1', 'path1', 'usedtimes1']].sort_values(by=['score1'], ascending=False) points1.drop(index=points1[points1['usedtimes1'] < 1].index, inplace=True) points1.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes'] while (self.over_length >= self.over_gap or len(points1) >= self.over_gap): if self.max_path(points1): if self.p_value > self.pvalue: continue data.append([self.path, self.p_value, self.score]) # Process points for maxPath in the negative direction points2 = self.points[['loc1', 'loc2', 'score2', 'path2', 'usedtimes2']].sort_values(by=['score2'], ascending=False) points2.drop(index=points2[points2['usedtimes2'] < 1].index, inplace=True) points2.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes'] while (self.over_length >= self.over_gap) or (len(points2) >= self.over_gap): if self.max_path(points2): if self.p_value > self.pvalue: continue data.append([self.path, self.p_value, self.score]) return data def score_matrix(self): """Calculate the scoring matrix for the points.""" for index, row, col in self.points[['loc1', 'loc2']].itertuples(): # Get points within a certain range points = self.points[(self.points['loc1'] > row) & (self.points['loc2'] > col) & (self.points['loc1'] < row + self.mg1) & (self.points['loc2'] < col + self.mg2)] row_i_old, gap = row, self.mg2 for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples(): if col_j - col > gap and row_i > row_i_old: break score = grading + (row_i - row + col_j - col) * self.gap_penalty score1 = score + self.points.at[index, 'score1'] if score > 0 and self.points.at[index_ij, 'score1'] < score1: 
self.points.at[index_ij, 'score1'] = score1 self.points.at[index, 'usedtimes1'] += 1 self.points.at[index_ij, 'usedtimes1'] += 1 self.points.at[index_ij, 'path1'] = self.points.at[index, 'path1'] + [index_ij] gap = min(col_j - col, gap) row_i_old = row_i # Reverse processing to handle negative direction points_reverse = self.points.sort_values(by=['loc1', 'loc2'], ascending=[False, True]) for index, row, col in points_reverse[['loc1', 'loc2']].itertuples(): points = points_reverse[(points_reverse['loc1'] < row) & (points_reverse['loc2'] > col) & (points_reverse['loc1'] > row - self.mg1) & (points_reverse['loc2'] < col + self.mg2)] row_i_old, gap = row, self.mg2 for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples(): if col_j - col > gap and row_i < row_i_old: break score = grading + (row - row_i + col_j - col) * self.gap_penalty score2 = score + self.points.at[index, 'score2'] if score > 0 and self.points.at[index_ij, 'score2'] < score2: self.points.at[index_ij, 'score2'] = score2 self.points.at[index, 'usedtimes2'] += 1 self.points.at[index_ij, 'usedtimes2'] += 1 self.points.at[index_ij, 'path2'] = self.points.at[index, 'path2'] + [index_ij] gap = min(col_j - col, gap) row_i_old = row_i def max_path(self, points): """Find the maximum path for the given points.""" if len(points) == 0: self.over_length = 0 return False # Initialize path score and index self.score, self.path_index = points.loc[points.index[0], ['score', 'path']] self.path = points[points.index.isin(self.path_index)] self.over_length = len(self.path_index) # Check if the block overlaps with other blocks if self.over_length >= self.over_gap and len(self.path) / self.over_length > self.coverage_ratio: points.drop(index=self.path.index, inplace=True) [loc1_min, loc2_min], [loc1_max, loc2_max] = self.path[['loc1', 'loc2']].agg(['min', 'max']).to_numpy() # Calculate p-value gap_init = self.points_init[(loc1_min <= self.points_init['loc1']) & (self.points_init['loc1'] <= 
loc1_max) & (loc2_min <= self.points_init['loc2']) & (self.points_init['loc2'] <= loc2_max)].copy() self.p_value = self.p_value_estimated(gap_init, loc1_max - loc1_min + 1, loc2_max - loc2_min + 1) self.path = self.path.sort_values(by=['loc1'], ascending=[True])[['loc1', 'loc2']] return True else: points.drop(index=points.index[0], inplace=True) return False def p_value_estimated(self, gap, L1, L2): """Estimate p-value based on the given gap and lengths.""" N1 = gap['times'].sum() N = len(gap) self.points_init.loc[gap.index, 'times'] += 1 m = len(self.path) a = (1 - self.score / m / self.grading[0]) * (N1 - m + 1) / N * (L1 - m + 1) * (L2 - m + 1) / L1 / L2 return round(a, 4) ================================================ FILE: build/lib/wgdi/dotplot.py ================================================ import re import matplotlib.pyplot as plt import numpy as np import pandas as pd import wgdi.base as base class dotplot(): def __init__(self, options): self.multiple = 1 self.score = 100 self.evalue = 1e-5 self.repeat_number = 20 self.markersize = 0.5 self.figsize = 'default' self.position = 'order' self.ancestor_top = None self.ancestor_left = None self.blast_reverse = False for k, v in options: setattr(self, str(k), v) print(k, ' = ', v) if self.ancestor_top == 'none' or self.ancestor_top == '': self.ancestor_top = None if self.ancestor_left == 'none' or self.ancestor_left == '': self.ancestor_left = None # Store the converted value; the bare call discarded the result self.blast_reverse = base.str_to_bool(self.blast_reverse) def pair_positon(self, blast, gff1, gff2, rednum, repeat_number): blast['color'] = '' blast['loc1'] = blast[0].map(gff1['loc']) blast['loc2'] = blast[1].map(gff2['loc']) bluenum = 5+rednum index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist() for name, group in blast.groupby([0])] reddata = np.array([k[:rednum] for k in index], dtype=object) bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object) graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object) if
len(reddata): redindex = np.concatenate(reddata) else: redindex = [] if len(bluedata): blueindex = np.concatenate(bluedata) else: blueindex = [] if len(graydata): grayindex = np.concatenate(graydata) else: grayindex = [] blast.loc[redindex, 'color'] = 'red' blast.loc[blueindex, 'color'] = 'blue' blast.loc[grayindex, 'color'] = 'gray' return blast[blast['color'].str.contains(r'\w')] def run(self): axis = [0, 1, 1, 0] left, right, top, bottom = 0.07, 0.97, 0.93, 0.03 lens1 = base.newlens(self.lens1, self.position) lens2 = base.newlens(self.lens2, self.position) step1 = 1 / float(lens1.sum()) step2 = 1 / float(lens2.sum()) if self.ancestor_left != None: axis[0] = -0.02 lens_ancestor_left = pd.read_csv( self.ancestor_left, sep="\t", header=None) lens_ancestor_left[0] = lens_ancestor_left[0].astype(str) lens_ancestor_left[3] = lens_ancestor_left[3].astype(str) lens_ancestor_left[4] = lens_ancestor_left[4].astype(int) lens_ancestor_left[4] = lens_ancestor_left[4] / lens_ancestor_left[4].max() lens_ancestor_left = lens_ancestor_left[lens_ancestor_left[0].isin( lens1.index)] if self.ancestor_top != None: axis[3] = -0.02 lens_ancestor_top = pd.read_csv( self.ancestor_top, sep="\t", header=None) lens_ancestor_top[0] = lens_ancestor_top[0].astype(str) lens_ancestor_top[3] = lens_ancestor_top[3].astype(str) lens_ancestor_top[4] = lens_ancestor_top[4].astype(int) lens_ancestor_top[4] = lens_ancestor_top[4] / lens_ancestor_top[4].max() lens_ancestor_top = lens_ancestor_top[lens_ancestor_top[0].isin( lens2.index)] if re.search(r'\d', self.figsize): self.figsize = [float(k) for k in self.figsize.split(',')] else: self.figsize = np.array( [1, float(lens1.sum())/float(lens2.sum())])*10 plt.rcParams['ytick.major.pad'] = 0 fig, ax = plt.subplots(figsize=self.figsize) ax.xaxis.set_ticks_position('top') base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, self.genome1_name, self.genome2_name, [axis[0], axis[3]]) gff1 = base.newgff(self.gff1) gff2 = base.newgff(self.gff2) gff1 = 
base.gene_location(gff1, lens1, step1, self.position) gff2 = base.gene_location(gff2, lens2, step2, self.position) if self.ancestor_top != None: top = top self.aree_left = self.ancestor_posion(ax, gff2, lens_ancestor_top, 'top') if self.ancestor_left != None: left = left self.aree_top = self.ancestor_posion(ax, gff1, lens_ancestor_left, 'left') print('read gffs') blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse) if len(blast) ==0: print('Stopped! \n\nThe gene IDs in the blast file do not correspond to gff1 and gff2.') exit(0) print('read blast') df = self.pair_positon(blast, gff1, gff2, int(self.multiple), int(self.repeat_number)) print('deal blast') ax.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'], alpha=0.5, edgecolors=None, linewidths=0, marker='o') ax.axis(axis) plt.subplots_adjust(left=left, right=right, top=top, bottom=bottom) plt.savefig(self.savefig, dpi=300) plt.show() def ancestor_posion(self, ax, gff, lens, mark): data = [] for index, row in lens.iterrows(): loc1 = gff[(gff['chr'] == row[0]) & ( gff['order'] == int(row[1]))].index loc2 = gff[(gff['chr'] == row[0]) & ( gff['order'] == int(row[2])-1)].index loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc'] if mark == 'top': width = abs(loc1-loc2) loc = [min(loc1, loc2), 0] height = -0.02 base.Rectangle(ax, loc, height, width, row[3], row[4]) if mark == 'left': height = abs(loc1-loc2) loc = [-0.02, min(loc1, loc2), ] width = 0.02 base.Rectangle(ax, loc, height, width, row[3], row[4]) data.append([loc, height, width, row[3], row[4]]) return data ================================================ FILE: build/lib/wgdi/example/__init__.py ================================================ ================================================ FILE: build/lib/wgdi/example/align.conf ================================================ [alignment] blockinfo = block information file (.csv) blockinfo_reverse = false classid = class1 gff1 = gff1 file gff2 =
gff2 file lens1 = lens1 file lens2 = lens2 file genome1_name = Genome1 name genome2_name = Genome2 name markersize = 0.5 ks_area = -1,3 position = order colors = red,blue,green figsize = 10,10 savefile = savefile(.csv) savefig= save image(.png, .pdf, .svg) ================================================ FILE: build/lib/wgdi/example/alignmenttrees.conf ================================================ [alignmenttrees] alignment = alignment file (.csv) gff = gff file (reference genome, If alignment has no reference species, delete it) lens = lens file (If alignment has no reference species, delete it) dir = output folder sequence_file = sequence file (.fa) cds_file = cds file (.fa) codon_positon = 1,2,3 (1,2 mean codon1&2; 1,2,3 mean no codon removed) trees_file = trees (.nwk) align_software = (mafft,muscle) tree_software = (iqtree,fasttree) threads = 1 (Number,AUTO) model = MFP trimming = (trimal,divvier) minimum = 4 delete_detail = true ================================================ FILE: build/lib/wgdi/example/ancestral_karyotype.conf ================================================ [ancestral_karyotype] gff = gff file (cat the relevant 'gff' files into a file) pep_file = pep file (cat the relevant 'pep.fa' files into a file) ancestor = ancestor file (this file requires you to provide) mark = aak ancestor_gff = result file ancestor_lens = result file ancestor_pep = result file ancestor_file = result file ================================================ FILE: build/lib/wgdi/example/ancestral_karyotype_repertoire.conf ================================================ [ancestral_karyotype_repertoire] blockinfo = block information (*.csv) # blockinfo: processed *.csv blockinfo_reverse = False gff1 = gff1 file (ancestor's gff) gff2 = gff2 file (the other species's gff) gap = 5 mark = aak1s ancestor = ancestor file #current ancestor file ancestor_new = result file ancestor_pep = ancestor pep file #cat all pep files together ancestor_pep_new = result file ancestor_gff = 
result file ancestor_lens = result file ================================================ FILE: build/lib/wgdi/example/blockinfo.conf ================================================ [blockinfo] blast = blast file gff1 = gff1 file gff2 = gff2 file lens1 = lens1 file lens2 = lens2 file collinearity = collinearity file score = 100 evalue = 1e-5 repeat_number = 20 position = order ks = ks file ks_col = ks_NG86 savefile = block information (*.csv) ================================================ FILE: build/lib/wgdi/example/blockks.conf ================================================ [blockks] lens1 = lens1 file lens2 = lens2 file genome1_name = Genome1 name genome2_name = Genome2 name blockinfo = block information (*.csv) pvalue = 0.2 tandem = true tandem_length = 200 markersize = 1 area = 0,2 block_length = minimum length figsize = 8,8 savefig = save image(.png, .pdf, .svg) ================================================ FILE: build/lib/wgdi/example/circos.conf ================================================ [circos] gff = gff file lens = lens file radius = 0.2 angle_gap = 0.05 ring_width = 0.015 colors = 1:c,2:m,3:blue,4:gold,5:red,6:lawngreen,7:darkgreen,8:k,9:darkred,10:gray alignment = alignment file chr_label = chr ancestor = ancestor alignment file ancestor_location = ancestor file figsize = 10,10 label_size = 9 position = order legend_square = 0.04, 0.04 column_names = 1,2,3,4,5 savefig = result(.png, .pdf, .svg) ================================================ FILE: build/lib/wgdi/example/collinearity.conf ================================================ [collinearity] gff1 = gff1 file gff2 = gff2 file lens1 = lens1 file lens2 = lens2 file blast = blast file blast_reverse = false comparison = genomes multiple = 1 process = 8 evalue = 1e-5 score = 100 grading = 50,30,25 mg = 25,25 pvalue = 1 repeat_number = 20 positon = order savefile = collinearity file ================================================ FILE: build/lib/wgdi/example/conf.ini 
================================================ [ini] mafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft pal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2nal.pl yn00_path = /home/sunpc/micromamba/envs/wgdi/bin/yn00 muscle_path = /home/sunpc/micromamba/envs/wgdi/bin/muscle iqtree_path = /home/sunpc/micromamba/envs/wgdi/bin/iqtree trimal_path = /home/sunpc/micromamba/envs/wgdi/bin/trimal fasttree_path = /home/sunpc/micromamba/envs/wgdi/bin/fasttree divvier_path = /home/sunpc/micromamba/envs/wgdi/bin/divvier ================================================ FILE: build/lib/wgdi/example/corr.conf ================================================ [correspondence] blockinfo = blockinfo file(.csv) lens1 = lens1 file lens2 = lens2 file tandem = true tandem_length = 200 pvalue = 0.2 block_length = 5 tandem_ratio = 0.5 multiple = 1 homo = -1,1 savefile = savefile(.csv) ================================================ FILE: build/lib/wgdi/example/dotplot.conf ================================================ [dotplot] blast = blast file gff1 = gff1 file gff2 = gff2 file lens1 = lens1 file lens2 = lens2 file genome1_name = Genome1 name genome2_name = Genome2 name multiple = 1 score = 100 evalue = 1e-5 repeat_number = 10 position = order blast_reverse = false ancestor_left = ancestor file or none ancestor_top = ancestor file or none markersize = 0.5 figsize = 10,10 savefig = savefile(.png, .pdf, .svg) ================================================ FILE: build/lib/wgdi/example/fusion_positions_database.conf ================================================ [fusion_positions_database] pep = pep file gff = gff file fusion_positions = fusion_positions file # Number of gene sets on each side of the breakpoint ancestor_gff = result file ancestor_lens = result file ancestor_pep = result file ancestor_file = result file ================================================ FILE: build/lib/wgdi/example/fusions_detection.conf ================================================ 
[fusions_detection] blockinfo = block information (*.csv) ancestor = ancestor file #The number of genes spanned by a synteny block on both sides of a breakpoint. min_genes_per_side = 5 density = 0.3 filtered_blockinfo = result blockinfo (.csv) ================================================ FILE: build/lib/wgdi/example/karyotype.conf ================================================ [karyotype] ancestor = ancestor chromosome file width = 0.5 figsize = 10,6.18 savefig = save image(.png, .pdf, .svg) ================================================ FILE: build/lib/wgdi/example/karyotype_mapping.conf ================================================ [karyotype_mapping] blast = blast file blast_reverse = false gff1 = gff1 file gff2 = gff2 file score = 100 evalue = 1e-5 repeat_number = 5 ancestor_left = ancestor location file (Only one of ('left', 'top') can be reserved) ancestor_top = ancestor location file the_other_lens = the other lens file blockinfo = block information (*.csv) blockinfo_reverse = false limit_length = 5 the_other_ancestor_file = result file ================================================ FILE: build/lib/wgdi/example/ks.conf ================================================ [ks] cds_file = cds file #cat all cds files together pep_file = pep file #cat all pep files together align_software = muscle pairs_file = gene pairs file ks_file = ks result ================================================ FILE: build/lib/wgdi/example/ks_fit_result.csv ================================================ ,color,linewidth,linestyle,,,,,, csa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862 vvi_vvi,blue,2,-,3.00367275,1.288717936,0.177816426,,, vvi_oin_gamma,orange,2,-,1.910418336,1.328469514,0.262257112,,, vvi_oin,orange,2,--,4.948194212,0.882608858,0.10426873,,, vvi_csa,green,2,--,2.470770292464022,1.4131842495219498,0.21391959288821544,,, ================================================ FILE: build/lib/wgdi/example/ksfigure.conf 
================================================ [ksfigure] ksfit = ksfit result(*.csv) labelfontsize = 15 legendfontsize = 15 xlabel = none ylabel = none title = none area = 0,2 figsize = 10,6.18 shadow = true (true/false) savefig = save image(.png, .pdf, .svg) ================================================ FILE: build/lib/wgdi/example/kspeaks.conf ================================================ [kspeaks] blockinfo = block information (*.csv) pvalue = 0.2 tandem = true block_length = int number ks_area = 0,10 multiple = 1 homo = 0,1 fontsize = 9 area = 0,3 figsize = 10,6.18 savefig = saving image(.png,.pdf) savefile = ks median savefile ================================================ FILE: build/lib/wgdi/example/peaksfit.conf ================================================ [peaksfit] blockinfo = block information (*.csv) mode = median bins_number = 200 ks_area = 0,10 fontsize = 9 area = 0,3 figsize = 10,6.18 shadow = true savefig = saving image(.png,.pdf,.svg) ================================================ FILE: build/lib/wgdi/example/pindex.conf ================================================ [pindex] alignment = alignment file (.csv) gff = gff file lens = lens file gap = 50 retention = 0.05 diff = 0.05 remove_delta = (true/false) savefile = result file(.csv) ================================================ FILE: build/lib/wgdi/example/polyploidy_classification.conf ================================================ [polyploidy classification] blockinfo = block information (*.csv) ancestor_left = ancestor file ancestor_top = ancestor file classid = class1,class2 same_protochromosome = False same_subgenome = False savefile = result file(.csv) ================================================ FILE: build/lib/wgdi/example/retain.conf ================================================ [retain] alignment = alignment file gff = gff file lens = lens file colors = red,blue,green refgenome = shorthand figsize = 10,12 step = 50 ylabel = y label savefile = retain file
(result) savefig = result(.png, .pdf, .svg) ================================================ FILE: build/lib/wgdi/example/shared_fusion.conf ================================================ [shared_fusion] blockinfo = block information (*.csv) # The new lens file is the output filtered by lens file. lens1 = lens file, new lens file lens2 = lens file, new lens file ancestor_left = ancestor file ancestor_top = ancestor file classid = class1,class2 limit_length = 5 filtered_blockinfo = result blockinfo (.csv) ================================================ FILE: build/lib/wgdi/fusion_positions_database.py ================================================ import pandas as pd import os from Bio import SeqIO class fusion_positions_database: def __init__(self, options): for k, v in options: setattr(self, k, v) print(f'{k} = {v}') def run(self): # Load and remove duplicates from data gff = pd.read_csv(self.gff, sep="\t", header=None, dtype={0: str, 5: int}).drop_duplicates() pep = SeqIO.to_dict(SeqIO.parse(self.pep, "fasta")) df = pd.read_csv(self.fusion_positions, sep="\t", header=None, dtype={0: str, 1: int, 2:int, 3:str}).drop_duplicates() # Load ancestral sequence file if it exists seqs = SeqIO.to_dict(SeqIO.parse(self.ancestor_pep, "fasta")) if os.path.exists(self.ancestor_pep) else {} sf_gff, sf_lens = [], [] # Process fusion positions for _, row in df.iterrows(): newchr = row[3] newgff = gff[(gff[0] == row[0]) & (gff[5] >= row[1] - row[2]) & (gff[5] < row[1] + row[2])].copy() newgff['id'] = [f"{newchr}s{str(row[0]).zfill(2)}g{str(i).zfill(3)}" for i in range(1, len(newgff) + 1)] sf_position = row[1] - newgff.iloc[0, 5] sf_lens.append([newchr, sf_position, len(newgff)]) # For each gene in the filtered GFF region for _, gff_row in newgff.iterrows(): if gff_row[1] in pep and gff_row['id'] not in seqs: gene = pep[gff_row[1]][:] gene.id, gene.description = gff_row['id'], '' seqs[gff_row['id']] = gene # Collect data for the final GFF output sf_gff.append([gff_row['id'], 
newchr, sf_position, gff_row[2], gff_row[3], gff_row[4], gff_row[1]]) # Write sequences to FASTA file SeqIO.write(seqs.values(), self.ancestor_pep, 'fasta') # Save filtered GFF data if sf_gff: sf_gff = pd.DataFrame(sf_gff) sf_gff.rename(columns={3: 'start', 4: 'end', 5: 'strand'}, inplace=True) sf_gff['order'] = sf_gff[0].str[-3:].astype(int) sf_gff[[1, 0, 'start', 'end', 'strand', 'order', 6]].to_csv(self.ancestor_gff, sep="\t", mode='a', index=False, header=None) sf_lens = pd.DataFrame(sf_lens).drop_duplicates() sf_lens.to_csv(self.ancestor_lens, sep="\t", mode='a', index=False, header=None) # Generate ancestral sequence data ancestor = [] for _, row in sf_lens.iterrows(): ancestor.append([row[0], 1, row[1], 'red', 1]) ancestor.append([row[0], row[1] + 1, row[2], 'blue', 1]) pd.DataFrame(ancestor).to_csv(self.ancestor_file, sep="\t", mode='a', index=False, header=None) # Remove duplicates from the output files for file in [self.ancestor_gff, self.ancestor_lens, self.ancestor_file]: df = pd.read_csv(file, header=None).drop_duplicates().to_csv(file, index=False, header=None) ================================================ FILE: build/lib/wgdi/fusions_detection.py ================================================ import pandas as pd from tabulate import tabulate class fusions_detection: def __init__(self, options): self.min_genes_per_side = 5 self.density = 0.3 for k, v in options: setattr(self, k, v) print(f"{k} = {v}") self.min_genes_per_side = int(self.min_genes_per_side) self.density = float(self.density) def run(self): # Load the ancestor file and process the positions ancestor = pd.read_csv(self.ancestor, sep='\t', header=None) position = ancestor.groupby(0)[2].unique().apply(pd.Series) bkinfo = pd.read_csv(self.blockinfo) newbkinfo = bkinfo.head(0) # Iterate over each row in the position dataframe for index, row in position.iterrows(): # Filter the bkinfo dataframe based on chr2 and density filtered_group = bkinfo[(bkinfo['chr2'] == index) & 
(bkinfo['density2'] >= self.density)].copy() # Split the block2 column and stack the resulting series df = filtered_group['block2'].str.split('_', expand=True).stack().astype(int) # Count the number of genes greater and less than the current position filtered_group['greater'] = (df > row[0]).groupby(level=0).sum() filtered_group['less'] = (df < row[0]).groupby(level=0).sum() # Filter the group based on the minimum number of genes per side filtered_group = filtered_group[(filtered_group['greater'] >= self.min_genes_per_side) & (filtered_group['less'] >= self.min_genes_per_side)] # Concatenate the filtered group with the newbkinfo dataframe newbkinfo = pd.concat([newbkinfo, filtered_group]) if len(newbkinfo) ==0: print("\nNo shared fusion breakpoints detected") exit(0) # Get and print the shared fusion positions newbkinfo.to_csv(self.filtered_blockinfo, header=True, index=False) non_overlap_counts = newbkinfo.groupby('chr2').apply(self.count_non_overlapping) data = [(chr2, count) for chr2, count in non_overlap_counts.items()] print("\nThe following are the shared fusion breakpoints and counts:") print(tabulate(data, headers=["Fusion Breakpoint", "Count"], tablefmt="github")) def count_non_overlapping(self, group): if len(group) == 1: return 1 grouped = group.groupby('chr1') total_count = 0 for chr1, chr_group in grouped: chr_group = chr_group.sort_values(by='start1').reset_index(drop=True) count = 0 current_end = -1 for _, row in chr_group.iterrows(): start1, end1 = row['start1'], row['end1'] if start1 > current_end: count += 1 current_end = end1 total_count += count return total_count ================================================ FILE: build/lib/wgdi/karyotype.py ================================================ import matplotlib.pyplot as plt import pandas as pd import wgdi.base as base class karyotype(): def __init__(self, options): self.width = 0.5 for k, v in options: setattr(self, str(k), v) print(str(k), ' = ', v) if hasattr(self, 'figsize'): self.figsize = 
[float(k) for k in self.figsize.split(',')] else: self.figsize = 10, 6.18 if hasattr(self, 'width'): self.width = float(self.width) else: self.width = 0.5 def run(self): fig, ax = plt.subplots(figsize=self.figsize) ancestor_lens = pd.read_csv( self.ancestor, sep="\t", header=None) ancestor_lens[0] = ancestor_lens[0].astype(str) ancestor_lens[3] = ancestor_lens[3].astype(str) ancestor_lens[4] = ancestor_lens[4].astype(int) ancestor_lens[4] = ancestor_lens[4] / ancestor_lens[4].max() chrs = ancestor_lens[0].drop_duplicates().to_list() ax.bar(chrs, 10, color='white', alpha=0) for index, row in ancestor_lens.iterrows(): base.Rectangle(ax, [chrs.index(row[0])-self.width*0.5, row[1]], row[2]-row[1], self.width, row[3], row[4]) ax.tick_params(labelsize=15) ax.spines['top'].set_visible(False) ax.spines['right'].set_visible(False) ax.spines['left'].set_visible(False) ax.spines['bottom'].set_visible(False) ax.set_xticks([]) ax.set_yticks([]) plt.savefig(self.savefig, dpi=500) plt.show() ================================================ FILE: build/lib/wgdi/karyotype_mapping.py ================================================ import numpy as np import pandas as pd import wgdi.base as base class karyotype_mapping: def __init__(self, options): # Initialize default attributes self.blast_reverse = False self.blockinfo_reverse = False self.position = 'order' self.block_length = 5 self.limit_length = 5 self.repeat_number = 20 self.score = 100 self.evalue = 1e-5 # Update attributes with provided keyword arguments and print them for k, v in options: setattr(self, k, v) print(f"{k} = {v}") self.blast_reverse = base.str_to_bool(self.blast_reverse) self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse) self.limit_length = int(self.limit_length) def karyotype_left(self, pairs, ancestor, gff1, gff2): # Loop through each row in ancestor to set color and classification in gff1 for _, row in ancestor.iterrows(): loc_min, loc_max = sorted([row[1], row[2]]) index1 = gff1[(gff1['chr'] 
== row[0]) & (gff1['order'] >= loc_min) & (gff1['order'] <= loc_max)].index gff1.loc[index1, ['color', 'classification']] = row[3], row[4] # Merge pairs with gff1 and update gff2 with color and classification data = pd.merge(pairs, gff1, left_on=0, right_index=True, how='left') data.drop_duplicates(subset=[1], inplace=True) data.set_index(1, inplace=True) gff2.loc[data.index, ['color', 'classification']] = data[['color', 'classification']] return gff2 def karyotype_top(self, pairs, ancestor, gff1, gff2): # Loop through each row in ancestor to set color and classification in gff2 for _, row in ancestor.iterrows(): loc_min, loc_max = sorted([row[1], row[2]]) index1 = gff2[(gff2['chr'] == row[0]) & (gff2['order'] >= loc_min) & (gff2['order'] <= loc_max)].index gff2.loc[index1, ['color', 'classification']] = row[3], row[4] # Merge pairs with gff2 and update gff1 with color and classification data = pd.merge(pairs, gff2, left_on=1, right_index=True, how='left') data.drop_duplicates(subset=[0], inplace=True) data.set_index(0, inplace=True) gff1.loc[data.index, ['color', 'classification']] = data[['color', 'classification']] return gff1 def karyotype_map(self, gff, lens): # Filter gff based on lens index and non-null color gff = gff[gff['chr'].isin(lens.index) & gff['color'].notnull()] ancestor = [] # Group by chromosome and process each group to create ancestor records for chr, group in gff.groupby('chr'): color, class_id, arr = '', 1, [] for _, row in group.iterrows(): if color == row['color'] and class_id == row['classification']: arr.append(row['order']) else: if len(arr) >= self.limit_length: ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)]) color, class_id = row['color'], row['classification'] arr = [] if len(ancestor) >= 1 and color == ancestor[-1][3] and class_id == ancestor[-1][4] and chr == ancestor[-1][0]: arr.append(ancestor[-1][1]) arr += np.random.randint(ancestor[-1][1], ancestor[-1][2], size=ancestor[-1][5]-1).tolist() ancestor.pop() 
arr.append(row['order']) if len(arr) >= self.limit_length: ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)]) ancestor = pd.DataFrame(ancestor) # Adjust min and max positions for each chromosome group for chr, group in ancestor.groupby(0): ancestor.loc[group.index[0], 1] = 1 ancestor.loc[group.index[-1], 2] = lens[chr] ancestor[4] = ancestor[4].astype(int) return ancestor[[0, 1, 2, 3, 4, 5]] def colinear_gene_pairs(self, bkinfo, gff1, gff2): gff1 = gff1.reset_index() gff2 = gff2.reset_index() gff1_indexed = gff1.set_index(['chr', 'order']) gff2_indexed = gff2.set_index(['chr', 'order']) data = [] for _, row in bkinfo.iterrows(): b1 = list(map(int, row['block1'].split('_'))) b2 = list(map(int, row['block2'].split('_'))) for order1, order2 in zip(b1, b2): a = gff1_indexed.loc[(row['chr1'], order1), 1] b = gff2_indexed.loc[(row['chr2'], order2), 1] data.append([a, b]) return pd.DataFrame(data) def new_ancestor(self, ancestor, gff1, gff2, blast): # Iterate through ancestor rows to adjust positions based on neighboring rows for i in range(1, len(ancestor)): if ancestor.iloc[i, 0] == ancestor.iloc[i-1, 0]: area = ancestor.iloc[i, 1] - ancestor.iloc[i-1, 2] if area <= 5: ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1 else: index1 = gff1[(gff1['chr'] == ancestor.iloc[i, 0]) & (gff1['order'] >= ancestor.iloc[i-1, 2]+1) & (gff1['order'] <= ancestor.iloc[i, 1]-1)].index index2 = gff2[gff2['color'] == ancestor.iloc[i-1, 3]].index index3 = gff2[gff2['color'] == ancestor.iloc[i, 3]].index newblast1 = blast[(blast[0].isin(index1)) & (blast[1].isin(index2))] newblast2 = blast[(blast[0].isin(index1)) & (blast[1].isin(index3))] if len(newblast1) >= len(newblast2): ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1 else: ancestor.iloc[i, 1] = ancestor.iloc[i-1, 2] + 1 for chr, group in ancestor.groupby(0): if len(group) == 1: continue newgff1 = gff1[gff1['chr'] == chr] for i in range(1, len(group)): if group.iloc[i, 5] > 200: continue index_left = 
newgff1[(newgff1['order'] >= group.iloc[i, 1]) & (newgff1['order'] <= group.iloc[i, 2])].index blast_left = blast[blast[0].isin(index_left)] index_prev = gff2[gff2['color'] == group.iloc[i-1, 3]].index blast_prev = blast_left[blast_left[1].isin(index_prev)] index_curr = gff2[gff2['color'] == group.iloc[i, 3]].index blast_curr = blast_left[blast_left[1].isin(index_curr)] if len(blast_curr) <= len(blast_prev): ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]-1,3] if i < len(group)-1: index_next = gff2[gff2['color'] == group.iloc[i+1, 3]].index blast_next = blast_left[blast_left[1].isin(index_next)] if len(blast_next) > max(len(blast_prev),len(blast_curr)): ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]+1,3] ancestor['group'] = (ancestor[0].shift(1) != ancestor[0]) | (ancestor[3].shift(1) != ancestor[3]) | (ancestor[4].shift(1) != ancestor[4]) ancestor['group'] = ancestor['group'].cumsum() result = ancestor.groupby('group').agg({ 0: 'first', 1: 'min', 2: 'max', 3: 'first', 4: 'first', }).reset_index(drop=True) return result def run(self): # Read and process block information bkinfo = pd.read_csv(self.blockinfo, index_col='id') bkinfo['chr1'] = bkinfo['chr1'].astype(str) bkinfo['chr2'] = bkinfo['chr2'].astype(str) if self.blockinfo_reverse == True: bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']] bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']] bkinfo = bkinfo[bkinfo['length'] > int(self.block_length)] # Read GFF and lens data gff1 = base.newgff(self.gff1) gff2 = base.newgff(self.gff2) lens = base.newlens(self.the_other_lens, self.position) blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse) # blast.drop_duplicates(subset=[0], keep='first', inplace=True) # Find colinear gene pairs pairs = self.colinear_gene_pairs(bkinfo, gff1, gff2) # Depending on available attributes, call either karyotype_top or karyotype_left if hasattr(self, 'ancestor_top'): ancestor = 
base.read_classification(self.ancestor_top) data = self.karyotype_top(pairs, ancestor, gff1, gff2) elif hasattr(self, 'ancestor_left'): ancestor = base.read_classification(self.ancestor_left) data = self.karyotype_left(pairs, ancestor, gff1, gff2) gff1, gff2 = gff2, gff1 blast.iloc[:, :2] = blast.iloc[:, [1, 0]].to_numpy() else: print('Missing ancestor file.') exit(0) # Map the data and create the final ancestor file the_other_ancestor_file = self.karyotype_map(data, lens) the_other_ancestor_file = self.new_ancestor(the_other_ancestor_file, gff1, gff2, blast) the_other_ancestor_file.to_csv(self.the_other_ancestor_file, sep='\t', header=False, index=False) ================================================ FILE: build/lib/wgdi/ks.py ================================================ import os import sys import numpy as np import pandas as pd from Bio import SeqIO import subprocess from Bio.Phylo.PAML import yn00 import wgdi.base as base class ks: def __init__(self, options): base_conf = base.config() self.pair_pep_file = 'pair.pep' self.pair_cds_file = 'pair.cds' self.prot_align_file = 'prot.aln' self.mrtrans = 'pair.mrtrans' self.pair_yn = 'pair.yn' for k, v in base_conf: setattr(self, str(k), v) for k, v in options: setattr(self, str(k), v) print(f'{str(k)} = {v}') def auto_file(self): pairs = [] with open(self.pairs_file) as f: p = ' '.join(f.readlines()[:30]) # Detect file format and process accordingly if 'path length' in p or 'MAXIMUM GAP' in p: collinearity = base.read_colinearscan(self.pairs_file) pairs = [[v[0], v[2]] for k in collinearity for v in k[1]] elif 'MATCH_SIZE' in p or '## Alignment' in p: collinearity = base.read_mcscanx(self.pairs_file) pairs = [[v[0], v[2]] for k in collinearity for v in k[1]] elif '# Alignment' in p: collinearity = base.read_collinearity(self.pairs_file) pairs = [[v[0], v[2]] for k in collinearity for v in k[1]] elif '###' in p: collinearity = base.read_jcvi(self.pairs_file) pairs = [[v[0], v[2]] for k in collinearity for v in 
k[1]] elif ',' in p: collinearity = pd.read_csv(self.pairs_file, header=None) pairs = collinearity.values.tolist() else: collinearity = pd.read_csv(self.pairs_file, header=None, sep='\t') pairs = collinearity.values.tolist() df = pd.DataFrame(pairs).drop_duplicates() df[0] = df[0].astype(str) df[1] = df[1].astype(str) df.index = df[0] + ',' + df[1] return df def run(self): # Load sequence data cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta")) pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta")) df_pairs = self.auto_file() # Check if ks file exists and load it, otherwise create a new one if os.path.exists(self.ks_file): ks = pd.read_csv(self.ks_file, sep='\t').drop_duplicates() kscopy = ks.copy() names = ks.columns.tolist() names[0], names[1] = names[1], names[0] kscopy.columns = names ks = pd.concat([ks, kscopy]) ks['id'] = ks['id1'] + ',' + ks['id2'] df_pairs.drop(np.intersect1d(df_pairs.index, ks['id'].to_numpy()), inplace=True) ks_file = open(self.ks_file, 'a+') else: ks_file = open(self.ks_file, 'w') ks_file.write('\t'.join(['id1', 'id2', 'ka_NG86', 'ks_NG86', 'ka_YN00', 'ks_YN00']) + '\n') # Filter valid pairs based on sequence data df_pairs = df_pairs[ (df_pairs[0].isin(cds.keys())) & (df_pairs[1].isin(cds.keys())) & (df_pairs[0].isin(pep.keys())) & (df_pairs[1].isin(pep.keys())) ] pairs = df_pairs[[0, 1]].to_numpy() if len(pairs) > 0 and pairs[0][0][:3] == pairs[0][1][:3]: allpairs = [] pair_hash = {} for k in pairs: if k[0] + ',' + k[1] in pair_hash or k[1] + ',' + k[0] in pair_hash: continue else: pair_hash[k[0] + ',' + k[1]] = 1 pair_hash[k[1] + ',' + k[0]] = 1 allpairs.append(k) pairs = allpairs for k in pairs: cds_gene1, cds_gene2 = cds[k[0]], cds[k[1]] cds_gene1.id, cds_gene2.id = 'gene1', 'gene2' pep_gene1, pep_gene2 = pep[k[0]], pep[k[1]] pep_gene1.id, pep_gene2.id = 'gene1', 'gene2' # Write sequences to files SeqIO.write([cds[k[0]], cds[k[1]]], self.pair_cds_file, "fasta") SeqIO.write([pep[k[0]], pep[k[1]]], self.pair_pep_file, "fasta") # 
Compute Ka/Ks values kaks = self.pair_kaks(['gene1', 'gene2']) if not kaks: continue ks_file.write('\t'.join([str(i) for i in list(k) + list(kaks)]) + '\n') ks_file.close() # Clean up temporary files for file in [ self.pair_pep_file, self.pair_cds_file, self.mrtrans, self.pair_yn, self.prot_align_file, '2YN.dN', '2YN.dS', '2YN.t', 'rst', 'rst1', 'yn00.ctl', 'rub' ]: try: os.remove(file) except OSError: pass def pair_kaks(self, k): self.align() pal = self.pal2nal() if not pal: return [] kaks = self.run_yn00() if kaks is None: return [] kaks_new = [ kaks[k[0]][k[1]]['NG86']['dN'], kaks[k[0]][k[1]]['NG86']['dS'], kaks[k[0]][k[1]]['YN00']['dN'], kaks[k[0]][k[1]]['YN00']['dS'] ] return kaks_new def align(self): if self.align_software == 'mafft': try: command = [self.mafft_path, '--quiet', self.pair_pep_file, '>', self.prot_align_file] subprocess.run(" ".join(command), shell=True, check=True) except subprocess.CalledProcessError as e: print(f"Error while running MAFFT: {e}") elif self.align_software == 'muscle': try: command = [self.muscle_path, '-align', self.pair_pep_file, '-output', self.prot_align_file, '-quiet'] subprocess.run(" ".join(command), shell=True, check=True) except subprocess.CalledProcessError as e: print(f"Error while running Muscle: {e}") def pal2nal(self): args = ['perl', self.pal2nal_path, self.prot_align_file, self.pair_cds_file, '-output paml -nogap', '>' + self.mrtrans] command = ' '.join(args) try: os.system(command) except: return False return True def run_yn00(self): yn = yn00.Yn00() yn.alignment = self.mrtrans yn.out_file = self.pair_yn yn.set_options(icode=0, commonf3x4=0, weighting=0, verbose=1) try: run_result = yn.run(command=self.yn00_path) except: run_result = None return run_result ================================================ FILE: build/lib/wgdi/ks_peaks.py ================================================ import matplotlib.pyplot as plt import numpy as np import pandas as pd from scipy.stats import gaussian_kde import
wgdi.base as base class kspeaks: def __init__(self, options): # Default values self.tandem_length = 200 self.figsize = 10, 6.18 self.fontsize = 9 self.block_length = 3 self.area = 0, 3 self.tandem = True # Set options passed in for k, v in options: setattr(self, str(k), v) print(f'{str(k)} = {v}') # Convert string values to lists of floats self.homo = [float(k) for k in self.homo.split(',')] self.ks_area = [float(k) for k in self.ks_area.split(',')] self.figsize = [float(k) for k in self.figsize.split(',')] self.area = [float(k) for k in self.area.split(',')] self.pvalue = float(self.pvalue) self.block_length = int(self.block_length) self.tandem = base.str_to_bool(self.tandem) def remove_tandem(self, bkinfo): """ Remove tandem duplications based on start and end position differences. """ group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy() group.loc[:, 'start'] = group.loc[:, 'start1'] - group.loc[:, 'start2'] group.loc[:, 'end'] = group.loc[:, 'end1'] - group.loc[:, 'end2'] # Drop rows where start or end difference is within tandem length index = group[(group['start'].abs() <= self.tandem_length) | (group['end'].abs() <= self.tandem_length)].index bkinfo = bkinfo.drop(index) return bkinfo def ks_kde(self, df): """ Perform kernel density estimation (KDE) on Ks data. """ # Clean up 'ks' column by removing leading underscores df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:] ks = df['ks'].str.split('_') arr = [] ks_ave = [] # Collect individual Ks values and calculate average Ks per row for v in ks.values: v = [float(k) for k in v if float(k) >= 0] if len(v) == 0: continue arr.extend(v) ks_ave.append(sum(v) / len(v)) # Mean of each row's Ks values # KDE for three distributions: median, average, total kdemedian = gaussian_kde(df['ks_median'].values) kdemedian.set_bandwidth(bw_method=kdemedian.factor / 3.) kdeaverage = gaussian_kde(ks_ave) kdeaverage.set_bandwidth(bw_method=kdeaverage.factor / 3.) 
kdetotal = gaussian_kde(arr) kdetotal.set_bandwidth(bw_method=kdetotal.factor / 3.) return [kdemedian, kdeaverage, kdetotal] def run(self): """ Main method to process the data, perform KDE, and generate the plot. """ plt.rcParams['ytick.major.pad'] = 0 fig, ax = plt.subplots(figsize=self.figsize) # Read the block info file bkinfo = pd.read_csv(self.blockinfo) bkinfo['chr1'] = bkinfo['chr1'].astype(str) bkinfo['chr2'] = bkinfo['chr2'].astype(str) bkinfo['length'] = bkinfo['length'].astype(int) # Filter based on block length and p-value bkinfo = bkinfo[(bkinfo['length'] > self.block_length) & (bkinfo['pvalue'] < self.pvalue)] # Remove tandem duplications if needed if self.tandem == False: bkinfo = self.remove_tandem(bkinfo) # Further filtering based on homozygous range and Ks area bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] >= self.homo[0]] bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] <= self.homo[1]] bkinfo = bkinfo[bkinfo['ks_median'] >= self.ks_area[0]] bkinfo = bkinfo[bkinfo['ks_median'] <= self.ks_area[1]] # Perform KDE on the Ks data kdemedian, kdeaverage, kdetotal = self.ks_kde(bkinfo) # Define the range for the x-axis (Ks values) dist_space = np.linspace(self.area[0], self.area[1], 500) # Plot the KDE results ax.plot(dist_space, kdemedian(dist_space), color='red', label='block median') ax.plot(dist_space, kdeaverage(dist_space), color='black', label='block average') ax.plot(dist_space, kdetotal(dist_space), color='blue', label='all pairs') # Set plot labels, grid, and limits ax.grid() ax.set_xlabel(r'${K_{s}}$', fontsize=20) ax.set_ylabel('Frequency', fontsize=20) ax.tick_params(labelsize=18) ax.set_xlim(self.area) ax.legend(fontsize=20) # Adjust layout for better display plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12) # Save the figure plt.savefig(self.savefig, dpi=500) plt.show() # Save the filtered data to CSV bkinfo.to_csv(self.savefile, index=False) ================================================ FILE: build/lib/wgdi/ksfigure.py 
================================================ import re import sys import matplotlib.pyplot as plt import numpy as np import pandas as pd import wgdi.base as base from scipy import stats class ksfigure(): def __init__(self, options): self.figsize = 10, 6.18 self.legendfontsize = 30 self.labelfontsize = 9 self.area = 0, 3 self.shadow = True self.mode = 'median' for k, v in options: setattr(self, str(k), v) print(str(k), ' = ', v) if self.xlabel == 'none' or self.xlabel == '': self.xlabel = r'Synonymous nucleotide substitution (${K_{s}}$)' if self.ylabel == 'none' or self.ylabel == '': self.ylabel = 'kernel density of syntenic blocks' if self.title == 'none' or self.title == '': self.title = '' self.figsize = [float(k) for k in self.figsize.split(',')] self.area = [float(k) for k in self.area.split(',')] self.shadow = base.str_to_bool(self.shadow) def Gaussian_distribution(self, t, k): y = np.zeros(len(t)) for i in range(0, int((len(k) - 1) / 3)+1): if np.isnan(k[3 * i + 2]): continue k[3 * i + 2] = float(k[3 * i + 2])/np.sqrt(2) k[3 * i + 0] = float(k[3 * i + 0]) * \ np.sqrt(2*np.pi)*float(k[3 * i + 2]) y1 = stats.norm.pdf( t, float(k[3 * i + 1]), float(k[3 * i + 2])) * float(k[3 * i + 0]) y = y+y1 return y def run(self): plt.rcParams['ytick.major.pad'] = 0 fig, ax = plt.subplots(figsize=self.figsize) ksfit = pd.read_csv(self.ksfit, index_col=0) t = np.arange(self.area[0], self.area[1], 0.0005) col = [k for k in ksfit.columns if re.match('Unnamed:', k)] for index, row in ksfit.iterrows(): ax.plot(t, self.Gaussian_distribution( t, row[col].values), linestyle=row['linestyle'], color=row['color'],alpha=0.8, label=index, linewidth=row['linewidth']) if self.shadow == True: ax.fill_between(t, 0, self.Gaussian_distribution(t, row[col].values), color=row['color'], alpha=0.15, interpolate=True, edgecolor=None, label=index,) align = dict(family='Arial', verticalalignment="center", horizontalalignment="center") ax.set_xlabel(self.xlabel, fontsize=self.labelfontsize,
                      labelpad=20, **align)
        ax.set_ylabel(self.ylabel, fontsize=self.labelfontsize,
                      labelpad=20, **align)
        ax.set_title(self.title, weight='bold',
                     fontsize=self.labelfontsize, **align)
        plt.tick_params(labelsize=10)
        handles, labels = ax.get_legend_handles_labels()
        df = pd.DataFrame({'handles': handles, 'labels': labels})
        df.drop_duplicates(subset='labels', keep='first', inplace=True)
        handles, labels = df['handles'].tolist(), df['labels'].tolist()
        if self.shadow == True:
            plt.legend(handles=handles, labels=labels, loc='upper right', prop={
                       'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})
        else:
            plt.legend(handles=handles, labels=labels, loc='upper right', prop={
                       'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)

================================================
FILE: build/lib/wgdi/peaksfit.py
================================================
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.stats import gaussian_kde, linregress

import wgdi.base as base


class peaksfit():
    def __init__(self, options):
        self.figsize = 10, 6.18
        self.fontsize = 9
        self.area = 0, 3
        self.mode = 'median'
        self.histogram_only = False
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.bins_number = int(self.bins_number)
        self.peaks = 1
        self.histogram_only = base.str_to_bool(self.histogram_only)

    def ks_values(self, df):
        df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:]
        ks = df['ks'].str.split('_')
        ks_total = []
        ks_average = []
        for v in ks.values:
            ks_total.extend([float(k) for k in v])
        ks_average = df['ks_average'].values
        ks_median = df['ks_median'].values
        return [ks_median, ks_average, ks_total]

    def gaussian_fuc(self, x, *params):
        y = np.zeros_like(x)
        for i in range(0, len(params), 3):
            amp = float(params[i])
            ctr = float(params[i+1])
            wid = float(params[i+2])
            y = y + amp * np.exp(-((x - ctr)/wid)**2)
        return y

    def kde_fit(self, data, x):
        kde = gaussian_kde(data)
        kde.set_bandwidth(bw_method=kde.factor/3.)
        p = kde(x)
        guess = [1, 1, 1]*self.peaks
        popt, pcov = curve_fit(self.gaussian_fuc, x, p, guess, maxfev=80000)
        popt = [abs(k) for k in popt]
        data = []
        y = self.gaussian_fuc(x, *popt)
        for i in range(0, len(popt), 3):
            array = [popt[i], popt[i+1], popt[i+2]]
            data.append(self.gaussian_fuc(x, *array))
        slope, intercept, r_value, p_value, std_err = linregress(p, y)
        print("\nR-square: "+str(r_value**2))
        print("The gaussian fitting curve parameters are :")
        print(' | '.join([str(k) for k in popt]))
        return y, data

    def run(self):
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        bkinfo = pd.read_csv(self.blockinfo)
        ks_median, ks_average, ks_total = self.ks_values(bkinfo)
        data = eval('ks_'+self.mode)
        data = [k for k in data if self.area[0] <= k <= self.area[1]]
        x = np.linspace(self.area[0], self.area[1], self.bins_number)
        n, bins, patches = ax.hist(data, int(
            self.bins_number), density=1, facecolor='blue', alpha=0.3, label='Histogram')
        if self.histogram_only == True:
            pass
        else:
            y, fit = self.kde_fit(data, x)
            ax.plot(x, y, color='black', linestyle='-', label='Gaussian fitting')
        ax.grid()
        align = dict(family='Arial', verticalalignment="center",
                     horizontalalignment="center")
        ax.set_xlabel(r'${K_{s}}$', fontsize=20)
        ax.set_ylabel('Frequency', fontsize=20)
        ax.tick_params(labelsize=18)
        ax.legend(fontsize=20)
        ax.set_xlim(self.area)
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)

================================================
FILE: build/lib/wgdi/pindex.py
================================================
import os
import sys

import numpy as np
import pandas as pd

import wgdi.base as base


class pindex():
    def __init__(self, options):
        self.remove_delta = True
        self.position = 'order'
        self.retention = 0.05
        self.diff = 0.05
        self.gap = 50
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.gap = int(self.gap)
        self.retention = float(self.retention)
        self.diff = float(self.diff)

    def Pindex(self, sub1, sub2):
        r1 = self.retain(sub1)
        r2 = self.retain(sub2)
        r = []
        for i in range(len(r2)):
            if(r1[i] < self.retention or r2[i] < self.retention):
                r.append(0)
                continue
            d = (r1[i]-r2[i])/(r1[i]+r2[i])*0.5
            if d > self.diff:
                r.append(1)
            elif -d > self.diff:
                r.append(-1)
            else:
                r.append(0)
        a, b, c = len([i for i in r if i == 1]), len(
            [i for i in r if i == -1]), len([i for i in r if i == 0])
        return [a, -b, c, len(r)]

    def retain(self, arr):
        a = []
        for i in range(0, len(arr), 2*self.gap):
            start, end = i-self.gap, i+self.gap
            genenum, retainnum = 0, 0
            for j in range(start, end):
                if((j >= int(len(arr))) or (j < 0)):
                    continue
                else:
                    retainnum += arr[j]
                    genenum += 1
            a.append(float(retainnum/genenum))
        return a

    def run(self):
        alignment = pd.read_csv(self.alignment, header=None, index_col=0)
        alignment.replace(r'\w+', 1, regex=True, inplace=True)
        alignment.replace('.', 0, inplace=True)
        alignment.fillna(0, inplace=True)
        gff = base.newgff(self.gff)
        lens = base.newlens(self.lens, self.position)
        gff = gff[gff['chr'].isin(lens.index)]
        alignment = alignment.join(gff[['chr', self.position]], how='left')
        alignment.dropna(axis=0, how='any', inplace=True)
        p = self.cal_pindex(alignment)
        print('Polyploidy-index: ', p)
        sys.exit(0)

    def cal_pindex(self, alignment):
        data, df = [], []
        columns = alignment.columns[:-2].tolist()
        for i in range(len(columns)-1):
            for j in range(i+1, len(columns)):
                b = []
                for chr, group in alignment.groupby('chr'):
                    sub1 = group.loc[:, columns[i]].tolist()
                    sub2 = group.loc[:, columns[j]].tolist()
                    p = self.Pindex(sub1, sub2)
                    b.append(p)
                    df.append([i, j, chr]+p)
                sub_diver = sum([abs(k[0]+k[1]) for k in b])
                if self.remove_delta == True:
                    sub_total = sum([abs(k[1])+abs(k[0]) for k in b])
                    if sub_total == 0:
                        c = 0
                    else:
                        c = sub_diver/sub_total
                else:
                    sub_total = sum([abs(k[1])+abs(k[0])+abs(k[2]) for k in b])
                    c = sub_diver/sub_total
                data.append(c)
        df = pd.DataFrame(df, columns=[
                          'sub1', 'sub2', 'chr', 'sub1_high', 'sub2_high', 'No_diff', 'Total'])
        df['sub2_high'] = df['sub2_high'].abs()
        self.infomation(df)
        print('\nPolyploidy-index between subgenomes are ', data)
        return sum(data)/len(data)

    def turn_percentage(self, x):
        return '(%.2f%%)' % (x * 100)

    def infomation(self, df):
        data = []
        for names, group in df.groupby(['sub1', 'sub2']):
            newgroup = pd.concat([group.head(1), group],
                                 axis=0, ignore_index=True)
            cols = ['sub1_high', 'sub2_high', 'No_diff', 'Total']
            newgroup.loc[0, cols] = group.loc[:, cols].sum()
            group1 = newgroup.copy()
            group1[cols] = group1[cols].astype(str)
            newgroup['sub1_high'] = (
                newgroup['sub1_high'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['sub2_high'] = (
                newgroup['sub2_high'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['No_diff'] = (
                newgroup['No_diff'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['Total'] = (
                newgroup['Total'] / group['Total'].sum()).apply(self.turn_percentage)
            newgroup[cols] = group1[cols]+newgroup[cols]
            group_list = []
            a = newgroup[['chr']+cols].columns.to_numpy()
            a[0] = 'Chromosome'
            a[1], a[2] = 'Sub_'+str(names[0]+1), 'Sub_'+str(names[1]+1)
            group_list.append(a)
            b = newgroup[['chr']+cols].to_numpy()
            b[0][0] = 'Total'
            for k in b:
                group_list.append(k)
            group_list = np.array(group_list).T
            for k in group_list:
                data.append(k)
        data = pd.DataFrame(data)
        data.to_csv(self.savefile, header=None, index=None)

================================================
FILE: build/lib/wgdi/polyploidy_classification.py
================================================
import pandas as pd
import wgdi.base as base


class polyploidy_classification:
    def __init__(self, options):
        self.same_protochromosome = False
        self.same_subgenome = False
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        self.same_protochromosome = base.str_to_bool(self.same_protochromosome)
        self.same_subgenome = base.str_to_bool(self.same_subgenome)
        # Initialize classid with a default value if not provided
        self.classid = [str(k) for k in getattr(self, 'classid', 'class1,class2').split(',')]

    def run(self):
        # Read input files
        ancestor_left = base.read_classification(self.ancestor_left)
        ancestor_top = base.read_classification(self.ancestor_top)
        bkinfo = pd.read_csv(self.blockinfo)

        # Ensure chr1 and chr2 are treated as strings
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)

        # Filter rows where chr1 and chr2 match ancestor values
        bkinfo = bkinfo[bkinfo['chr1'].isin(ancestor_left[0].values) & bkinfo['chr2'].isin(ancestor_top[0].values)]

        # Initialize additional columns
        bkinfo[self.classid[0]] = 0
        bkinfo[self.classid[1]] = 0
        bkinfo[self.classid[0] + '_color'] = ''
        bkinfo[self.classid[1] + '_color'] = ''
        bkinfo['diff'] = 0.0

        # Processing the first classification (ancestor_left vs chr1)
        for name, group in bkinfo.groupby('chr1'):
            d1 = ancestor_left[ancestor_left[0] == name]
            for index1, row1 in group.iterrows():
                a, b = sorted([row1['start1'], row1['end1']])
                a, b = int(a), int(b)
                for index2, row2 in d1.iterrows():
                    c, d = sorted([row2[1], row2[2]])
                    h = len([k for k in range(a, b) if k in range(c, d)]) / (b - a)
                    if h > bkinfo.loc[index1, 'diff']:
                        bkinfo.loc[index1, 'diff'] = float(h)
                        bkinfo.loc[index1, self.classid[0]] = row2[4]
                        bkinfo.loc[index1, self.classid[0] + '_color'] = row2[3]

        # Reset 'diff' and process the second classification (ancestor_top vs chr2)
        bkinfo['diff'] = 0.0
        for name, group in bkinfo.groupby('chr2'):
            d2 = ancestor_top[ancestor_top[0] == name]
            for index1, row1 in group.iterrows():
                a, b = sorted([row1['start2'], row1['end2']])
                a, b = int(a), int(b)
                for index2, row2 in d2.iterrows():
                    c, d = sorted([row2[1], row2[2]])
                    h = len([k for k in range(a, b) if k in range(c, d)]) / (b - a)
                    if h > bkinfo.loc[index1, 'diff']:
                        bkinfo.loc[index1, 'diff'] = float(h)
                        bkinfo.loc[index1, self.classid[1]] = row2[4]
                        bkinfo.loc[index1, self.classid[1] + '_color'] = row2[3]

        # Optionally keep only rows where both colors (protochromosomes) match
        if self.same_protochromosome == True:
            bkinfo = bkinfo[bkinfo[self.classid[1] + '_color'] == bkinfo[self.classid[0] + '_color']]
        # Optionally keep only rows where both subgenome classes match
        if self.same_subgenome == True:
            bkinfo = bkinfo[bkinfo[self.classid[1]] == bkinfo[self.classid[0]]]

        # Save the result to a CSV file
        bkinfo.to_csv(self.savefile, index=False)

================================================
FILE: build/lib/wgdi/retain.py
================================================
import matplotlib.pyplot as plt
import pandas as pd
import wgdi.base as base


class retain:
    def __init__(self, options):
        self.position = 'order'
        # Initialize the options by setting attributes dynamically
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{str(k)} = {v}")

        # Handle the ylim parameter, which defines the y-axis limits
        self.ylim = [float(k) for k in self.ylim.split(',')] if hasattr(self, 'ylim') else [0, 1]

        # Handle the colors and figsize parameters
        self.colors = [str(k) for k in self.colors.split(',')]
        self.figsize = [float(k) for k in self.figsize.split(',')]

    def run(self):
        # Load GFF and lens data
        gff = base.newgff(self.gff)
        lens = base.newlens(self.lens, self.position)

        # Filter GFF data based on lens chromosome index
        gff = gff[gff['chr'].isin(lens.index)]

        # Load alignment data and join with GFF
        alignment = pd.read_csv(self.alignment, header=None, index_col=0)
        alignment = alignment.join(gff[['chr', self.position]], how='left')

        # Perform alignment processing
        self.retain = self.align_chr(alignment)

        # Save the processed data to a file
        self.retain[self.retain.columns[:-2]].to_csv(self.savefile, sep='\t', header=None)

        # Create a figure for plotting
        fig, axs = plt.subplots(len(lens), 1, sharex=True, sharey=True, figsize=tuple(self.figsize))
        fig.add_subplot(111,
                        frameon=False)
        align = dict(family='DejaVu Sans', verticalalignment="center", horizontalalignment="center")

        # Hide all the spines and ticks on the plot
        for spine in plt.gca().spines.values():
            spine.set_visible(False)
        plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)

        # Group the retain data by chromosome and plot each chromosome's data
        groups = self.retain.groupby('chr')
        for i, chr_name in enumerate(lens.index):
            group = groups.get_group(chr_name)
            if len(lens) == 1:
                for j, col in enumerate(self.retain.columns[:-2]):
                    axs.plot(group['order'].values, group[col].values, linestyle='-', color=self.colors[j], linewidth=1)
                axs.spines['right'].set_visible(False)
                axs.spines['top'].set_visible(False)
                axs.set_ylim(self.ylim)
                axs.tick_params(labelsize=12)
            else:
                # Plot each column's data for the current chromosome
                for j, col in enumerate(self.retain.columns[:-2]):
                    axs[i].plot(group['order'].values, group[col].values, linestyle='-', color=self.colors[j], linewidth=1)
                # Hide the right and top spines for each subplot
                axs[i].spines['right'].set_visible(False)
                axs[i].spines['top'].set_visible(False)
                axs[i].set_ylim(self.ylim)
                axs[i].tick_params(labelsize=12)

        for i, chr_name in enumerate(lens.index):
            if len(lens) == 1:
                x, y = axs.get_xlim()[1] * 0.90, axs.get_ylim()[1] * 0.8
                axs.text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align)
            else:
                # Add a label for the reference genome and chromosome
                x, y = axs[i].get_xlim()[1] * 0.90, axs[i].get_ylim()[1] * 0.8
                axs[i].text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align)

        # Adjust layout and save the figure as an image
        plt.ylabel(f"{self.ylabel}\n\n\n\n", fontsize=18, **align)
        plt.subplots_adjust(left=0.1, right=0.95, top=0.95, bottom=0.05)
        plt.savefig(self.savefig, dpi=500)
        plt.show()

    def align_chr(self, alignment):
        """
        Perform the alignment processing for each chromosome by updating the values.
        """
        for i in alignment.columns[:-2]:
            # Update values: set '1' for valid values, '0' for invalid, and fill NaN with 0
            alignment.loc[alignment[i].str.contains(r'\w', na=False), i] = 1
            alignment.loc[alignment[i] == '.', i] = 0
            alignment.loc[alignment[i] == ' ', i] = 0
            alignment[i] = alignment[i].astype('float64').fillna(0)

            # Apply the moving average function to each group by chromosome
            for chr_name, group in alignment.groupby(['chr']):
                a = self.moving_average(group[i].values.tolist())
                alignment.loc[group.index, i] = a

        return alignment

    def moving_average(self, arr):
        """
        Calculate a moving average over a specified window size.
        This function smooths the input array using a sliding window.
        """
        a = []
        for i in range(len(arr)):
            # Define the window range
            start, end = max(0, i - int(self.step)), min(len(arr), i + int(self.step))
            ave = sum(arr[start:end]) / (end - start)
            a.append(ave)
        return a

================================================
FILE: build/lib/wgdi/run.py
================================================
import argparse
import os
import shutil
import sys

import wgdi
import wgdi.base as base
from wgdi.align_dotplot import align_dotplot
from wgdi.block_correspondence import block_correspondence
from wgdi.block_info import block_info
from wgdi.block_ks import block_ks
from wgdi.circos import circos
from wgdi.dotplot import dotplot
from wgdi.karyotype import karyotype
from wgdi.karyotype_mapping import karyotype_mapping
from wgdi.ks import ks
from wgdi.ks_peaks import kspeaks
from wgdi.ksfigure import ksfigure
from wgdi.peaksfit import peaksfit
from wgdi.pindex import pindex
from wgdi.polyploidy_classification import polyploidy_classification
from wgdi.retain import retain
from wgdi.run_colliearity import mycollinearity
from wgdi.trees import trees
from wgdi.ancestral_karyotype import ancestral_karyotype
from wgdi.ancestral_karyotype_repertoire import ancestral_karyotype_repertoire
from wgdi.shared_fusion import shared_fusion
from wgdi.fusion_positions_database import fusion_positions_database
from wgdi.fusions_detection import fusions_detection

# Argument parser setup
parser = argparse.ArgumentParser(
    prog='wgdi',
    usage='%(prog)s [options]',
    epilog="",
    formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.description = '''\
WGDI(Whole-Genome Duplication Integrated): A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes.

https://wgdi.readthedocs.io/en/latest/
--------------------------------------
'''
parser.add_argument("-v", "--version", action='version', version='0.75')
parser.add_argument("-d", dest="dotplot", help="Show homologous gene dotplot")
parser.add_argument("-icl", dest="improvedcollinearity", help="Improved version of ColinearScan ")
parser.add_argument("-ks", dest="calks", help="Calculate Ka/Ks for homologous gene pairs by YN00")
parser.add_argument("-bk", dest="blockks", help="Show Ks of blocks in a dotplot")
parser.add_argument("-bi", dest="blockinfo", help="Collinearity and Ks speculate whole genome duplication")
parser.add_argument("-c", dest="correspondence", help="Extract event-related genomic alignment")
parser.add_argument("-kp", dest="kspeaks", help="A simple way to get ks peaks")
parser.add_argument("-kf", dest="ksfigure", help="A simple way to draw ks distribution map")
parser.add_argument("-pf", dest="peaksfit", help="Gaussian fitting of ks distribution")
parser.add_argument("-pc", dest="polyploidy_classification", help="Polyploid distinguish among subgenomes")
parser.add_argument("-a", dest="alignment", help="Show event-related genomic alignment in a dotplot")
parser.add_argument("-k", dest="karyotype", help="Show genome evolution from reconstructed ancestors")
parser.add_argument("-ak", dest="ancestral_karyotype", help="Generation of ancestral karyotypes from chromosomes that retain same structures in genomes")
parser.add_argument("-akr", dest="ancestral_karyotype_repertoire", help="Incorporate genes from collinearity blocks into the ancestral karyotype repertoire")
parser.add_argument("-km", dest="karyotype_mapping", help="Mapping from the known karyotype result to this species")
parser.add_argument("-fpd", dest="fusion_positions_database", help="Extract the fusion positions dataset")
parser.add_argument("-fd", dest="fusions_detection", help="Determine whether these fusion events occur in other genomes")
parser.add_argument("-sf", dest="shared_fusion", help="Quickly find shared fusions between species")
parser.add_argument("-at", dest="alignmenttrees", help="Collinear genes construct phylogenetic trees")
parser.add_argument("-p", dest="pindex", help="Polyploidy-index characterize the degree of divergence between subgenomes of a polyploidy")
parser.add_argument("-r", dest="retain", help="Show subgenomes in gene retention or genome fractionation")
parser.add_argument("-ci", dest="circos", help="A simple way to run circos")
parser.add_argument("-conf", dest="configure", help="Display and modify the environment variable")

args = parser.parse_args()

# Function to run subprograms based on options
def run_subprogram(program, conf, name):
    options = base.load_conf(conf, name)
    r = program(options)
    r.run()

# Function to configure environment
def run_configure():
    base.rewrite(args.configure, 'ini')

# Main function to decide which module to run based on input arguments
def module_to_run(argument, conf):
    switcher = {
        'dotplot': (dotplot, conf, 'dotplot'),
        'correspondence': (block_correspondence, conf, 'correspondence'),
        'alignment': (align_dotplot, conf, 'alignment'),
        'retain': (retain, conf, 'retain'),
        'blockks': (block_ks, conf, 'blockks'),
        'blockinfo': (block_info, conf, 'blockinfo'),
        'calks': (ks, conf, 'ks'),
        'circos': (circos, conf, 'circos'),
        'kspeaks': (kspeaks, conf, 'kspeaks'),
        'peaksfit': (peaksfit, conf, 'peaksfit'),
        'ksfigure': (ksfigure, conf, 'ksfigure'),
        'pindex': (pindex, conf, 'pindex'),
        'alignmenttrees': (trees, conf, 'alignmenttrees'),
        'improvedcollinearity': (mycollinearity, conf,
                                 'collinearity'),
        'configure': run_configure,
        'polyploidy_classification': (polyploidy_classification, conf, 'polyploidy classification'),
        'karyotype': (karyotype, conf, 'karyotype'),
        'ancestral_karyotype': (ancestral_karyotype, conf, 'ancestral_karyotype'),
        'karyotype_mapping': (karyotype_mapping, conf, 'karyotype_mapping'),
        'ancestral_karyotype_repertoire': (ancestral_karyotype_repertoire, conf, 'ancestral_karyotype_repertoire'),
        'shared_fusion': (shared_fusion, conf, 'shared_fusion'),
        'fusion_positions_database': (fusion_positions_database, conf, 'fusion_positions_database'),
        'fusions_detection': (fusions_detection, conf, 'fusions_detection'),
    }
    if argument == 'configure':
        run_configure()
    else:
        program, conf, name = switcher.get(argument)
        if program:
            run_subprogram(program, conf, name)

# Main entry point
def main():
    path = wgdi.__path__[0]
    options = {
        'dotplot': 'dotplot.conf',
        'correspondence': 'corr.conf',
        'alignment': 'align.conf',
        'retain': 'retain.conf',
        'blockks': 'blockks.conf',
        'blockinfo': 'blockinfo.conf',
        'calks': 'ks.conf',
        'circos': 'circos.conf',
        'kspeaks': 'kspeaks.conf',
        'ksfigure': 'ksfigure.conf',
        'pindex': 'pindex.conf',
        'alignmenttrees': 'alignmenttrees.conf',
        'peaksfit': 'peaksfit.conf',
        'configure': 'conf.ini',
        'improvedcollinearity': 'collinearity.conf',
        'polyploidy_classification': 'polyploidy_classification.conf',
        'karyotype': 'karyotype.conf',
        'ancestral_karyotype': 'ancestral_karyotype.conf',
        'ancestral_karyotype_repertoire': 'ancestral_karyotype_repertoire.conf',
        'karyotype_mapping': 'karyotype_mapping.conf',
        'shared_fusion': 'shared_fusion.conf',
        'fusion_positions_database': 'fusion_positions_database.conf',
        'fusions_detection': 'fusions_detection.conf',
    }
    for arg in vars(args):
        value = getattr(args, arg)
        if value is not None:
            if value in ['?', 'help', 'example']:
                with open(os.path.join(path, 'example', options[arg])) as f:
                    print(f.read())
                if arg == 'ksfigure' and not os.path.exists('ks_fit_result.csv'):
                    shutil.copy2(os.path.join(wgdi.__path__[0], 'example/ks_fit_result.csv'), os.getcwd())
            elif not os.path.exists(value):
                print(f'{value} does not exist')
                sys.exit(0)
            else:
                module_to_run(arg, value)

if __name__ == "__main__":
    main()

================================================
FILE: build/lib/wgdi/run_colliearity.py
================================================
import gc
import re
import sys
from multiprocessing import Pool

import numpy as np
import pandas as pd

import wgdi.base as base
import wgdi.collinearity as improvedcollinearity


class mycollinearity():
    def __init__(self, options):
        # Initialize parameters with default values
        self.repeat_number = 10
        self.multiple = 1
        self.score = 100
        self.evalue = 1e-5
        self.blast_reverse = False
        self.over_gap = 5
        self.comparison = 'genomes'
        self.options = options

        for k, v in options:
            setattr(self, str(k), v)
            print(f"{str(k)} = {v}")

        self.position = 'order'

        # Parse grading values
        if hasattr(self, 'grading'):
            self.grading = [int(k) for k in self.grading.split(',')]
        else:
            self.grading = [50, 40, 25]

        # Ensure process is an integer
        if hasattr(self, 'process'):
            self.process = int(self.process)
        else:
            self.process = 4

        self.over_gap = int(self.over_gap)
        base.str_to_bool(self.blast_reverse)

    def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):
        bluenum = rednum
        blast = blast.sort_values(by=[0, 11], ascending=[True, False])

        def assign_grading(group):
            group['cumcount'] = group.groupby(1).cumcount()
            group = group[group['cumcount'] <= repeat_number]
            group['grading'] = pd.cut(
                group['cumcount'],
                bins=[-1, 0, bluenum, repeat_number],
                labels=self.grading,
                right=True
            )
            return group

        newblast = blast.groupby(['chr1', 'chr2']).apply(assign_grading).reset_index(drop=True)
        newblast['grading'] = newblast['grading'].astype(int)
        return newblast[newblast['grading'] > 0]

    def deal_blast_for_genomes(self, blast, rednum, repeat_number):
        # Initialize the grading column
        blast['grading'] = 0

        # Define the blue number as the sum of rednum and the predefined constant
        bluenum = 4 + rednum

        # Get the indices for each group by sorting the 11th column in descending order
        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()
                 for name, group in blast.groupby([0])]

        # Split the indices into red, blue, and gray groups
        reddata = np.array([k[:rednum] for k in index], dtype=object)
        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)
        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)

        # Concatenate the results into flat lists
        redindex = np.concatenate(reddata) if reddata.size else []
        blueindex = np.concatenate(bluedata) if bluedata.size else []
        grayindex = np.concatenate(graydata) if graydata.size else []

        # Update the grading column based on the group indices
        blast.loc[redindex, 'grading'] = self.grading[0]
        blast.loc[blueindex, 'grading'] = self.grading[1]
        blast.loc[grayindex, 'grading'] = self.grading[2]

        # Return only the rows with non-zero grading
        return blast[blast['grading'] > 0]

    def run(self):
        # Read and process lens files
        lens1 = base.newlens(self.lens1, 'order')
        lens2 = base.newlens(self.lens2, 'order')

        # Read and process gff files
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)

        # Filter gff data based on lens indices
        gff1 = gff1[gff1['chr'].isin(lens1.index)]
        gff2 = gff2[gff2['chr'].isin(lens2.index)]

        # Process blast data
        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)

        # Map positions and chromosome information
        blast['loc1'] = blast[0].map(gff1[self.position])
        blast['loc2'] = blast[1].map(gff2[self.position])
        blast['chr1'] = blast[0].map(gff1['chr'])
        blast['chr2'] = blast[1].map(gff2['chr'])

        # Apply blast filtering and grading
        if self.comparison.lower() == 'genomes':
            blast = self.deal_blast_for_genomes(blast, int(self.multiple), int(self.repeat_number))
        if self.comparison.lower() == 'chromosomes':
            blast = self.deal_blast_for_chromosomes(blast, int(self.multiple),
                                                    int(self.repeat_number))

        print(f"The filtered homologous gene pairs are {len(blast)}.\n")
        if len(blast) < 1:
            print("Stopped!\n\nIt may be that the id1 and id2 in the BLAST file do not match with (gff1, lens1) and (gff2, lens2).")
            sys.exit(1)

        # Group blast data by 'chr1' and 'chr2'
        total = []
        for (chr1, chr2), group in blast.groupby(['chr1', 'chr2']):
            total.append([chr1, chr2, group])
        del blast, group
        gc.collect()

        # Determine chunk size for multiprocessing
        n = int(np.ceil(len(total) / float(self.process)))
        result, data = '', []
        try:
            # Initialize multiprocessing Pool
            pool = Pool(self.process)
            for i in range(0, len(total), n):
                # Apply single_pool function asynchronously
                data.append(pool.apply_async(
                    self.single_pool,
                    args=(total[i:i + n], gff1, gff2, lens1, lens2)
                ))
            pool.close()
            pool.join()
        except:
            pool.terminate()

        for k in data:
            # Collect results from async tasks
            text = k.get()
            if text:
                result += text

        # Write final output to file
        result = re.split('\n', result)
        fout = open(self.savefile, 'w')
        num = 1
        for line in result:
            if re.match(r"# Alignment", line):
                # Replace alignment number
                s = f'# Alignment {num}:'
                fout.write(s + line.split(':')[1] + '\n')
                num += 1
                continue
            if len(line) > 0:
                fout.write(line + '\n')
        fout.close()
        sys.exit(0)

    def single_pool(self, group, gff1, gff2, lens1, lens2):
        text = ''
        for bk in group:
            chr1, chr2 = str(bk[0]), str(bk[1])
            print(f'Running {chr1} vs {chr2}')

            # Extract and sort points
            points = bk[2][['loc1', 'loc2', 'grading']].sort_values(
                by=['loc1', 'loc2'], ascending=[True, True]
            )

            # Initialize collinearity analysis
            collinearity = improvedcollinearity.collinearity(
                self.options, points)
            data = collinearity.run()
            if not data:
                continue

            # Extract gene information
            gf1 = gff1[gff1['chr'] == chr1].reset_index().set_index('order')[[1, 'strand']]
            gf2 = gff2[gff2['chr'] == chr2].reset_index().set_index('order')[[1, 'strand']]

            n = 1
            for block, evalue, score in data:
                if len(block) < self.over_gap:
                    continue

                # Map gene names and strands
                block['name1'] = block['loc1'].map(gf1[1])
                block['name2'] = block['loc2'].map(gf2[1])
                block['strand1'] = block['loc1'].map(gf1['strand'])
                block['strand2'] = block['loc2'].map(gf2['strand'])
                block['strand'] = np.where(
                    block['strand1'] == block['strand2'], '1', '-1'
                )

                # Prepare text output
                block['text'] = block.apply(
                    lambda x: f"{x['name1']} {x['loc1']} {x['name2']} {x['loc2']} {x['strand']}\n",
                    axis=1
                )

                # Determine alignment mark
                a, b = block['loc2'].head(2).values
                mark = 'plus' if a < b else 'minus'

                # Append alignment information
                text += f'# Alignment {n}: score={score} pvalue={evalue} N={len(block)} {chr1}&{chr2} {mark}\n'
                text += ''.join(block['text'].values)
                n += 1
        return text

================================================
FILE: build/lib/wgdi/shared_fusion.py
================================================
import pandas as pd
import wgdi.base as base


class shared_fusion:
    def __init__(self, options):
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")

        # Handle classid and limit_length options
        self.classid = [str(k) for k in self.classid.split(',')] if hasattr(self, 'classid') else ['class1', 'class2']
        self.limit_length = int(self.limit_length) if hasattr(self, 'limit_length') else 20

        # Clean and split lens files
        self.lens1 = self.lens1.replace(' ', '').split(',')
        self.lens2 = self.lens2.replace(' ', '').split(',')

    def run(self):
        # Read classification files and block information
        ancestor_left = base.read_classification(self.ancestor_left)
        ancestor_top = base.read_classification(self.ancestor_top)
        bkinfo = pd.read_csv(self.blockinfo)

        # Preprocess blockinfo columns
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo['start1'] = bkinfo['start1'].astype(int)
        bkinfo['end1'] = bkinfo['end1'].astype(int)
        bkinfo['start2'] = bkinfo['start2'].astype(int)
        bkinfo['end2'] = bkinfo['end2'].astype(int)

        # Filter based on ancestor chromosomes
        bkinfo = bkinfo[(bkinfo['chr1'].isin(ancestor_left[0].values)) &
                        (bkinfo['chr2'].isin(ancestor_top[0].values))]

        # Read lens files
        lens1 = pd.read_csv(self.lens1[0], sep='\t', header=None)
        lens2 = pd.read_csv(self.lens2[0], sep='\t', header=None)
        lens1[0] = lens1[0].astype(str)
        lens2[0] = lens2[0].astype(str)

        # Perform block fusion analysis
        blockinfoout = self.block_fusions(bkinfo, ancestor_left, ancestor_top)

        # Apply filters based on breakpoints and length
        blockinfoout = blockinfoout[(blockinfoout['breakpoints1'] == 1) & (blockinfoout['breakpoints2'] == 1)]
        blockinfoout = blockinfoout[(blockinfoout['break_length1'] >= self.limit_length) & (blockinfoout['break_length2'] >= self.limit_length)]

        # Save the filtered block info
        blockinfoout.to_csv(self.filtered_blockinfo, index=False)

        # Filter lens data based on the blockinfoout
        lens1 = lens1[lens1[0].isin(blockinfoout['chr1'].values)]
        lens2 = lens2[lens2[0].isin(blockinfoout['chr2'].values)]

        # Save filtered lens data
        lens1.to_csv(self.lens1[1], sep='\t', index=False, header=False)
        lens2.to_csv(self.lens2[1], sep='\t', index=False, header=False)

    def block_fusions(self, bkinfo, ancestor_left, ancestor_top):
        # Initialize new columns in the bkinfo dataframe
        bkinfo['breakpoints1'] = 0
        bkinfo['breakpoints2'] = 0
        bkinfo['break_length1'] = 0
        bkinfo['break_length2'] = 0

        for index, row in bkinfo.iterrows():
            # Process species 1 (chr1)
            a, b = sorted([row['start1'], row['end1']])
            d1 = ancestor_left[(ancestor_left[0] == row['chr1']) & (ancestor_left[2] >= a) & (ancestor_left[1] <= b)]
            if len(d1) > 1:
                bkinfo.loc[index, 'breakpoints1'] = 1
                breaklength_max = 0
                for _, row2 in d1.iterrows():
                    length_in = len([k for k in range(a, b) if k in range(row2[1], row2[2])])
                    length_out = (b - a) - length_in
                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)
                bkinfo.loc[index, 'break_length1'] = breaklength_max

            # Process species 2 (chr2)
            c, d = sorted([row['start2'], row['end2']])
            d2 = ancestor_top[(ancestor_top[0] == row['chr2']) & (ancestor_top[2] >= c) & (ancestor_top[1] <= d)]
            if len(d2) > 1:
                bkinfo.loc[index, 'breakpoints2'] = 1
                breaklength_max = 0
                for _, row2 in d2.iterrows():
                    length_in = len([k for k in range(c, d) if k in range(row2[1], row2[2])])
                    length_out = (d - c) - length_in
                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)
                bkinfo.loc[index, 'break_length2'] = breaklength_max

        return bkinfo

================================================
FILE: build/lib/wgdi/trees.py
================================================
import os
import shutil
from io import StringIO

import numpy as np
import pandas as pd
from Bio import AlignIO, Seq, SeqIO, SeqRecord
import subprocess
import wgdi.base as base


class trees():
    def __init__(self, options):
        base_conf = base.config()
        self.position = 'order'
        self.alignfile = ''
        self.align_trimming = ''
        self.trimming = 'trimal'
        self.threads = '1'
        self.minimum = 4
        self.tree_software = 'iqtree'
        self.delete_detail = True
        for k, v in base_conf:
            setattr(self, str(k), v)
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        if hasattr(self, 'codon_position'):
            self.codon_position = [
                int(k)-1 for k in self.codon_position.split(',')]
        else:
            self.codon_position = [0, 1, 2]
        self.delete_detail = base.str_to_bool(self.delete_detail)

    def grouping(self, alignment):
        data = []
        indexs = []
        if not os.path.exists(self.dir):
            os.makedirs(self.dir)
        sequence = SeqIO.to_dict(SeqIO.parse(self.sequence_file, "fasta"))
        if hasattr(self, 'cds_file'):
            seq_cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta"))
        for index, row in alignment.iterrows():
            file = base.gen_md5_id(str(row.values))
            self.sequencefile = os.path.join(self.dir, file+'.fasta')
            self.alignfile = os.path.join(self.dir, file+'.aln')
            self.align_trimming = self.alignfile+'.trimming'
            self.treefile = os.path.join(self.dir, file+'.aln.treefile')
            if os.path.isfile(self.treefile) and os.path.isfile(self.alignfile):
                data.append(self.treefile)
                indexs.append(index)
                continue
            ids = []
            ids_cds = []
            for i in range(len(row)):
                if type(row[i]) == float and np.isnan(row[i]):
                    continue
                gene_sequence = sequence[row[i]]
                gene_sequence.id = str(int(i)+1)
                gene_sequence.description = ''
                ids.append(gene_sequence)
            SeqIO.write(ids, self.sequencefile, "fasta")
            self.align()
            if hasattr(self, 'cds_file'):
                self.seqcdsfile = os.path.join(self.dir, file+'.cds.fasta')
                for i in range(len(row)):
                    if type(row[i]) == float and np.isnan(row[i]):
                        continue
                    gene_cds = seq_cds[row[i]]
                    gene_cds.id = str(int(i)+1)
                    ids_cds.append(gene_cds)
                SeqIO.write(ids_cds, self.seqcdsfile, "fasta")
                self.pal2nal()
                self.codon()
            if self.trimming.upper() == 'TRIMAL':
                self.trimal()
            if self.trimming.upper() == 'DIVVIER':
                self.divvier()
            self.buildtrees()
            if os.path.isfile(self.treefile):
                data.append(self.treefile)
        return data

    def codon(self):
        if self.codon_position == [0, 1, 2]:
            shutil.move(self.alignfile+'.mrtrans', self.alignfile)
            return True
        records = list(SeqIO.parse(self.alignfile+'.mrtrans', 'fasta'))
        if len(records) == 0:
            return False
        newrecords = []

        def final_list(test_list, x, y):
            return [test_list[i+j] for i in range(0, len(test_list), x) for j in y]

        for k in records:
            if len(k.seq) % 3 > 0:
                return False
            seq = final_list(k.seq, 3, self.codon_position)
            k.seq = ''.join(seq)
            newrecords.append(SeqRecord.SeqRecord(
                Seq.Seq(k.seq), id=k.id, description=''))
        SeqIO.write(newrecords, self.alignfile, 'fasta')
        return True

    def pal2nal(self):
        args = ['perl', self.pal2nal_path, self.alignfile,
                self.seqcdsfile, '-output fasta', '>'+self.alignfile+'.mrtrans']
        command = ' '.join(args)
        try:
            os.system(command)
        except:
            return False
        return True

    def align(self):
        if self.align_software == 'mafft':
            try:
                command = [self.mafft_path, '--quiet', self.sequencefile, '>', self.alignfile]
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running MAFFT: {e}")
        if self.align_software == 'muscle':
            try:
                command = [self.muscle_path, '-align', self.sequencefile, '-output', self.alignfile, '-quiet']
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running Muscle: {e}")

    def trimal(self):
        args = [self.trimal_path, '-in', self.alignfile,
                '-out', self.align_trimming, '-automated1']
        command = ' '.join(args)
        try:
            os.system(command)
        except:
            return False
        return True

    def divvier(self):
        args = [self.divvier_path, '-mincol', '4', '-divvygap', self.alignfile]
        command = ' '.join(args)
        try:
            os.system(command)
            os.rename(self.alignfile+'.divvy.fas', self.align_trimming)
        except:
            return False
        return True

    def buildtrees(self):
        try:
            if self.tree_software.upper() == 'IQTREE':
                args = [self.iqtree_path, '-s', self.align_trimming,
                        '-m', self.model, '-T', self.threads, '--quiet']
                command = ' '.join(args)
                os.system(command)
                os.rename(self.align_trimming+'.treefile', self.treefile)
            elif self.tree_software.upper() == 'FASTTREE':
                args = [self.fasttree_path,
                        self.align_trimming, '>', self.treefile]
                command = ' '.join(args)
                os.system(command)
        except:
            return False
        if self.delete_detail == True:
            for file in (self.sequencefile, self.align_trimming+'.bionj', self.align_trimming+'.iqtree',
                         self.align_trimming+'.ckp.gz', self.align_trimming+'.log',
                         self.align_trimming+'.mldist', self.align_trimming+'.model.gz'):
                try:
                    os.remove(file)
                except OSError:
                    pass
        return True

    def run(self):
        alignment = pd.read_csv(self.alignment, header=None)
        alignment.replace('.', np.nan, inplace=True)
        alignment.dropna(thresh=int(self.minimum), inplace=True)
        if hasattr(self, 'gff') and hasattr(self, 'lens'):
            gff = base.newgff(self.gff)
            lens = base.newlens(self.lens, self.position)
            alignment = pd.merge(
                alignment, gff[['chr', self.position]], left_on=0, right_on=gff.index, how='left')
            alignment.dropna(subset=['chr', 'order'], inplace=True)
            alignment['order'] = alignment['order'].astype(int)
            alignment = alignment[alignment['chr'].isin(lens.index)]
            alignment.drop(alignment.columns[-2:], axis=1, inplace=True)
        data = self.grouping(alignment)
        fout = open(self.trees_file,
'w') fout.close() for i in range(0, len(data), 100): trees = ' '.join([str(k) for k in data[i:i+100]]) args = ['cat', trees, '>>', self.trees_file] command = ' '.join([str(k) for k in args]) os.system(command) df = pd.read_csv(self.trees_file, header=None, sep='\t') df[0].to_csv(self.trees_file, index=None, sep='\t', header=False) print("done") ================================================ FILE: command.txt ================================================ python setup.py sdist bdist_wheel twine upload dist/* ================================================ FILE: setup.py ================================================ #!/usr/bin/env python # -*- coding: UTF-8 -*- from setuptools import find_packages, setup with open("README.md", "r", encoding='utf-8') as fh: long_description = fh.read() required = ['pandas>=1.1.0', 'numpy', 'biopython', 'matplotlib', 'scipy', 'tabulate'] setup( name="wgdi", version="0.75", author="Pengchuan Sun", author_email="sunpengchuan@gmail.com", description="A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes", license="BSD License", long_description=long_description, long_description_content_type="text/markdown", url="https://github.com/SunPengChuan/wgdi", packages=find_packages(), package_data={'': ['*.conf','*.ini', '*.csv']}, classifiers=[ "Intended Audience :: Science/Research", "Programming Language :: Python :: 3", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", ], entry_points={ 'console_scripts': [ 'wgdi = wgdi.run:main', ] }, zip_safe=True, install_requires=required ) ================================================ FILE: wgdi/__init__.py ================================================ ================================================ FILE: wgdi/align_dotplot.py ================================================ import re import matplotlib.pyplot as plt import numpy as np import pandas as pd import wgdi.base as base class align_dotplot: def 
__init__(self, options): # Default values self.position = 'order' self.figsize = 'default' self.classid = 'class1' # Initialize from options for k, v in options: setattr(self, str(k), v) print(f'{k} = {v}') self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')] self.colors = [str(k) for k in getattr(self, 'colors', 'red,blue,green,black,orange').split(',')] self.ancestor_top = None if getattr(self, 'ancestor_top', 'none') == 'none' else self.ancestor_top self.ancestor_left = None if getattr(self, 'ancestor_left', 'none') == 'none' else self.ancestor_left self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse) def pair_position(self, alignment, loc1, loc2, colors): alignment.index = alignment.index.map(loc1) data = [] for i, k in enumerate(alignment.columns): df = alignment[k].map(loc2).dropna() for idx, row in df.items(): data.append([idx, row, colors[i]]) return pd.DataFrame(data, columns=['loc1', 'loc2', 'color']) def run(self): axis = [0, 1, 1, 0] # Lens generation and figure size lens1 = base.newlens(self.lens1, self.position) lens2 = base.newlens(self.lens2, self.position) if re.search(r'\d', self.figsize): self.figsize = [float(k) for k in self.figsize.split(',')] else: self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10 plt.rcParams['ytick.major.pad'] = 0 # Create plot fig, ax = plt.subplots(figsize=self.figsize) ax.xaxis.set_ticks_position('top') step1, step2 = 1 / float(lens1.sum()), 1 / float(lens2.sum()) # Process Ancestor Data if self.ancestor_left: axis[0] = -0.02 lens_ancestor_left = self.process_ancestor(self.ancestor_left, lens1.index) if self.ancestor_top: axis[3] = -0.02 lens_ancestor_top = self.process_ancestor(self.ancestor_top, lens2.index) base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, self.genome1_name, self.genome2_name, [0, 1]) # Process GFF files gff1, gff2 = base.newgff(self.gff1), base.newgff(self.gff2) gff1 = base.gene_location(gff1, lens1, step1, self.position) gff2 = 
base.gene_location(gff2, lens2, step2, self.position) if self.ancestor_top: self.ancestor_position(ax, gff2, lens_ancestor_top, 'top') if self.ancestor_left: self.ancestor_position(ax, gff1, lens_ancestor_left, 'left') # Process block info and alignment bkinfo = self.process_blockinfo(lens1,lens2) align = self.alignment(gff1, gff2, bkinfo) alignment = align[gff1.columns[-len(bkinfo[self.classid].drop_duplicates()):]] alignment.to_csv(self.savefile, header=False) # Create scatter plot df = self.pair_position(alignment, gff1['loc'], gff2['loc'], self.colors) plt.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'], alpha=0.5, edgecolors=None, linewidths=0, marker='o') ax.axis(axis) plt.subplots_adjust(left=0.07, right=0.97, top=0.93, bottom=0.03) plt.savefig(self.savefig, dpi=500) plt.show() def process_ancestor(self, ancestor_file, lens_index): df = pd.read_csv(ancestor_file, sep="\t", header=None) df[0] = df[0].astype(str) df[3] = df[3].astype(str) df[4] = df[4].astype(int) df[4] = df[4] / df[4].max() return df[df[0].isin(lens_index)] def process_blockinfo(self, lens1, lens2): bkinfo = pd.read_csv(self.blockinfo, index_col='id') if self.blockinfo_reverse == True: bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']] bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']] bkinfo['chr1'] = bkinfo['chr1'].astype(str) bkinfo['chr2'] = bkinfo['chr2'].astype(str) bkinfo[self.classid] = bkinfo[self.classid].astype(str) return bkinfo[bkinfo['chr1'].isin(lens1.index) & (bkinfo['chr2'].isin(lens2.index))] def alignment(self, gff1, gff2, bkinfo): gff1['uid'] = gff1['chr'] + 'g' + gff1['order'].astype(str) gff2['uid'] = gff2['chr'] + 'g' + gff2['order'].astype(str) gff1['id'] = gff1.index gff2['id'] = gff2.index for cl, group in bkinfo.groupby(self.classid): name = f'l{cl}' gff1[name] = '' group = group.sort_values(by=['length'], ascending=True) for _, row in group.iterrows(): block = self.create_block_dataframe(row) if block.empty: continue block1_min, 
block1_max = block['block1'].agg(['min', 'max']) area = gff1[(gff1['chr'] == row['chr1']) & (gff1['order'] >= block1_min) & (gff1['order'] <= block1_max)].index block['id1'] = (row['chr1'] + 'g' + block['block1'].astype(str)).map( dict(zip(gff1['uid'], gff1.index))) block['id2'] = (row['chr2'] + 'g' + block['block2'].astype(str)).map( dict(zip(gff2['uid'], gff2.index))) gff1.loc[block['id1'].values, name] = block['id2'].values gff1.loc[gff1.index.isin(area) & gff1[name].eq(''), name] = '.' return gff1 def create_block_dataframe(self, row): b1, b2, ks = row['block1'].split('_'), row['block2'].split('_'), row['ks'].split('_') ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks)) block = pd.DataFrame(np.array([b1, b2, ks]).T, columns=['block1', 'block2', 'ks']) block['block1'] = block['block1'].astype(int) block['block2'] = block['block2'].astype(int) block['ks'] = block['ks'].astype(float) return block[(block['ks'] <= self.ks_area[1]) & (block['ks'] >= self.ks_area[0])].drop_duplicates(subset=['block1'], keep='first') def ancestor_position(self, ax, gff, lens, mark): for _, row in lens.iterrows(): loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2]))].index loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc'] if mark == 'top': width = abs(loc1-loc2) loc = [min(loc1, loc2), 0] height = -0.02 if mark == 'left': height = abs(loc1-loc2) loc = [-0.02, min(loc1, loc2), ] width = 0.02 base.Rectangle(ax, loc, height, width, row[3], row[4]) ================================================ FILE: wgdi/ancestral_karyotype.py ================================================ import pandas as pd from Bio import SeqIO import wgdi.base as base class ancestral_karyotype: def __init__(self, options): self.mark = 'aak' # Set attributes from options for k, v in options: setattr(self, str(k), v) print(f"{k} = {v}") def run(self): # Load and filter data gff = base.newgff(self.gff) ancestor = 
base.read_classification(self.ancestor) gff = gff[gff['chr'].isin(ancestor[0].values.tolist())] # Create new gff copy and initialize required variables newgff = gff.copy() data, num = [], 1 # Create dictionary mapping chromosome to order chr_arr = ancestor[3].drop_duplicates().to_list() chr_dict = {chr: idx + 1 for idx, chr in enumerate(chr_arr)} ancestor['order'] = ancestor[3].map(chr_dict) dict1, dict2 = {}, {} # Process ancestor and gff information for (cla, order), group in ancestor.groupby([4, 'order'], sort=[False, False]): for index, row in group.iterrows(): index1 = gff[(gff['chr'] == row[0]) & (gff['order'] >= row[1]) & (gff['order'] <= row[2])].index newgff.loc[index1, 'chr'] = str(num) # Store results in data for k in index1: data.append(newgff.loc[k, :].values.tolist() + [k]) dict1[str(num)] = cla dict2[str(num)] = group[3].values[0] num += 1 # Create dataframe from the data collected df = pd.DataFrame(data) # Filter based on peptide file pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta")) df = df[df[6].isin(pep.keys())] # Assign new names and order for name, group in df.groupby(0): df.loc[group.index, 'order'] = range(1, len(group) + 1) df.loc[group.index, 'newname'] = [f"{self.mark}{name}g{i:05d}" for i in range(1, len(group) + 1)] # Set data types and sort df['order'] = df['order'].astype(int) df = df[[0, 'newname', 1, 2, 3, 'order', 6]].sort_values(by=[0, 'order']) # Save output files df.to_csv(self.ancestor_gff, sep="\t", index=False, header=None) lens = df.groupby(0).max()[[2, 'order']] lens.to_csv(self.ancestor_lens, sep="\t", header=None) # Add extra columns and save final results lens[1] = 1 lens['color'] = lens.index.map(dict2) lens['class'] = lens.index.map(dict1) lens[[1, 'order', 'color', 'class']].to_csv(self.ancestor_file, sep="\t", header=None) # Update peptide sequences with new IDs and save id_dict = df.set_index(6).to_dict()['newname'] seqs = [] for seq_record in SeqIO.parse(self.pep_file, "fasta"): if seq_record.id in id_dict: 
                seq_record.id = id_dict[seq_record.id]
                seqs.append(seq_record)
        SeqIO.write(seqs, self.ancestor_pep, "fasta")

================================================
FILE: wgdi/ancestral_karyotype_repertoire.py
================================================
import numpy as np
import pandas as pd
from Bio import SeqIO

import wgdi.base as base


class ancestral_karyotype_repertoire():
    def __init__(self, options):
        self.gap = 5
        self.direction = 0.01
        self.mark = 'aak1s'
        self.blockinfo_reverse = False
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)

    def run(self):
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        if self.blockinfo_reverse == True:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        for index, row in bkinfo.iterrows():
            block1, block2 = row['block1'].split('_'), row['block2'].split('_')
            block1, block2 = [int(k) for k in block1], [int(k) for k in block2]
            if int(block1[1])-int(block1[0]) < 0:
                self.direction = -0.01
            for i in range(1, len(block2)):
                if abs(block1[i]-block1[i-1]) == 1 and abs(block2[i]-block2[i-1]) < int(self.gap):
                    gff1_id = gff1[(gff1['chr'] == str(row['chr1'])) & (
                        gff1['order'] == block1[i])].index[0]
                    order = gff1.loc[gff1_id, 'order']
                    gff1_row = gff1.loc[gff1_id, :].copy()
                    for num in range(block2[i-1], block2[i]):
                        order = order + self.direction
                        id = gff2[(gff2['chr'] == str(row['chr2'])) & (gff2['order'] == num)].index[0]
                        gff1_row['order'] = order
                        gff1.loc[id, :] = gff1_row
        df = gff1.copy()
        df = df.sort_values(by=['chr', 'order'])
        # Group by the column name, not a one-element list: with a list key,
        # pandas >= 2.0 yields tuple group names such as ('1',), which would
        # end up inside the generated gene names.
        for name, group in df.groupby('chr'):
            df.loc[group.index, 'order'] = list(range(1, len(group)+1))
            df.loc[group.index, 'newname'] = list(
                [str(self.mark)+str(name)+'g'+str(i).zfill(5) for i in range(1, len(group)+1)])
        df['order'] = df['order'].astype(int)
        df['oldname'] = df.index
        columns = ['chr', 'newname', 'start', 'end', 'strand', 'order', 'oldname']
        df[columns].to_csv(self.ancestor_gff, sep="\t", index=False, header=None)
        lens = df.groupby('chr').max()[['end', 'order']]
        lens['end'] = lens['end'].astype(np.int64)
        lens.to_csv(self.ancestor_lens, sep="\t", header=None)
        ancestor = base.read_classification(self.ancestor)
        for index, row in ancestor.iterrows():
            ancestor.at[index, 1] = 1
            ancestor.at[index, 2] = lens.at[str(row[0]),'order']
        ancestor.to_csv(self.ancestor_new, sep="\t", index=False, header=None)
        id_dict = df['newname'].to_dict()
        seqs = []
        for seq_record in SeqIO.parse(self.ancestor_pep, "fasta"):
            if seq_record.id in id_dict:
                seq_record.id = id_dict[seq_record.id]
            else:
                continue
            seq_record.description = ''
            seqs.append(seq_record)
        SeqIO.write(seqs, self.ancestor_pep_new, "fasta")

================================================
FILE: wgdi/base.py
================================================
import configparser
import hashlib
import os
import re

import matplotlib
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
from Bio import SeqIO

import wgdi


def gen_md5_id(item):
    """Generate MD5 hash for the given item."""
    return hashlib.md5(item.encode('utf-8')).hexdigest()


def config():
    """Read configuration from the example conf.ini file."""
    conf = configparser.ConfigParser()
    conf.read(os.path.join(wgdi.__path__[0], 'example/conf.ini'))
    return conf.items('ini')


def load_conf(file, section):
    """Load configuration items from the specified section."""
    conf = configparser.ConfigParser()
    conf.read(file)
    return conf.items(section)


def rewrite(file, section):
    """Copy the specified section of `file` into the packaged example/conf.ini."""
    conf = configparser.ConfigParser()
    conf.read(file)
    if conf.has_section(section):
        for k in conf.sections():
            if k != section:
                conf.remove_section(k)
        # Use a context manager so the file handle is flushed and closed.
        with open(os.path.join(wgdi.__path__[0], 'example/conf.ini'), 'w') as fh:
            conf.write(fh)
        print('Option ini has been modified')
    else:
        print('Option ini no change')


def read_colinearscan(file):
    """Read colinearscan output and parse into data structure."""
    data, b, flag, num = [], [], 0, 1
    with open(file) as f:
        for line in f:
            line = line.strip()
            if re.match(r"the", line):
                num = re.search(r'\d+', line).group()
                b = []
                flag = 1
                continue
            if re.match(r"\>LOCALE", line):
                flag = 0
                p = re.split(':', line)
                if b:
                    data.append([num, b, p[1]])
                b = []
                continue
            if flag == 1:
                a = re.split(r"\s", line)
                b.append(a)
        if b:
            data.append([num, b, p[1]])
    return data


def read_mcscanx(fn):
    """Read mcscanx output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        flag, num = 0, 0
        for line in f1:
            line = line.strip()
            if re.match(r"## Alignment", line):
                flag = 1
                if not b:
                    arr = re.findall(r"[\d+\.]+", line)[0]
                    continue
                data.append([num, b, 0])
                b = []
                num = re.findall(r"\d+", line)[0]
                continue
            if flag == 0:
                continue
            a = re.split(r"\:", line)
            c = re.split(r"\s+", a[1])
            b.append([c[1], c[1], c[2], c[2]])
        if b:
            data.append([num, b, 0])
    return data


def read_jcvi(fn):
    """Read jcvi output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        num = 1
        for line in f1:
            line = line.strip()
            if re.match(r"###", line):
                if b:
                    data.append([num, b, 0])
                    b = []
                    num += 1
                continue
            a = re.split(r"\t", line)
            b.append([a[0], a[0], a[1], a[1]])
        if b:
            data.append([num, b, 0])
    return data


def read_collinearity(fn):
    """Read collinearity output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        flag, arr = 0, []
        for line in f1:
            line = line.strip()
            if re.match(r"# Alignment", line):
                flag = 1
                if not b:
                    arr = re.findall(r'[\.\d+]+', line)
                    continue
                data.append([arr[0], b, arr[2]])
                b = []
                arr = re.findall(r'[\.\d+]+', line)
                continue
            if flag == 0:
                continue
            b.append(re.split(r"\s", line))
        if b:
            data.append([arr[0], b, arr[2]])
    return data


def read_ks(file, col):
    """Read KS values from file and select specified column."""
    ks = pd.read_csv(file, sep='\t')
    ks.drop_duplicates(subset=['id1', 'id2'], keep='first', inplace=True)
    ks[col] = ks[col].astype(float)
    ks = ks[ks[col] >= 0]
    ks.index = ks['id1'] + ',' + ks['id2']
    return ks[col]


def get_median(data):
    """Calculate the median of the data list."""
    if not data:
        return 0
    data_sorted = sorted(data)
    half = len(data_sorted) // 2
    return (data_sorted[half] + data_sorted[-(half + 1)]) / 2


def cds_to_pep(cds_file, pep_file, fmt='fasta'):
    """Translate CDS sequences to peptide sequences and write to file."""
    records = list(SeqIO.parse(cds_file, fmt))
    for rec in records:
        rec.seq = rec.seq.translate()
    SeqIO.write(records, pep_file, 'fasta')
    return True


def newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):
    """Filter BLAST results based on score, evalue, and gene locations."""
    blast = pd.read_csv(file, sep="\t", header=None)
    # Callers may pass either a bool (already converted with str_to_bool)
    # or the raw config string, so normalise before testing.
    if str_to_bool(reverse):
        blast[[0, 1]] = blast[[1, 0]]
    blast = blast[(blast[11] >= score) & (blast[10] < evalue) & (blast[1] != blast[0])]
    blast = blast[(blast[0].isin(gene_loc1.index)) & (blast[1].isin(gene_loc2.index))]
    blast.drop_duplicates(subset=[0, 1], keep='first', inplace=True)
    blast[0] = blast[0].astype(str)
    blast[1] = blast[1].astype(str)
    return blast


def newgff(file):
    """Read GFF file and rename columns with appropriate data types."""
    gff = pd.read_csv(file, sep="\t", header=None, index_col=1)
    gff.rename(columns={0: 'chr', 2: 'start', 3: 'end', 4: 'strand', 5: 'order'}, inplace=True)
    gff['chr'] = gff['chr'].astype(str)
    gff['start'] = gff['start'].astype(np.int64)
    gff['end'] = gff['end'].astype(np.int64)
    gff['strand'] = gff['strand'].astype(str)
    gff['order'] = gff['order'].astype(int)
    return gff


def newlens(file, position):
    """Read lens file and select position based on 'order' or 'end'."""
    lens = pd.read_csv(file, sep="\t", header=None, index_col=0)
    lens.index = lens.index.astype(str)
    if position == 'order':
        lens = lens[2]
    elif position == 'end':
        lens = lens[1]
    return lens


def read_classification(file):
    """Read classification data and convert columns to appropriate types."""
    classification = pd.read_csv(file, sep="\t", header=None)
    classification[0] = classification[0].astype(str)
    classification[1] = classification[1].astype(int)
    classification[2] = classification[2].astype(int)
    classification[3] = classification[3].astype(str)
    classification[4] = classification[4].astype(int)
    return classification


def gene_location(gff, lens, step, position):
    """Calculate gene locations based on lens and step."""
    gff = gff[gff['chr'].isin(lens.index)].copy()
    if gff.empty:
        print('Stopped! \n\nChromosomes in gff file and lens file do not correspond.')
        exit(0)
    dict_chr = dict(zip(lens.index, np.append(np.array([0]), lens.cumsum()[:-1].values)))
    gff['loc'] = ''
    for name, group in gff.groupby('chr'):
        gff.loc[group.index, 'loc'] = (dict_chr[name] + group[position]) * step
    return gff


def dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, genome2_name, arr, pad = 0):
    """Set up the dotplot frame with grid lines and labels."""
    for k in lens1.cumsum()[:-1] * step1:
        ax.axhline(y=k, alpha=0.8, color='black', lw=0.5)
    for k in lens2.cumsum()[:-1] * step2:
        ax.axvline(x=k, alpha=0.8, color='black', lw=0.5)
    align = dict(family='DejaVu Sans', style='italic', horizontalalignment="center", verticalalignment="center")
    yticks = lens1.cumsum() * step1 - 0.5 * lens1 * step1
    ax.set_yticks(yticks)
    ax.set_yticklabels(lens1.index, fontsize = 13, family='DejaVu Sans', style='normal')
    ax.tick_params(axis='y', which='major', pad = pad)
    ax.tick_params(axis='x', which='major', pad = pad)
    xticks = lens2.cumsum() * step2 - 0.5 * lens2 * step2
    ax.set_xticks(xticks)
    ax.set_xticklabels(lens2.index, fontsize = 13, family='DejaVu Sans', style='normal')
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    if arr[0] <= 0:
        ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)
    else:
        ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)
    if arr[1] < 0:
        ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)
    else:
        ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)


def Bezier3(plist, t):
    """Evaluate a quadratic Bezier curve (three control points) at t."""
    p0, p1, p2 = plist
    return p0 * (1 - t) ** 2 + 2 * p1 * t * (1 - t) + p2 * t ** 2


def Bezier4(plist, t):
    """Evaluate a quartic Bezier curve (five control points) at t."""
    p0, p1, p2, p3, p4 = plist
    return p0 * (1 - t) ** 4 + 4 * p1 * t * (1 - t) ** 3 + 6 * p2 * t ** 2 * (1 - t) ** 2 + 4 * p3 * (1 - t) * t ** 3 + p4 * t ** 4


def Rectangle(ax, loc, height, width, color, alpha):
    """Draw a rectangle on the axes with specified properties."""
    p = mpatches.Rectangle(loc, width, height, edgecolor=None, facecolor=color, alpha=alpha)
    ax.add_patch(p)


def str_to_bool(s):
    if isinstance(s, bool):
        return s
    return str(s).strip().lower() == 'true'

================================================
FILE: wgdi/block_correspondence.py
================================================
import re

import numpy as np
import pandas as pd

import wgdi.base as base


class block_correspondence():
    def __init__(self, options):
        # Default values
        self.tandem = True
        self.pvalue = 0.2
        self.position = 'order'
        self.block_length = 5
        self.tandem_length = 200
        self.tandem_ratio = 1
        self.ks_hit = 0.5

        # Set user-defined options
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)

        # Parse ks_area and homo if present
        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
        self.homo = [float(k) for k in self.homo.split(',')]
        self.tandem_ratio = float(self.tandem_ratio)
        # Options arrive as strings, so convert before the numeric comparison in remove_ks_hit
        self.ks_hit = float(self.ks_hit)
        self.tandem = base.str_to_bool(self.tandem)

    def run(self):
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)

        # Load block information from CSV
        bkinfo = pd.read_csv(self.blockinfo)
        bkinfo = self.preprocess_blockinfo(bkinfo, lens1, lens2)

        # Initialize correspondence DataFrame
        cor = self.initialize_correspondence(lens1, lens2)

        # If no tandem allowed, remove tandem regions
        if not self.tandem:
            bkinfo = self.remove_tandem(bkinfo)

        # Remove low KS hits
        bkinfo = self.remove_ks_hit(bkinfo)

        # Find
collinearity regions and save results collinear_indices = self.collinearity_region(cor, bkinfo, lens1) bkinfo.loc[bkinfo.index.isin(collinear_indices), :].to_csv(self.savefile, index=False) def preprocess_blockinfo(self, bkinfo, lens1, lens2): bkinfo['chr1'] = bkinfo['chr1'].astype(str) bkinfo['chr2'] = bkinfo['chr2'].astype(str) # Filter by length, chromosome indices, and p-value bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & (bkinfo['chr1'].isin(lens1.index)) & (bkinfo['chr2'].isin(lens2.index)) & (bkinfo['pvalue'] <= float(self.pvalue))] # Filter by tandem ratio if the column exists if 'tandem_ratio' in bkinfo.columns: bkinfo = bkinfo[bkinfo['tandem_ratio'] <= self.tandem_ratio] return bkinfo def initialize_correspondence(self, lens1, lens2): # Create correspondence DataFrame with initial values cor = [[k, i, 0, lens1[i], j, 0, lens2[j], float(self.homo[0]), float(self.homo[1])] for k in range(1, int(self.multiple) + 1) for i in lens1.index for j in lens2.index] cor = pd.DataFrame(cor, columns=['sub', 'chr1', 'start1', 'end1', 'chr2', 'start2', 'end2', 'homo1', 'homo2']) cor['chr1'] = cor['chr1'].astype(str) cor['chr2'] = cor['chr2'].astype(str) return cor def remove_tandem(self, bkinfo): # Remove tandem regions from the DataFrame group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy() group['start'] = group['start1'] - group['start2'] group['end'] = group['end1'] - group['end2'] tandem_condition = (group['start'].abs() <= int(self.tandem_length)) | (group['end'].abs() <= int(self.tandem_length)) index_to_remove = group[tandem_condition].index return bkinfo.drop(index_to_remove) def remove_ks_hit(self, bkinfo): # Remove records with insufficient KS hits for index, row in bkinfo.iterrows(): ks = self.get_ks_value(row['ks']) ks_ratio = len([k for k in ks if self.ks_area[0] <= k <= self.ks_area[1]]) / len(ks) if ks_ratio < self.ks_hit: bkinfo.drop(index, inplace=True) return bkinfo def get_ks_value(self, ks_str): # Extract and return KS values as 
floats ks = ks_str.split('_') ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks)) return ks def collinearity_region(self, cor, bkinfo, lens): collinear_indices = [] for (chr1, chr2), group in bkinfo.groupby(['chr1', 'chr2']): group = group.sort_values(by=['length'], ascending=False) df = pd.Series(0, index=range(1, int(lens[str(chr1)]) + 1)) for index, row in group.iterrows(): # Check homology conditions if not self.is_valid_homo(row): continue # Update the block series and compute ratio b1 = [int(k) for k in row['block1'].split('_')] df1 = df.copy() df1[b1] += 1 ratio = (len(df1[df1 > 0]) - len(df[df > 0])) / len(b1) if ratio < 0.5: continue df[b1] += 1 collinear_indices.append(index) return collinear_indices def is_valid_homo(self, row): # Check if the homology values are within the specified range return self.homo[0] <= row['homo' + self.multiple] <= self.homo[1] ================================================ FILE: wgdi/block_info.py ================================================ import numpy as np import pandas as pd import wgdi.base as base class block_info: def __init__(self, options): self.repeat_number = 20 self.ks_col = 'ks_NG86' self.blast_reverse = False for k, v in options: setattr(self, str(k), v) print(f"{k} = {v}") self.repeat_number = int(self.repeat_number) self.blast_reverse = base.str_to_bool(self.blast_reverse) def block_position(self, collinearity, blast, gff1, gff2, ks): data = [] for block in collinearity: blk_homo, blk_ks = [], [] # Skip blocks with missing gene coordinates in GFF files if block[1][0][0] not in gff1.index or block[1][0][2] not in gff2.index: continue # Extract chromosome info chr1, chr2 = gff1.at[block[1][0][0], 'chr'], gff2.at[block[1][0][2], 'chr'] # Extract start and end positions array1, array2 = [float(i[1]) for i in block[1]], [float(i[3]) for i in block[1]] start1, end1 = array1[0], array1[-1] start2, end2 = array2[0], array2[-1] block1, block2 = [], [] for k in block[1]: 
block1.append(int(float(k[1]))) block2.append(int(float(k[3]))) # Check for KS values pair_ks = self.get_ks_value(ks, k) blk_ks.append(pair_ks) # Retrieve blast homo data if k[0]+","+k[2] in blast.index: blk_homo.append(blast.loc[k[0]+","+k[2], [f'homo{i}' for i in range(1, 6)]].values.tolist()) ks_median, ks_average = self.calculate_ks_statistics(blk_ks) homo = self.calculate_homo_statistics(blk_homo) blkks = '_'.join([str(k) for k in blk_ks]) block1 = '_'.join([str(k) for k in block1]) block2 = '_'.join([str(k) for k in block2]) # Calculate tandem ratio tandem_ratio = self.tandem_ratio(blast, gff2, block[1]) # Store the results data.append([ block[0], chr1, chr2, start1, end1, start2, end2, block[2], len(block[1]), ks_median, ks_average, *homo, block1, block2, blkks, tandem_ratio ]) # Create a DataFrame with the results data_df = pd.DataFrame(data, columns=[ 'id', 'chr1', 'chr2', 'start1', 'end1', 'start2', 'end2', 'pvalue', 'length', 'ks_median', 'ks_average', 'homo1', 'homo2', 'homo3', 'homo4', 'homo5', 'block1', 'block2', 'ks', 'tandem_ratio' ]) # Calculate density data_df['density1'] = data_df['length'] / ((data_df['end1'] - data_df['start1']).abs() + 1) data_df['density2'] = data_df['length'] / ((data_df['end2'] - data_df['start2']).abs() + 1) return data_df def get_ks_value(self, ks, k): """Return KS value for the given pair of genes.""" pair = f"{k[0]},{k[2]}" if pair in ks.index: return ks[pair] pair_rev = f"{k[2]},{k[0]}" if pair_rev in ks.index: return ks[pair_rev] return -1 def calculate_ks_statistics(self, blk_ks): """Calculate KS statistics: median and average.""" ks_arr = [k for k in blk_ks if k >= 0] if len(ks_arr) == 0: return -1, -1 ks_median = base.get_median(ks_arr) ks_average = sum(ks_arr) / len(ks_arr) return ks_median, ks_average def calculate_homo_statistics(self, blk_homo): """Calculate homo statistics by averaging across all blocks.""" df = pd.DataFrame(blk_homo) homo = df.mean().values if len(df) > 0 else [-1, -1, -1, -1, -1] return homo 
def blast_homo(self, blast, gff1, gff2, repeat_number): """Assign homo values based on blast data.""" index = [group.sort_values(by=11, ascending=False)[:repeat_number].index.tolist() for name, group in blast.groupby([0])] blast = blast.loc[np.concatenate([k[:repeat_number] for k in index], dtype=object), [0, 1]] blast = blast.assign(homo1=np.nan, homo2=np.nan, homo3=np.nan, homo4=np.nan, homo5=np.nan) # Assign homo values for i in range(1, 6): bluenum = i + 5 redindex = np.concatenate([k[:i] for k in index], dtype=object) blueindex = np.concatenate([k[i:bluenum] for k in index], dtype=object) grayindex = np.concatenate([k[bluenum:repeat_number] for k in index], dtype=object) blast.loc[redindex, f'homo{i}'] = 1 blast.loc[blueindex, f'homo{i}'] = 0 blast.loc[grayindex, f'homo{i}'] = -1 blast['chr1_order'] = blast[0].map(gff1['order']) blast['chr2_order'] = blast[1].map(gff2['order']) return blast def tandem_ratio(self, blast, gff2, block): """Calculate tandem ratio for a block.""" block = pd.DataFrame(block)[[0, 2]].rename(columns={0: 'id1', 2: 'id2'}) block['order2'] = block['id2'].map(gff2['order']) # Filter block_blast data block_blast = blast[(blast[0].isin(block['id1'].values)) & (blast[1].isin(block['id2'].values))].copy() block_blast = pd.merge(block_blast, block, left_on=0, right_on='id1', how='left') block_blast['difference'] = (block_blast['chr2_order'] - block_blast['order2']).abs() # Filter based on difference and calculate ratio block_blast = block_blast[(block_blast['difference'] <= self.repeat_number) & (block_blast['difference'] > 0)] return len(block_blast[0].unique()) / len(block) * len(block_blast) / (len(block) + len(block_blast)) def run(self): """Main function to run the analysis.""" # Initialize required datasets lens1 = base.newlens(self.lens1, self.position) lens2 = base.newlens(self.lens2, self.position) gff1 = base.newgff(self.gff1) gff2 = base.newgff(self.gff2) # Filter GFF files based on chromosome indices gff1 = 
gff1[gff1['chr'].isin(lens1.index)] gff2 = gff2[gff2['chr'].isin(lens2.index)] # Load blast data blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse) blast = self.blast_homo(blast, gff1, gff2, self.repeat_number) blast.index = blast[0] + ',' + blast[1] # Get collinearity data collinearity = self.auto_file(gff1, gff2) # Load ks data if necessary ks = pd.Series([]) if self.ks == 'none' or self.ks == '' or not hasattr(self, 'ks') else base.read_ks(self.ks, self.ks_col) # Get the block position data data = self.block_position(collinearity, blast, gff1, gff2, ks) data['class1'] = 0 data['class2'] = 0 # Save results data.to_csv(self.savefile, index=None) def auto_file(self, gff1, gff2): """Auto-detect and read collinearity file.""" with open(self.collinearity) as f: p = ' '.join(f.readlines()[0:30]) # Handle different file formats if 'path length' in p or 'MAXIMUM GAP' in p: return base.read_colinearscan(self.collinearity) elif 'MATCH_SIZE' in p or '## Alignment' in p: return self.process_mcscanx(gff1, gff2) elif '# Alignment' in p: return base.read_collinearity(self.collinearity) elif '###' in p: return self.process_jcvi(gff1, gff2) def process_mcscanx(self, gff1, gff2): """Process MCScanX format collinearity data.""" col = base.read_mcscanx(self.collinearity) collinearity = [] for block in col: newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index] if newblock: for k in newblock: k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order'] collinearity.append([block[0], newblock, block[2]]) return collinearity def process_jcvi(self, gff1, gff2): """Process JCVI format collinearity data.""" col = base.read_jcvi(self.collinearity) collinearity = [] for block in col: newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index] if newblock: for k in newblock: k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order'] collinearity.append([block[0], newblock, block[2]]) return 
            collinearity


================================================
FILE: wgdi/block_ks.py
================================================
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import wgdi.base as base


class block_ks:
    def __init__(self, options):
        # Default parameters
        self.markersize = 0.8
        self.figsize = 'default'
        self.tandem_length = 200
        self.blockinfo_reverse = False
        self.tandem = False
        # Kept as a string so the default goes through the same parsing as a user-supplied value
        self.area = '0,3'
        self.position = 'order'
        self.ks_col = 'ks_NG86'
        self.pvalue = 0.01

        # Overriding default parameters with options
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")

        # Parsing area as a float list
        self.area = [float(k) for k in str(self.area).split(',')]
        self.markersize = float(self.markersize)
        self.tandem_length = int(self.tandem_length)
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
        # Convert the 'tandem' option; assigning to self.remove_tandem would shadow the method below
        self.tandem = base.str_to_bool(self.tandem)

    def block_position(self, bkinfo, lens1, lens2, step1, step2):
        pos, pairs = [], []

        # Create mappings for chromosome positions
        dict_y_chr = dict(zip(lens1.index, np.append([0], lens1.cumsum()[:-1].values)))
        dict_x_chr = dict(zip(lens2.index, np.append([0], lens2.cumsum()[:-1].values)))

        # Iterate through block information
        for _, row in bkinfo.iterrows():
            block1 = row['block1'].split('_')
            block2 = row['block2'].split('_')
            ks = row['ks'].split('_')
            locy_median = (dict_y_chr[row['chr1']] + 0.5 * (row['end1'] + row['start1'])) * step1
            locx_median = (dict_x_chr[row['chr2']] + 0.5 * (row['end2'] + row['start2'])) * step2
            pos.append([locx_median, locy_median, row['ks_median']])

            # Ensure ks length matches block length
            if len(block1) != len(ks):
                ks = ks[1:]

            for i in range(len(block1)):
                locy = (dict_y_chr[row['chr1']] + float(block1[i])) * step1
                locx = (dict_x_chr[row['chr2']] + float(block2[i])) * step2
                pairs.append([locx, locy, float(ks[i])])

        return pos, pairs

    def remove_tandem(self, bkinfo):
        # Filter for same-chromosome blocks
        group = bkinfo[bkinfo['chr1'] ==
                       bkinfo['chr2']].copy()

        # Calculate block start and end differences
        group['start'] = group['start1'] - group['start2']
        group['end'] = group['end1'] - group['end2']

        # Remove tandems based on threshold
        index = group[(group['start'].abs() <= self.tandem_length) | (group['end'].abs() <= self.tandem_length)].index
        return bkinfo.drop(index)

    def run(self):
        # Initialize axis and chromosome lens
        axis = [0, 1, 1, 0]
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)

        # Parse figsize
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10

        # Calculate step sizes
        step1 = 1 / float(lens1.sum())
        step2 = 1 / float(lens2.sum())

        # Create figure and axes
        fig, ax = plt.subplots(figsize=self.figsize)
        plt.rcParams['ytick.major.pad'] = 0
        ax.xaxis.set_ticks_position('top')

        # Plot dotplot frame
        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, self.genome1_name, self.genome2_name, [0, 1])

        # Load block information
        bkinfo = pd.read_csv(self.blockinfo)

        # Handle reverse block information
        if self.blockinfo_reverse == True:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]

        # Filter block information
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & (bkinfo['chr1'].isin(lens1.index)) & (bkinfo['chr2'].isin(lens2.index)) & (bkinfo['pvalue'] < float(self.pvalue))]

        # Remove tandem duplicates if required
        if self.tandem == False:
            bkinfo = self.remove_tandem(bkinfo)

        # Calculate positions and pairs
        pos, pairs = self.block_position(bkinfo, lens1, lens2, step1, step2)

        # Filter pairs by ks value
        df = pd.DataFrame(pairs, columns=['loc1', 'loc2', 'ks'])
        df = df[(df['ks'] >= self.area[0]) & (df['ks'] <= self.area[1])]
        df.drop_duplicates(inplace=True)

        # Plot scatter
        cm = \
            plt.get_cmap('gist_rainbow')  # plt.cm.get_cmap is no longer available in recent Matplotlib releases
        sc = plt.scatter(df['loc1'], df['loc2'], s=self.markersize, c=df['ks'], alpha=0.9, edgecolors=None, linewidths=0, marker='o', vmin=self.area[0], vmax=self.area[1], cmap=cm)

        # Add colorbar
        cbar = fig.colorbar(sc, shrink=0.5, pad=0.03, fraction=0.1)
        align = dict(family='DejaVu Sans', style='normal', horizontalalignment="center", verticalalignment="center")
        cbar.set_label('Ks', labelpad=12.5, fontsize=16, **align)

        # Set axis and save figure
        ax.axis(axis)
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.03)
        plt.savefig(self.savefig, dpi=500)
        plt.show()


================================================
FILE: wgdi/circos.py
================================================
import re
import sys

import matplotlib as mpl
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import wgdi.base as base


class circos():
    def __init__(self, options):
        self.figsize = '10,10'
        self.position = 'order'
        self.label_size = 9
        self.label_radius = 0.015
        self.column_names = [None]*100
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.ring_width = float(self.ring_width)
        if hasattr(self, 'legend_square'):
            self.legend_square = [float(k) for k in self.legend_square.split(',')]
        else:
            self.legend_square = 0.04, 0.04

    def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, linestyle='-'):
        for k in loc_chr:
            start, end = loc_chr[k]
            t = np.arange(start, end, 0.005)
            x, y = (radius) * np.cos(t), (radius) * np.sin(t)
            plt.plot(x, y, linestyle=linestyle, color=color, lw=lw, alpha=alpha)

    def plot_labels(self, root, labels, loc_chr, radius, horizontalalignment="center", verticalalignment="center", fontsize=6, color='black'):
        for k in loc_chr:
            loc = sum(loc_chr[k]) * 0.5
            x, y = radius * np.cos(loc), radius * np.sin(loc)
            self.Wedge(root, (x, y), self.label_radius, 0, 360, self.label_radius, 'white', 1)
            if 1 * np.pi < loc < 2 * \
                    np.pi:
                loc += np.pi
            plt.text(x, y, labels[k], horizontalalignment=horizontalalignment, verticalalignment=verticalalignment, fontsize=fontsize, color=color, rotation=0)

    def Wedge(self, ax, loc, radius, start, end, width, color, alpha):
        p = mpatches.Wedge(loc, radius, start, end, width=width, edgecolor=None, facecolor=color, alpha=alpha)
        ax.add_patch(p)

    def plot_bar(self, df, radius, length, lw, color, alpha):
        for k in df[df.columns[0]].drop_duplicates().values:
            if str(k) not in color.keys():
                color[str(k)] = 'black'
            if k in ['', np.nan]:
                continue
            df_chr = df.groupby(df.columns[0]).get_group(k)
            x1, y1 = radius * np.cos(df_chr['rad']), radius * np.sin(df_chr['rad'])
            x2, y2 = (radius + length) * np.cos(df_chr['rad']), (radius + length) * np.sin(df_chr['rad'])
            x = np.array([x1.values, x2.values, [np.nan] * x1.size]).flatten('F')
            y = np.array([y1.values, y2.values, [np.nan] * x1.size]).flatten('F')
            plt.plot(x, y, linestyle='-', color=color[str(k)], lw=lw, alpha=alpha)

    def chr_location(self, lens, angle_gap, angle):
        start, end, loc_chr = 0, 0.2*angle_gap, {}
        for k in lens.index:
            end += angle_gap + angle * (float(lens[k]))
            start = end - angle * (float(lens[k]))
            loc_chr[k] = [float(start), float(end)]
        return loc_chr

    def deal_alignment(self, alignment, gff, lens, loc_chr, angle):
        # Strip whitespace inside cells; a raw string with regex=True is needed for '\s+' to act as a pattern
        alignment.replace(r'\s+', '', regex=True, inplace=True)
        alignment.replace('.', '', inplace=True)
        print(alignment.dropna(subset=[2, 3], how='all'))
        # exit(0)
        newalignment = alignment.copy()
        for i in range(len(alignment.columns)):
            alignment[i] = alignment[i].astype(str)
            newalignment[i] = alignment[i].map(gff['chr'].to_dict())
        newalignment['loc'] = alignment[0].map(gff[self.position].to_dict())
        newalignment[0] = newalignment[0].astype('str')
        newalignment['loc'] = newalignment['loc'].astype('float')
        newalignment = newalignment[newalignment[0].isin(lens.index) == True]
        newalignment['rad'] = np.nan
        for name, group in newalignment.groupby(0):
            if str(name) not in loc_chr:
                continue
            newalignment.loc[group.index,
                             'rad'] = loc_chr[str(name)][0]+angle * group['loc']
        print(newalignment.dropna(subset=[2, 3, 4], how='all'))
        return newalignment

    def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al):
        # Strip whitespace inside cells; a raw string with regex=True is needed for '\s+' to act as a pattern
        alignment.replace(r'\s+', '', regex=True, inplace=True)
        alignment.replace('.', np.nan, inplace=True)
        newalignment = pd.merge(alignment, gff, left_on=0, right_on=gff.index)
        newalignment['rad'] = np.nan
        for name, group in newalignment.groupby('chr'):
            if str(name) not in loc_chr:
                continue
            newalignment.loc[group.index, 'rad'] = loc_chr[str(name)][0]+angle * group[self.position]
        newalignment.index = newalignment[0]
        newalignment[0] = newalignment[0].map(newalignment['rad'].to_dict())
        data = []
        for index_al, row_al in al.iterrows():
            for k in alignment.columns[1:]:
                alignment[k] = alignment[k].astype(str)
                group = newalignment[(newalignment['chr'] == row_al['chr']) & (newalignment['order'] >= row_al['start']) & (newalignment['order'] <= row_al['end'])].copy()
                group.loc[:, k] = group.loc[:, k].map(newalignment['rad']).values
                group.dropna(subset=[k], inplace=True)
                group.index = group.index.map(newalignment['rad'].to_dict())
                group['color'] = row_al['color']
                group = group[group[k].notnull()]
                data += group[[0, k, 'color']].values.tolist()
        df = pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])
        return df

    def plot_collinearity(self, data, radius, lw=0.02, alpha=1):
        for name, group in data.groupby('color'):
            x, y = np.array([]), np.array([])
            for index, row in group.iterrows():
                ex1x, ex1y = radius * np.cos(row['loc1']), radius*np.sin(row['loc1'])
                ex2x, ex2y = radius * np.cos(row['loc2']), radius*np.sin(row['loc2'])
                ex3x, ex3y = radius * (1-abs(row['loc1']-row['loc2'])/np.pi) * np.cos((row['loc1']+row['loc2'])*0.5), radius * (1-abs(row['loc1']-row['loc2'])/np.pi) * np.sin((row['loc1']+row['loc2'])*0.5)
                x1 = [ex1x, 0.5*ex3x, ex2x]
                y1 = [ex1y, 0.5*ex3y, ex2y]
                step = .002
                t = np.arange(0, 1+step, step)
                xt = base.Bezier3(x1, t)
                yt = base.Bezier3(y1, t)
                x = np.hstack((x, xt, np.nan))
                y = np.hstack((y,
                               yt, np.nan))
            plt.plot(x, y, color=name, lw=lw, alpha=alpha)

    def plot_legend(self, ax, chr_color, width, height):
        (x1, x2) = ax.get_xlim()
        (y1, y2) = ax.get_ylim()
        a = 1000
        for k, v in enumerate(chr_color.keys(), 0):
            h = y1-k//a*height*2
            k = k % a
            if x1 + width * k > x2-width:
                a = k
                h = y1-k//a*height*2
                k = k % a
            loc = [x1 + width * k, h]
            base.Rectangle(ax, loc, height, width, chr_color[v], 1)
            plt.text(loc[0] + width*0.382, h-0.618*height, v, fontsize=12)
        ax.set_ylim(h-2*height, y2)

    def run(self):
        fig, ax = plt.subplots(figsize=self.figsize)
        mpl.rcParams['agg.path.chunksize'] = 100000000
        lens = base.newlens(self.lens, self.position)
        radius, angle_gap = float(self.radius), float(self.angle_gap)
        angle = (2 * np.pi - (int(len(lens))+1.5) * angle_gap) / (int(lens.sum()))
        loc_chr = self.chr_location(lens, angle_gap, angle)
        list_colors = [str(k).strip() for k in re.split(',|:', self.colors)]
        chr_color = dict(zip(list_colors[::2], list_colors[1::2]))
        gff = base.newgff(self.gff)
        if hasattr(self, 'ancestor'):
            ancestor = pd.read_csv(self.ancestor, header=None)
            al = pd.read_csv(self.ancestor_location, sep='\t', header=None)
            al.rename(columns={0: 'chr', 1: 'start', 2: 'end', 3: 'color'}, inplace=True)
            al['chr'] = al['chr'].astype(str)
            data = self.deal_ancestor(ancestor, gff, lens, loc_chr, angle, al)
            self.plot_collinearity(data, radius, lw=0.1, alpha=0.8)
        if hasattr(self, 'alignment'):
            alignment = pd.read_csv(self.alignment, header=None)
            print(alignment)
            newalignment = self.deal_alignment(alignment, gff, lens, loc_chr, angle)
            if ',' in self.column_names:
                names = [str(k) for k in self.column_names.split(',')]
            else:
                names = [None]*len(newalignment.columns)
            n = 0
            align = dict(family='Arial', verticalalignment="center", horizontalalignment="center")
            print(newalignment)
            for k, v in enumerate(newalignment.columns[1:-2]):
                r = radius + self.ring_width*(k+1)
                print(k, v, r)
                self.plot_circle(loc_chr, r, lw=0.5, alpha=1, color='grey')
                self.plot_bar(newalignment[[v, 'rad']], r +
                              self.ring_width * 0.15, self.ring_width*0.7, 0.15, chr_color, 1)
                if n % 2 == 0:
                    loc = 0.05
                    x, y = (r+self.ring_width*0.5) * np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)
                    plt.text(x, y, names[n], rotation=loc * 180 / np.pi, fontsize=self.label_size, **align)
                else:
                    loc = -0.08
                    x, y = (r+self.ring_width*0.5) * np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)
                    plt.text(x, y, names[n], fontsize=self.label_size, rotation=loc * 180 / np.pi, **align)
                n += 1
        if hasattr(self, 'ancestor'):
            colors = al['color'].drop_duplicates().values.tolist()
            ancestor_chr_color = dict(zip(range(1, len(colors)+1), colors))
            self.plot_legend(ax, ancestor_chr_color, self.legend_square[0], self.legend_square[1])
        if hasattr(self, 'alignment'):
            # pop() avoids a KeyError when no 'nan' entry was added by plot_bar
            chr_color.pop('nan', None)
            self.plot_legend(ax, chr_color, self.legend_square[0], self.legend_square[1])
        labels = self.chr_label + lens.index
        labels = dict(zip(lens.index, labels))
        self.plot_labels(ax, labels, loc_chr, radius + self.ring_width*0.3, fontsize=self.label_size)
        plt.axis('off')
        a = (ax.get_ylim()[1]-ax.get_ylim()[0]) / (ax.get_xlim()[1]-ax.get_xlim()[0])
        fig.set_size_inches(self.figsize[0], self.figsize[0]*a, forward=True)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)


================================================
FILE: wgdi/collinearity.py
================================================
import numpy as np
import pandas as pd


class collinearity:
    def __init__(self, options, points):
        # Default values
        self.gap_penalty = -1
        self.over_length = 0
        self.mg1 = 40
        self.mg2 = 40
        self.pvalue = 1
        self.over_gap = 3
        self.points = points
        self.p_value = 0
        self.coverage_ratio = 0.8

        # Set user-defined options
        for k, v in options:
            setattr(self, str(k), v)

        # Initialize grading and mg values
        self.grading = [50, 40, 25] if not hasattr(self, 'grading') else [int(k) for k in self.grading.split(',')]
        self.mg1, self.mg2 = [40, 40] if not hasattr(self, 'mg') else [int(k) for k in self.mg.split(',')]

        # Convert string values to floats
        self.pvalue = \
            float(self.pvalue)
        self.coverage_ratio = float(self.coverage_ratio)

    def get_matrix(self):
        """Initialize the matrix for the collinearity points."""
        self.points['usedtimes1'] = 0
        self.points['usedtimes2'] = 0
        self.points['times'] = 1
        self.points['score1'] = self.points['grading']
        self.points['score2'] = self.points['grading']
        self.points['path1'] = self.points.index.to_numpy().reshape(len(self.points), 1).tolist()
        self.points['path2'] = self.points['path1']
        self.points_init = self.points.copy()
        self.mat_points = self.points

    def run(self):
        """Run the main collinearity processing."""
        self.get_matrix()
        self.score_matrix()
        data = []

        # Process points for maxPath in the positive direction
        points1 = self.points[['loc1', 'loc2', 'score1', 'path1', 'usedtimes1']].sort_values(by=['score1'], ascending=False)
        points1.drop(index=points1[points1['usedtimes1'] < 1].index, inplace=True)
        points1.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']
        while (self.over_length >= self.over_gap or len(points1) >= self.over_gap):
            if self.max_path(points1):
                if self.p_value > self.pvalue:
                    continue
                data.append([self.path, self.p_value, self.score])

        # Process points for maxPath in the negative direction
        points2 = self.points[['loc1', 'loc2', 'score2', 'path2', 'usedtimes2']].sort_values(by=['score2'], ascending=False)
        points2.drop(index=points2[points2['usedtimes2'] < 1].index, inplace=True)
        points2.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']
        while (self.over_length >= self.over_gap) or (len(points2) >= self.over_gap):
            if self.max_path(points2):
                if self.p_value > self.pvalue:
                    continue
                data.append([self.path, self.p_value, self.score])

        return data

    def score_matrix(self):
        """Calculate the scoring matrix for the points."""
        for index, row, col in self.points[['loc1', 'loc2']].itertuples():
            # Get points within a certain range
            points = self.points[(self.points['loc1'] > row) & (self.points['loc2'] > col) & (self.points['loc1'] < row + self.mg1) & (self.points['loc2'] < col +
                      self.mg2)]
            row_i_old, gap = row, self.mg2
            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():
                if col_j - col > gap and row_i > row_i_old:
                    break
                score = grading + (row_i - row + col_j - col) * self.gap_penalty
                score1 = score + self.points.at[index, 'score1']
                if score > 0 and self.points.at[index_ij, 'score1'] < score1:
                    self.points.at[index_ij, 'score1'] = score1
                    self.points.at[index, 'usedtimes1'] += 1
                    self.points.at[index_ij, 'usedtimes1'] += 1
                    self.points.at[index_ij, 'path1'] = self.points.at[index, 'path1'] + [index_ij]
                    gap = min(col_j - col, gap)
                    row_i_old = row_i

        # Reverse processing to handle negative direction
        points_reverse = self.points.sort_values(by=['loc1', 'loc2'], ascending=[False, True])
        for index, row, col in points_reverse[['loc1', 'loc2']].itertuples():
            points = points_reverse[(points_reverse['loc1'] < row) & (points_reverse['loc2'] > col) & (points_reverse['loc1'] > row - self.mg1) & (points_reverse['loc2'] < col + self.mg2)]
            row_i_old, gap = row, self.mg2
            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():
                if col_j - col > gap and row_i < row_i_old:
                    break
                score = grading + (row - row_i + col_j - col) * self.gap_penalty
                score2 = score + self.points.at[index, 'score2']
                if score > 0 and self.points.at[index_ij, 'score2'] < score2:
                    self.points.at[index_ij, 'score2'] = score2
                    self.points.at[index, 'usedtimes2'] += 1
                    self.points.at[index_ij, 'usedtimes2'] += 1
                    self.points.at[index_ij, 'path2'] = self.points.at[index, 'path2'] + [index_ij]
                    gap = min(col_j - col, gap)
                    row_i_old = row_i

    def max_path(self, points):
        """Find the maximum path for the given points."""
        if len(points) == 0:
            self.over_length = 0
            return False

        # Initialize path score and index
        self.score, self.path_index = points.loc[points.index[0], ['score', 'path']]
        self.path = points[points.index.isin(self.path_index)]
        self.over_length = len(self.path_index)

        # Check if the block overlaps with other blocks
        if \
                self.over_length >= self.over_gap and len(self.path) / self.over_length > self.coverage_ratio:
            points.drop(index=self.path.index, inplace=True)
            [loc1_min, loc2_min], [loc1_max, loc2_max] = self.path[['loc1', 'loc2']].agg(['min', 'max']).to_numpy()

            # Calculate p-value
            gap_init = self.points_init[(loc1_min <= self.points_init['loc1']) & (self.points_init['loc1'] <= loc1_max) & (loc2_min <= self.points_init['loc2']) & (self.points_init['loc2'] <= loc2_max)].copy()
            self.p_value = self.p_value_estimated(gap_init, loc1_max - loc1_min + 1, loc2_max - loc2_min + 1)
            self.path = self.path.sort_values(by=['loc1'], ascending=[True])[['loc1', 'loc2']]
            return True
        else:
            points.drop(index=points.index[0], inplace=True)
            return False

    def p_value_estimated(self, gap, L1, L2):
        """Estimate p-value based on the given gap and lengths."""
        N1 = gap['times'].sum()
        N = len(gap)
        self.points_init.loc[gap.index, 'times'] += 1
        m = len(self.path)
        a = (1 - self.score / m / self.grading[0]) * (N1 - m + 1) / N * (L1 - m + 1) * (L2 - m + 1) / L1 / L2
        return round(a, 4)


================================================
FILE: wgdi/dotplot.py
================================================
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import wgdi.base as base


class dotplot():
    def __init__(self, options):
        self.multiple = 1
        self.score = 100
        self.evalue = 1e-5
        self.repeat_number = 20
        self.markersize = 0.5
        self.figsize = 'default'
        self.position = 'order'
        self.ancestor_top = None
        self.ancestor_left = None
        self.blast_reverse = False
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        if self.ancestor_top == 'none' or self.ancestor_top == '':
            self.ancestor_top = None
        if self.ancestor_left == 'none' or self.ancestor_left == '':
            self.ancestor_left = None
        # Store the converted value; the bare call discarded the result
        self.blast_reverse = base.str_to_bool(self.blast_reverse)

    def pair_positon(self, blast, gff1, gff2, rednum, repeat_number):
        blast['color'] = ''
        blast['loc1'] = blast[0].map(gff1['loc'])
        blast['loc2'] = blast[1].map(gff2['loc'])
        bluenum = 5+rednum
        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist() for name, group in blast.groupby([0])]
        reddata = np.array([k[:rednum] for k in index], dtype=object)
        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)
        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)
        if len(reddata):
            redindex = np.concatenate(reddata)
        else:
            redindex = []
        if len(bluedata):
            blueindex = np.concatenate(bluedata)
        else:
            blueindex = []
        if len(graydata):
            grayindex = np.concatenate(graydata)
        else:
            grayindex = []
        blast.loc[redindex, 'color'] = 'red'
        blast.loc[blueindex, 'color'] = 'blue'
        blast.loc[grayindex, 'color'] = 'gray'
        return blast[blast['color'].str.contains(r'\w')]

    def run(self):
        axis = [0, 1, 1, 0]
        left, right, top, bottom = 0.07, 0.97, 0.93, 0.03
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        step1 = 1 / float(lens1.sum())
        step2 = 1 / float(lens2.sum())
        if self.ancestor_left != None:
            axis[0] = -0.02
            lens_ancestor_left = pd.read_csv(self.ancestor_left, sep="\t", header=None)
            lens_ancestor_left[0] = lens_ancestor_left[0].astype(str)
            lens_ancestor_left[3] = lens_ancestor_left[3].astype(str)
            lens_ancestor_left[4] = lens_ancestor_left[4].astype(int)
            lens_ancestor_left[4] = lens_ancestor_left[4] / lens_ancestor_left[4].max()
            lens_ancestor_left = lens_ancestor_left[lens_ancestor_left[0].isin(lens1.index)]
        if self.ancestor_top != None:
            axis[3] = -0.02
            lens_ancestor_top = pd.read_csv(self.ancestor_top, sep="\t", header=None)
            lens_ancestor_top[0] = lens_ancestor_top[0].astype(str)
            lens_ancestor_top[3] = lens_ancestor_top[3].astype(str)
            lens_ancestor_top[4] = lens_ancestor_top[4].astype(int)
            lens_ancestor_top[4] = lens_ancestor_top[4] / lens_ancestor_top[4].max()
            lens_ancestor_top = lens_ancestor_top[lens_ancestor_top[0].isin(lens2.index)]
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array([1, float(lens1.sum())/float(lens2.sum())])*10
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        ax.xaxis.set_ticks_position('top')
        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, self.genome1_name, self.genome2_name, [axis[0], axis[3]])
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        gff1 = base.gene_location(gff1, lens1, step1, self.position)
        gff2 = base.gene_location(gff2, lens2, step2, self.position)
        if self.ancestor_top != None:
            self.aree_left = self.ancestor_posion(ax, gff2, lens_ancestor_top, 'top')
        if self.ancestor_left != None:
            self.aree_top = self.ancestor_posion(ax, gff1, lens_ancestor_left, 'left')
        print('read gffs')
        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)
        if len(blast) == 0:
            print('Stopped! \n\nThe gene ids in the blast file do not correspond to gff1 and gff2.')
            exit(0)
        print('read blast')
        df = self.pair_positon(blast, gff1, gff2, int(self.multiple), int(self.repeat_number))
        print('deal blast')
        ax.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'], alpha=0.5, edgecolors=None, linewidths=0, marker='o')
        ax.axis(axis)
        plt.subplots_adjust(left=left, right=right, top=top, bottom=bottom)
        plt.savefig(self.savefig, dpi=300)
        plt.show()

    def ancestor_posion(self, ax, gff, lens, mark):
        data = []
        for index, row in lens.iterrows():
            loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index
            loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2])-1)].index
            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']
            if mark == 'top':
                width = abs(loc1-loc2)
                loc = [min(loc1, loc2), 0]
                height = -0.02
                base.Rectangle(ax, loc, height, width, row[3], row[4])
            if mark == 'left':
                height = abs(loc1-loc2)
                loc = [-0.02, min(loc1, loc2), ]
                width = 0.02
                base.Rectangle(ax, loc, height, width, row[3], row[4])
            data.append([loc, height, width, row[3], row[4]])
        return data
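
The red/blue/gray colouring done in pair_positon above ranks each query gene's hits by bit score: the best `multiple` hits are red, the next five are blue, and the remainder up to `repeat_number` are gray. The following is a minimal, dependency-free sketch of that ranking only. It is not part of wgdi; the function name `color_hits`, the `blue_extra` parameter, and the `(query, target, bitscore)` tuple input are assumptions made for illustration.

```python
from collections import defaultdict


def color_hits(hits, rednum=1, repeat_number=20, blue_extra=5):
    """Label each query's hits by score rank (hypothetical helper, not wgdi API).

    hits: iterable of (query, target, bitscore) tuples.
    Returns a list of (query, target, color) for the hits that are kept.
    """
    # Group hits by query gene, keeping first-seen query order.
    by_query = defaultdict(list)
    for query, target, bitscore in hits:
        by_query[query].append((bitscore, target))

    colored = []
    for query, scored in by_query.items():
        # Best bit score first; list.sort is stable, so ties keep input order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Only the top `repeat_number` hits per query are kept at all.
        for rank, (bitscore, target) in enumerate(scored[:repeat_number]):
            if rank < rednum:
                color = 'red'
            elif rank < rednum + blue_extra:
                color = 'blue'
            else:
                color = 'gray'
            colored.append((query, target, color))
    return colored


if __name__ == '__main__':
    demo = [('a', f't{i}', 8 - i) for i in range(8)] + [('b', 'u0', 50)]
    for record in color_hits(demo, rednum=1, repeat_number=7):
        print(record)
```

Hits ranked beyond `repeat_number` are dropped, which corresponds to the rows whose `color` stays empty and are filtered out at the end of pair_positon.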


================================================
FILE: wgdi/example/__init__.py
================================================


================================================
FILE: wgdi/example/align.conf
================================================
[alignment]
blockinfo = block information file (.csv)
blockinfo_reverse = false
classid = class1
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
genome1_name = Genome1 name
genome2_name = Genome2 name
markersize = 0.5
ks_area = -1,3
position = order
colors = red,blue,green
figsize = 10,10
savefile = savefile(.csv)
savefig = save image(.png, .pdf, .svg)


================================================
FILE: wgdi/example/alignmenttrees.conf
================================================
[alignmenttrees]
alignment = alignment file (.csv)
gff = gff file (reference genome, If alignment has no reference species, delete it)
lens = lens file (If alignment has no reference species, delete it)
dir = output folder
sequence_file = sequence file (.fa)
cds_file = cds file (.fa)
codon_positon = 1,2,3 (1,2 mean codon1&2; 1,2,3 mean no codon removed)
trees_file = trees (.nwk)
align_software = (mafft,muscle)
tree_software = (iqtree,fasttree)
threads = 1 (Number,AUTO)
model = MFP
trimming = (trimal,divvier)
minimum = 4
delete_detail = true


================================================
FILE: wgdi/example/ancestral_karyotype.conf
================================================
[ancestral_karyotype]
gff = gff file (cat the relevant 'gff' files into a file)
pep_file = pep file (cat the relevant 'pep.fa' files into a file)
ancestor = ancestor file (this file requires you to provide)
mark = aak
ancestor_gff = result file
ancestor_lens = result file
ancestor_pep = result file
ancestor_file = result file


================================================
FILE: wgdi/example/ancestral_karyotype_repertoire.conf
================================================
[ancestral_karyotype_repertoire]
blockinfo = block information (*.csv)
# blockinfo: processed *.csv
blockinfo_reverse = False
gff1 = gff1 file (ancestor's gff)
gff2 = gff2 file (the other species's gff)
gap = 5
mark = aak1s
ancestor = ancestor file #current ancestor file
ancestor_new = result file
ancestor_pep = ancestor pep file #cat all pep files together
ancestor_pep_new = result file
ancestor_gff = result file
ancestor_lens = result file


================================================
FILE: wgdi/example/blockinfo.conf
================================================
[blockinfo]
blast = blast file
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
collinearity = collinearity file
score = 100
evalue = 1e-5
repeat_number = 20
position = order
ks = ks file
ks_col = ks_NG86
savefile = block information (*.csv)


================================================
FILE: wgdi/example/blockks.conf
================================================
[blockks]
lens1 = lens1 file
lens2 = lens2 file
genome1_name = Genome1 name
genome2_name = Genome2 name
blockinfo = block information (*.csv)
pvalue = 0.2
tandem = true
tandem_length = 200
markersize = 1
area = 0,2
block_length = minimum length
figsize = 8,8
savefig = save image(.png, .pdf, .svg)


================================================
FILE: wgdi/example/circos.conf
================================================
[circos]
gff = gff file
lens = lens file
radius = 0.2
angle_gap = 0.05
ring_width = 0.015
colors = 1:c,2:m,3:blue,4:gold,5:red,6:lawngreen,7:darkgreen,8:k,9:darkred,10:gray
alignment = alignment file
chr_label = chr
ancestor = ancestor alignment file
ancestor_location = ancestor file
figsize = 10,10
label_size = 9
position = order
legend_square = 0.04, 0.04
column_names = 1,2,3,4,5
savefig = result(.png, .pdf, .svg)


================================================
FILE: wgdi/example/collinearity.conf
================================================
[collinearity]
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
blast = blast file
blast_reverse = false
comparison = genomes
multiple = 1
process = 8
evalue = 1e-5
score = 100
grading = 50,30,25
mg = 25,25
pvalue = 1
repeat_number = 20
positon = order
savefile = collinearity file


================================================
FILE: wgdi/example/conf.ini
================================================
[ini]
mafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft
pal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2nal.pl
yn00_path = /home/sunpc/micromamba/envs/wgdi/bin/yn00
muscle_path = /home/sunpc/micromamba/envs/wgdi/bin/muscle
iqtree_path = /home/sunpc/micromamba/envs/wgdi/bin/iqtree
trimal_path = /home/sunpc/micromamba/envs/wgdi/bin/trimal
fasttree_path = /home/sunpc/micromamba/envs/wgdi/bin/fasttree
divvier_path = /home/sunpc/micromamba/envs/wgdi/bin/divvier


================================================
FILE: wgdi/example/corr.conf
================================================
[correspondence]
blockinfo = blockinfo file(.csv)
lens1 = lens1 file
lens2 = lens2 file
tandem = true
tandem_length = 200
pvalue = 0.2
block_length = 5
tandem_ratio = 0.5
multiple = 1
homo = -1,1
savefile = savefile(.csv)


================================================
FILE: wgdi/example/dotplot.conf
================================================
[dotplot]
blast = blast file
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
genome1_name = Genome1 name
genome2_name = Genome2 name
multiple = 1
score = 100
evalue = 1e-5
repeat_number = 10
position = order
blast_reverse = false
ancestor_left = ancestor file or none
ancestor_top = ancestor file or none
markersize = 0.5
figsize = 10,10
savefig = savefile(.png, .pdf, .svg)


================================================
FILE: wgdi/example/fusion_positions_database.conf
================================================
[fusion_positions_database]
pep = pep file
gff = gff file
fusion_positions = fusion_positions file
# Number of gene sets on each side of the breakpoint
ancestor_gff = result file
ancestor_lens = result file
ancestor_pep = result file
ancestor_file = result file


================================================
FILE: wgdi/example/fusions_detection.conf
================================================
[fusions_detection]
blockinfo = block information (*.csv)
ancestor = ancestor file
#The number of genes spanned by a synteny block on both sides of a breakpoint.
min_genes_per_side = 5
density = 0.3
filtered_blockinfo = result blockinfo (.csv)


================================================
FILE: wgdi/example/karyotype.conf
================================================
[karyotype]
ancestor = ancestor chromosome file
width = 0.5
figsize = 10,6.18
savefig = save image(.png, .pdf, .svg)


================================================
FILE: wgdi/example/karyotype_mapping.conf
================================================
[karyotype_mapping]
blast = blast file
blast_reverse = false
gff1 = gff1 file
gff2 = gff2 file
score = 100
evalue = 1e-5
repeat_number = 5
ancestor_left = ancestor location file (keep only one of ancestor_left and ancestor_top)
ancestor_top = ancestor location file
the_other_lens = the other lens file
blockinfo = block information (*.csv)
blockinfo_reverse = false
limit_length = 5
the_other_ancestor_file = result file


================================================
FILE: wgdi/example/ks.conf
================================================
[ks]
cds_file = cds file #cat all cds files together
pep_file = pep file #cat all pep files together
align_software = muscle
pairs_file = gene pairs file
ks_file = ks result


================================================
FILE: wgdi/example/ks_fit_result.csv
================================================
,color,linewidth,linestyle,,,,,,
csa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862
vvi_vvi,blue,2,-,3.00367275,1.288717936,0.177816426,,,
vvi_oin_gamma,orange,2,-,1.910418336,1.328469514,0.262257112,,,
vvi_oin,orange,2,--,4.948194212,0.882608858,0.10426873,,,
vvi_csa,green,2,--,2.470770292464022,1.4131842495219498,0.21391959288821544,,,


================================================
FILE: wgdi/example/ksfigure.conf
================================================
[ksfigure]
ksfit = ksfit result(*.csv)
labelfontsize = 15
legendfontsize = 15
xlabel = none
ylabel = none
title = none
area = 0,2
figsize = 10,6.18
shadow = true (true/false)
savefig = save image(.png, .pdf, .svg)


================================================
FILE: wgdi/example/kspeaks.conf
================================================
[kspeaks]
blockinfo = block information (*.csv)
pvalue = 0.2
tandem = true
block_length = int number
ks_area = 0,10
multiple = 1
homo = 0,1
fontsize = 9
area = 0,3
figsize = 10,6.18
savefig = saving image(.png,.pdf)
savefile = ks median savefile


================================================
FILE: wgdi/example/peaksfit.conf
================================================
[peaksfit]
blockinfo = block information (*.csv)
mode = median
bins_number = 200
ks_area = 0,10
fontsize = 9
area = 0,3
figsize = 10,6.18
shadow = true
savefig = saving image(.png,.pdf,.svg)


================================================
FILE: wgdi/example/pindex.conf
================================================
[pindex]
alignment = alignment file (.csv)
gff = gff file
lens = lens file
gap = 50
retention = 0.05
diff = 0.05
remove_delta = (true/false)
savefile = result file(.csv)


================================================
FILE: wgdi/example/polyploidy_classification.conf
================================================
[polyploidy classification]
blockinfo = block information (*.csv)
ancestor_left = ancestor file
ancestor_top = ancestor file
classid = class1,class2
same_protochromosome = False
same_subgenome = False
savefile = result file(.csv)


================================================
FILE: wgdi/example/retain.conf
================================================
[retain]
alignment = alignment file
gff = gff file
lens = lens file
colors = red,blue,green
refgenome = shorthand
figsize = 10,12
step = 50
ylabel = y label
savefile = retain file (result)
savefig = result(.png, .pdf, .svg)

================================================
FILE: wgdi/example/shared_fusion.conf
================================================
[shared_fusion]
blockinfo = block information (*.csv)
# The new lens file is the output filtered by lens file.
lens1 = lens file, new lens file
lens2 = lens file, new lens file
ancestor_left = ancestor file
ancestor_top = ancestor file
classid = class1,class2
limit_length = 5
filtered_blockinfo = result blockinfo (.csv)

================================================
FILE: wgdi/fusion_positions_database.py
================================================
import pandas as pd
import os
from Bio import SeqIO

class fusion_positions_database:
    def __init__(self, options):
        for k, v in options:
            setattr(self, k, v)
            print(f'{k} = {v}')

    def run(self):
        # Load and remove duplicates from data
        gff = pd.read_csv(self.gff, sep="\t", header=None, dtype={0: str, 5: int}).drop_duplicates()
        pep = SeqIO.to_dict(SeqIO.parse(self.pep, "fasta"))
        df = pd.read_csv(self.fusion_positions, sep="\t", header=None, dtype={0: str, 1: int, 2:int, 3:str}).drop_duplicates()

        # Load ancestral sequence file if it exists
        seqs = SeqIO.to_dict(SeqIO.parse(self.ancestor_pep, "fasta")) if os.path.exists(self.ancestor_pep) else {}

        sf_gff, sf_lens = [], []

        # Process fusion positions
        for _, row in df.iterrows():
            newchr = row[3]
            newgff = gff[(gff[0] == row[0]) & (gff[5] >= row[1] - row[2]) & (gff[5] < row[1] + row[2])].copy()
            newgff['id'] = [f"{newchr}s{str(row[0]).zfill(2)}g{str(i).zfill(3)}" for i in range(1, len(newgff) + 1)]
            sf_position = row[1] - newgff.iloc[0, 5]
            sf_lens.append([newchr, sf_position, len(newgff)])

            # For each gene in the filtered GFF region
            for _, gff_row in newgff.iterrows():
                if gff_row[1] in pep and gff_row['id'] not in seqs:
                    gene = pep[gff_row[1]][:]
                    gene.id, gene.description = gff_row['id'], ''
                    seqs[gff_row['id']] = gene

                # Collect data for the final GFF output
                sf_gff.append([gff_row['id'], newchr, sf_position, gff_row[2], gff_row[3], gff_row[4], gff_row[1]])

        # Write sequences to FASTA file
        SeqIO.write(seqs.values(), self.ancestor_pep, 'fasta')

        # Save filtered GFF data
        if sf_gff:
            sf_gff = pd.DataFrame(sf_gff)
            sf_gff.rename(columns={3: 'start', 4: 'end', 5: 'strand'}, inplace=True)
            sf_gff['order'] = sf_gff[0].str[-3:].astype(int)
            sf_gff[[1, 0, 'start', 'end', 'strand', 'order', 6]].to_csv(self.ancestor_gff, sep="\t", mode='a', index=False, header=None)

        sf_lens = pd.DataFrame(sf_lens).drop_duplicates()
        sf_lens.to_csv(self.ancestor_lens, sep="\t", mode='a', index=False, header=None)

        # Generate ancestral sequence data
        ancestor = []
        for _, row in sf_lens.iterrows():
            ancestor.append([row[0], 1, row[1], 'red', 1])
            ancestor.append([row[0], row[1] + 1, row[2], 'blue', 1])
        pd.DataFrame(ancestor).to_csv(self.ancestor_file, sep="\t", mode='a', index=False, header=None)

        # Remove duplicates from the output files
        for file in [self.ancestor_gff, self.ancestor_lens, self.ancestor_file]:
            df = pd.read_csv(file, header=None).drop_duplicates().to_csv(file, index=False, header=None)

================================================
FILE: wgdi/fusions_detection.py
================================================
import pandas as pd
from tabulate import tabulate

class fusions_detection:
    def __init__(self, options):
        self.min_genes_per_side = 5
        self.density = 0.3
        for k, v in options:
            setattr(self, k, v)
            print(f"{k} = {v}")
        self.min_genes_per_side = int(self.min_genes_per_side)
        self.density = float(self.density)

    def run(self):
        # Load the ancestor file and process the positions
        ancestor = pd.read_csv(self.ancestor, sep='\t', header=None)
        position = ancestor.groupby(0)[2].unique().apply(pd.Series)
        bkinfo = pd.read_csv(self.blockinfo)
        newbkinfo = bkinfo.head(0)

        # Iterate over each row in the position dataframe
        for index, row in position.iterrows():
            # Filter the bkinfo dataframe based on chr2 and density
            filtered_group = bkinfo[(bkinfo['chr2'] == index) & (bkinfo['density2'] >= self.density)].copy()
            # Split the block2 column and stack the resulting series
            df = filtered_group['block2'].str.split('_', expand=True).stack().astype(int)
            # Count the number of genes greater and less than the current position
            filtered_group['greater'] = (df > row[0]).groupby(level=0).sum()
            filtered_group['less'] = (df < row[0]).groupby(level=0).sum()
            # Filter the group based on the minimum number of genes per side
            filtered_group = filtered_group[(filtered_group['greater'] >= self.min_genes_per_side) & (filtered_group['less'] >= self.min_genes_per_side)]
            # Concatenate the filtered group with the newbkinfo dataframe
            newbkinfo = pd.concat([newbkinfo, filtered_group])

        if len(newbkinfo) ==0:
            print("\nNo shared fusion breakpoints detected")
            exit(0)

        # Get and print the shared fusion positions
        newbkinfo.to_csv(self.filtered_blockinfo, header=True, index=False)
        non_overlap_counts = newbkinfo.groupby('chr2').apply(self.count_non_overlapping)
        data = [(chr2, count) for chr2, count in non_overlap_counts.items()]
        print("\nThe following are the shared fusion breakpoints and counts:")
        print(tabulate(data, headers=["Fusion Breakpoint", "Count"], tablefmt="github"))

    def count_non_overlapping(self, group):
        if len(group) == 1:
            return 1
        grouped = group.groupby('chr1')
        total_count = 0
        for chr1, chr_group in grouped:
            chr_group = chr_group.sort_values(by='start1').reset_index(drop=True)
            count = 0
            current_end = -1
            for _, row in chr_group.iterrows():
                start1, end1 = row['start1'], row['end1']
                if start1 > current_end:
                    count += 1
                    current_end = end1
            total_count += count
        return total_count

================================================
FILE: wgdi/karyotype.py
================================================
import matplotlib.pyplot as plt
import pandas as pd

import wgdi.base as base


class karyotype():
    def __init__(self, options):
        self.width = 0.5
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        if hasattr(self, 'figsize'):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = 10, 6.18
        if hasattr(self, 'width'):
            self.width = float(self.width)
        else:
            self.width = 0.5

    def run(self):
        fig, ax = plt.subplots(figsize=self.figsize)
        ancestor_lens = pd.read_csv(
            self.ancestor, sep="\t", header=None)
        ancestor_lens[0] = ancestor_lens[0].astype(str)
        ancestor_lens[3] = ancestor_lens[3].astype(str)
        ancestor_lens[4] = ancestor_lens[4].astype(int)
        ancestor_lens[4] = ancestor_lens[4] / ancestor_lens[4].max()
        chrs = ancestor_lens[0].drop_duplicates().to_list()
        ax.bar(chrs, 10, color='white', alpha=0)
        for index, row in ancestor_lens.iterrows():
            base.Rectangle(ax, [chrs.index(row[0])-self.width*0.5,
                                row[1]], row[2]-row[1], self.width, row[3], row[4])
        ax.tick_params(labelsize=15)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_visible(False)
        ax.spines['bottom'].set_visible(False)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.savefig(self.savefig, dpi=500)
        plt.show()

================================================
FILE: wgdi/karyotype_mapping.py
================================================
import numpy as np
import pandas as pd
import wgdi.base as base

class karyotype_mapping:
    def __init__(self, options):
        # Initialize default attributes
        self.blast_reverse = False
        self.blockinfo_reverse = False
        self.position = 'order'
        self.block_length = 5
        self.limit_length = 5
        self.repeat_number = 20
        self.score = 100
        self.evalue = 1e-5

        # Update attributes with provided keyword arguments and print them
        for k, v in options:
            setattr(self, k, v)
            print(f"{k} = {v}")

        self.blast_reverse = base.str_to_bool(self.blast_reverse)
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
        self.limit_length = int(self.limit_length)

    def karyotype_left(self, pairs, ancestor, gff1, gff2):
        # Loop through each row in ancestor to set color and classification in gff1
        for _, row in ancestor.iterrows():
            loc_min, loc_max = sorted([row[1], row[2]])
            index1 = gff1[(gff1['chr'] == row[0]) & (gff1['order'] >= loc_min) & (gff1['order'] <= loc_max)].index
            gff1.loc[index1, ['color', 'classification']] = row[3], row[4]

        # Merge pairs with gff1 and update gff2 with color and classification
        data = pd.merge(pairs, gff1, left_on=0, right_index=True, how='left')
        data.drop_duplicates(subset=[1], inplace=True)
        data.set_index(1, inplace=True)
        gff2.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]
        return gff2

    def karyotype_top(self, pairs, ancestor, gff1, gff2):
        # Loop through each row in ancestor to set color and classification in gff2
        for _, row in ancestor.iterrows():
            loc_min, loc_max = sorted([row[1], row[2]])
            index1 = gff2[(gff2['chr'] == row[0]) & (gff2['order'] >= loc_min) & (gff2['order'] <= loc_max)].index
            gff2.loc[index1, ['color', 'classification']] = row[3], row[4]

        # Merge pairs with gff2 and update gff1 with color and classification
        data = pd.merge(pairs, gff2, left_on=1, right_index=True, how='left')
        data.drop_duplicates(subset=[0], inplace=True)
        data.set_index(0, inplace=True)
        gff1.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]
        return gff1

    def karyotype_map(self, gff, lens):
        # Filter gff based on lens index and non-null color
        gff = gff[gff['chr'].isin(lens.index) & gff['color'].notnull()]
        ancestor = []

        # Group by chromosome and process each group to create ancestor records
        for chr, group in gff.groupby('chr'):
            color, class_id, arr = '', 1, []
            for _, row in group.iterrows():
                if color == row['color'] and class_id == row['classification']:
                    arr.append(row['order'])
                else:
                    if len(arr) >= self.limit_length:
                        ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])
                    color, class_id = row['color'], row['classification']
                    arr = []
                    if len(ancestor) >= 1 and color == ancestor[-1][3] and class_id == ancestor[-1][4] and chr == ancestor[-1][0]:
                        arr.append(ancestor[-1][1])
                        arr += np.random.randint(ancestor[-1][1], ancestor[-1][2], size=ancestor[-1][5]-1).tolist()
                        ancestor.pop()
                    arr.append(row['order'])
            if len(arr) >= self.limit_length:
                ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])

        ancestor = pd.DataFrame(ancestor)

        # Adjust min and max positions for each chromosome group
        for chr, group in ancestor.groupby(0):
            ancestor.loc[group.index[0], 1] = 1
            ancestor.loc[group.index[-1], 2] = lens[chr]
        ancestor[4] = ancestor[4].astype(int)
        return ancestor[[0, 1, 2, 3, 4, 5]]

    def colinear_gene_pairs(self, bkinfo, gff1, gff2):
        gff1 = gff1.reset_index()
        gff2 = gff2.reset_index()
        gff1_indexed = gff1.set_index(['chr', 'order'])
        gff2_indexed = gff2.set_index(['chr', 'order'])
        data = []
        for _, row in bkinfo.iterrows():
            b1 = list(map(int, row['block1'].split('_')))
            b2 = list(map(int, row['block2'].split('_')))
            for order1, order2 in zip(b1, b2):
                a = gff1_indexed.loc[(row['chr1'], order1), 1]
                b = gff2_indexed.loc[(row['chr2'], order2), 1]
                data.append([a, b])
        return pd.DataFrame(data)

    def new_ancestor(self, ancestor, gff1, gff2, blast):
        # Iterate through ancestor rows to adjust positions based on neighboring rows
        for i in range(1, len(ancestor)):
            if ancestor.iloc[i, 0] == ancestor.iloc[i-1, 0]:
                area = ancestor.iloc[i, 1] - ancestor.iloc[i-1, 2]
                if area <= 5:
                    ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1
                else:
                    index1 = gff1[(gff1['chr'] == ancestor.iloc[i, 0]) & (gff1['order'] >= ancestor.iloc[i-1, 2]+1) & (gff1['order'] <= ancestor.iloc[i, 1]-1)].index
                    index2 = gff2[gff2['color'] == ancestor.iloc[i-1, 3]].index
                    index3 = gff2[gff2['color'] == ancestor.iloc[i, 3]].index
                    newblast1 = blast[(blast[0].isin(index1)) & (blast[1].isin(index2))]
                    newblast2 = blast[(blast[0].isin(index1)) & (blast[1].isin(index3))]
                    if len(newblast1) >= len(newblast2):
                        ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1
                    else:
                        ancestor.iloc[i, 1] = ancestor.iloc[i-1, 2] + 1

        for chr, group in ancestor.groupby(0):
            if len(group) == 1:
                continue
            newgff1 = gff1[gff1['chr'] == chr]
            for i in range(1, len(group)):
                if group.iloc[i, 5] > 200:
                    continue
                index_left = newgff1[(newgff1['order'] >= group.iloc[i, 1]) & (newgff1['order'] <= group.iloc[i, 2])].index
                blast_left = blast[blast[0].isin(index_left)]
                index_prev = gff2[gff2['color'] == group.iloc[i-1, 3]].index
                blast_prev = blast_left[blast_left[1].isin(index_prev)]
                index_curr = gff2[gff2['color'] == group.iloc[i, 3]].index
                blast_curr = blast_left[blast_left[1].isin(index_curr)]
                if len(blast_curr) <= len(blast_prev):
                    ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]-1,3]
                if i < len(group)-1:
                    index_next = gff2[gff2['color'] == group.iloc[i+1, 3]].index
                    blast_next = blast_left[blast_left[1].isin(index_next)]
                    if len(blast_next) > max(len(blast_prev),len(blast_curr)):
                        ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]+1,3]
        ancestor['group'] = (ancestor[0].shift(1) != ancestor[0]) | (ancestor[3].shift(1) != ancestor[3]) | (ancestor[4].shift(1) != ancestor[4])
        ancestor['group'] = ancestor['group'].cumsum()
        result = ancestor.groupby('group').agg({
            0: 'first',
            1: 'min',
            2: 'max',
            3: 'first',
            4: 'first',
        }).reset_index(drop=True)
        return result

    def run(self):
        # Read and process block information
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        if self.blockinfo_reverse == True:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        bkinfo = bkinfo[bkinfo['length'] > int(self.block_length)]

        # Read GFF and lens data
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        lens = base.newlens(self.the_other_lens, self.position)
        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)
        # blast.drop_duplicates(subset=[0], keep='first', inplace=True)

        # Find colinear gene pairs
        pairs = self.colinear_gene_pairs(bkinfo, gff1, gff2)

        # Depending on available attributes, call either karyotype_top or karyotype_left
        if hasattr(self, 'ancestor_top'):
            ancestor = base.read_classification(self.ancestor_top)
            data = self.karyotype_top(pairs, ancestor, gff1, gff2)
        elif hasattr(self, 'ancestor_left'):
            ancestor = base.read_classification(self.ancestor_left)
            data = self.karyotype_left(pairs, ancestor, gff1, gff2)
            gff1, gff2 = gff2, gff1
            blast.iloc[:, :2] = blast.iloc[:, [1, 0]].to_numpy()
        else:
            print('Missing ancestor file.')
            exit(0)

        # Map the data and create the final ancestor file
        the_other_ancestor_file = self.karyotype_map(data, lens)
        the_other_ancestor_file = self.new_ancestor(the_other_ancestor_file, gff1, gff2, blast)
        the_other_ancestor_file.to_csv(self.the_other_ancestor_file, sep='\t', header=False, index=False)

================================================
FILE: wgdi/ks.py
================================================
import os
import sys
import numpy as np
import pandas as pd
from Bio import SeqIO
import subprocess
from Bio.Phylo.PAML import yn00
import wgdi.base as base

class ks:
    def __init__(self, options):
        base_conf = base.config()
        self.pair_pep_file = 'pair.pep'
        self.pair_cds_file = 'pair.cds'
        self.prot_align_file = 'prot.aln'
        self.mrtrans = 'pair.mrtrans'
        self.pair_yn = 'pair.yn'
        for k, v in base_conf:
            setattr(self, str(k), v)
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{str(k)} = {v}')

    def auto_file(self):
        pairs = []
        with open(self.pairs_file) as f:
            p = ' '.join(f.readlines()[:30])

        # Detect file format and process accordingly
        if 'path length' in p or 'MAXIMUM GAP' in p:
            collinearity = base.read_colinearscan(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif 'MATCH_SIZE' in p or '## Alignment' in p:
            collinearity = base.read_mcscanx(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif '# Alignment' in p:
            collinearity = base.read_collinearity(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif '###' in p:
            collinearity = base.read_jcvi(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif ',' in p:
            collinearity = pd.read_csv(self.pairs_file, header=None)
            pairs = collinearity.values.tolist()
        else:
            collinearity = pd.read_csv(self.pairs_file, header=None, sep='\t')
            pairs = collinearity.values.tolist()

        df = pd.DataFrame(pairs).drop_duplicates()
        df[0] = df[0].astype(str)
        df[1] = df[1].astype(str)
        df.index = df[0] + ',' + df[1]
        return df

    def run(self):
        # Load sequence data
        cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta"))
        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta"))
        df_pairs = self.auto_file()

        # Check if ks file exists and load it, otherwise create a new one
        if os.path.exists(self.ks_file):
            ks = pd.read_csv(self.ks_file, sep='\t').drop_duplicates()
            kscopy = ks.copy()
            names = ks.columns.tolist()
            names[0], names[1] = names[1], names[0]
            kscopy.columns = names
            ks = pd.concat([ks, kscopy])
            ks['id'] = ks['id1'] + ',' + ks['id2']
            df_pairs.drop(np.intersect1d(df_pairs.index, ks['id'].to_numpy()), inplace=True)
            ks_file = open(self.ks_file, 'a+')
        else:
            ks_file = open(self.ks_file, 'w')
            ks_file.write('\t'.join(['id1', 'id2', 'ka_NG86', 'ks_NG86', 'ka_YN00', 'ks_YN00']) + '\n')

        # Filter valid pairs based on sequence data
        df_pairs = df_pairs[
            (df_pairs[0].isin(cds.keys())) &
            (df_pairs[1].isin(cds.keys())) &
            (df_pairs[0].isin(pep.keys())) &
            (df_pairs[1].isin(pep.keys()))
        ]
        pairs = df_pairs[[0, 1]].to_numpy()

        if len(pairs) > 0 and pairs[0][0][:3] == pairs[0][1][:3]:
            allpairs = []
            pair_hash = {}
            for k in pairs:
                if k[0] + ',' + k[1] in pair_hash or k[1] + ',' + k[0] in pair_hash:
                    continue
                else:
                    pair_hash[k[0] + ',' + k[1]] = 1
                    pair_hash[k[1] + ',' + k[0]] = 1
                    allpairs.append(k)
            pairs = allpairs

        for k in pairs:
            cds_gene1, cds_gene2 = cds[k[0]], cds[k[1]]
            cds_gene1.id, cds_gene2.id = 'gene1', 'gene2'
            pep_gene1, pep_gene2 = pep[k[0]], pep[k[1]]
            pep_gene1.id, pep_gene2.id = 'gene1', 'gene2'

            # Write sequences to files
            SeqIO.write([cds[k[0]], cds[k[1]]], self.pair_cds_file, "fasta")
            SeqIO.write([pep[k[0]], pep[k[1]]], self.pair_pep_file, "fasta")

            # Compute Ka/Ks values
            kaks = self.pair_kaks(['gene1', 'gene2'])
            if kaks is None:
                continue
            ks_file.write('\t'.join([str(i) for i in list(k) + list(kaks)]) + '\n')

        ks_file.close()

        # Clean up temporary files
        for file in [
            self.pair_pep_file, self.pair_cds_file, self.mrtrans, self.pair_yn, self.prot_align_file,
            '2YN.dN', '2YN.dS', '2YN.t', 'rst', 'rst1', 'yn00.ctl', 'rub'
        ]:
            try:
                os.remove(file)
            except OSError:
                pass

    def pair_kaks(self, k):
        self.align()
        pal = self.pal2nal()
        if not pal:
            return []
        kaks = self.run_yn00()
        if kaks is None:
            return []
        kaks_new = [
            kaks[k[0]][k[1]]['NG86']['dN'],
            kaks[k[0]][k[1]]['NG86']['dS'],
            kaks[k[0]][k[1]]['YN00']['dN'],
            kaks[k[0]][k[1]]['YN00']['dS']
        ]
        return kaks_new

    def align(self):
        if self.align_software == 'mafft':
            try:
                command = [self.mafft_path, '--quiet', self.pair_pep_file, '>', self.prot_align_file]
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running MAFFT: {e}")
        elif self.align_software == 'muscle':
            try:
                command = [self.muscle_path, '-align', self.pair_pep_file, '-output', self.prot_align_file, '-quiet']
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running Muscle: {e}")

    def pal2nal(self):
        args = ['perl', self.pal2nal_path, self.prot_align_file, self.pair_cds_file, '-output paml -nogap', '>' + self.mrtrans]
        command = ' '.join(args)
        try:
            os.system(command)
        except:
            return False
        return True

    def run_yn00(self):
        yn = yn00.Yn00()
        yn.alignment = self.mrtrans
        yn.out_file = self.pair_yn
        yn.set_options(icode=0, commonf3x4=0, weighting=0, verbose=1)
        try:
            run_result = yn.run(command=self.yn00_path)
        except:
            run_result = None
        return run_result

================================================
FILE: wgdi/ks_peaks.py
================================================
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Import from the public namespace; the scipy.stats.kde submodule is deprecated.
from scipy.stats import gaussian_kde

import wgdi.base as base

class kspeaks:
    def __init__(self, options):
        # Default values
        self.tandem_length = 200
        self.figsize = 10, 6.18
        self.fontsize = 9
        self.block_length = 3
        self.area = 0, 3
        self.tandem = True

        # Set options passed in
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{str(k)} = {v}')

        # Convert string values to lists of floats
        self.homo = [float(k) for k in self.homo.split(',')]
        self.ks_area = [float(k) for k in self.ks_area.split(',')]
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.pvalue = float(self.pvalue)
        self.block_length = int(self.block_length)
        self.tandem = base.str_to_bool(self.tandem)

    def remove_tandem(self, bkinfo):
        """
        Remove tandem duplications based on start and end position differences.
        """
        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
        group.loc[:, 'start'] = group.loc[:, 'start1'] - group.loc[:, 'start2']
        group.loc[:, 'end'] = group.loc[:, 'end1'] - group.loc[:, 'end2']
        # Drop rows where start or end difference is within tandem length
        index = group[(group['start'].abs() <= self.tandem_length) | (group['end'].abs() <= self.tandem_length)].index
        bkinfo = bkinfo.drop(index)
        return bkinfo

    def ks_kde(self, df):
        """
        Perform kernel density estimation (KDE) on Ks data.
        """
        # Clean up 'ks' column by removing leading underscores
        df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:]
        ks = df['ks'].str.split('_')
        arr = []
        ks_ave = []

        # Collect individual Ks values and calculate average Ks per row
        for v in ks.values:
            v = [float(k) for k in v if float(k) >= 0]
            if len(v) == 0:
                continue
            arr.extend(v)
            ks_ave.append(sum(v) / len(v))  # Mean of each row's Ks values

        # KDE for three distributions: median, average, total
        kdemedian = gaussian_kde(df['ks_median'].values)
        kdemedian.set_bandwidth(bw_method=kdemedian.factor / 3.)
        kdeaverage = gaussian_kde(ks_ave)
        kdeaverage.set_bandwidth(bw_method=kdeaverage.factor / 3.)
        kdetotal = gaussian_kde(arr)
        kdetotal.set_bandwidth(bw_method=kdetotal.factor / 3.)

        return [kdemedian, kdeaverage, kdetotal]

    def run(self):
        """
        Main method to process the data, perform KDE, and generate the plot.
        """
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)

        # Read the block info file
        bkinfo = pd.read_csv(self.blockinfo)
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo['length'] = bkinfo['length'].astype(int)

        # Filter based on block length and p-value
        bkinfo = bkinfo[(bkinfo['length'] > self.block_length) & (bkinfo['pvalue'] < self.pvalue)]

        # Remove tandem duplications if needed
        if self.tandem == False:
            bkinfo = self.remove_tandem(bkinfo)

        # Further filtering on the homo<multiple> column range and the Ks median range
        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] >= self.homo[0]]
        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] <= self.homo[1]]
        bkinfo = bkinfo[bkinfo['ks_median'] >= self.ks_area[0]]
        bkinfo = bkinfo[bkinfo['ks_median'] <= self.ks_area[1]]

        # Perform KDE on the Ks data
        kdemedian, kdeaverage, kdetotal = self.ks_kde(bkinfo)

        # Define the range for the x-axis (Ks values)
        dist_space = np.linspace(self.area[0], self.area[1], 500)

        # Plot the KDE results
        ax.plot(dist_space, kdemedian(dist_space), color='red', label='block median')
        ax.plot(dist_space, kdeaverage(dist_space), color='black', label='block average')
        ax.plot(dist_space, kdetotal(dist_space), color='blue', label='all pairs')

        # Set plot labels, grid, and limits
        ax.grid()
        ax.set_xlabel(r'${K_{s}}$', fontsize=20)
        ax.set_ylabel('Frequency', fontsize=20)
        ax.tick_params(labelsize=18)
        ax.set_xlim(self.area)
        ax.legend(fontsize=20)

        # Adjust layout for better display
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)

        # Save the figure
        plt.savefig(self.savefig, dpi=500)
        plt.show()

        # Save the filtered data to CSV
        bkinfo.to_csv(self.savefile, index=False)

================================================
FILE: wgdi/ksfigure.py
================================================
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import wgdi.base as base
from scipy import stats


class ksfigure():
    def __init__(self, options):
        self.figsize = 10, 6.18
        self.legendfontsize = 30
        self.labelfontsize = 9
        self.area = 0, 3
        self.shadow = True
        self.mode = 'median'
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        if self.xlabel == 'none' or self.xlabel == '':
            self.xlabel = r'Synonymous nucleotide substitution (${K_{s}}$)'
        if self.ylabel == 'none' or self.ylabel == '':
            self.ylabel = 'kernel density of syntenic blocks'
        if self.title == 'none' or self.title == '':
            self.title = ''
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.shadow = base.str_to_bool(self.shadow)

    def Gaussian_distribution(self, t, k):
        y = np.zeros(len(t))
        for i in range(0, int((len(k) - 1) / 3)+1):
            if np.isnan(k[3 * i + 2]):
                continue
            k[3 * i + 2] = float(k[3 * i + 2])/np.sqrt(2)
            k[3 * i + 0] = float(k[3 * i + 0]) * \
                np.sqrt(2*np.pi)*float(k[3 * i + 2])
            y1 = stats.norm.pdf(
                t, float(k[3 * i + 1]), float(k[3 * i + 2])) * float(k[3 * i + 0])
            y = y+y1
        return y

    def run(self):
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        ksfit = pd.read_csv(self.ksfit, index_col=0)
        t = np.arange(self.area[0], self.area[1], 0.0005)
        col = [k for k in ksfit.columns if re.match('Unnamed:', k)]
        for index, row in ksfit.iterrows():
            ax.plot(t, self.Gaussian_distribution(
                t, row[col].values), linestyle=row['linestyle'], color=row['color'],alpha=0.8, label=index, linewidth=row['linewidth'])
            if self.shadow == True:
                ax.fill_between(t, 0, self.Gaussian_distribution(t, row[col].values), color=row['color'], alpha=0.15, interpolate=True, edgecolor=None, label=index,)
        align = dict(family='Arial', verticalalignment="center",
                     horizontalalignment="center")
        ax.set_xlabel(self.xlabel, fontsize=self.labelfontsize, labelpad=20, **align)
        ax.set_ylabel(self.ylabel, fontsize=self.labelfontsize, labelpad=20, **align)
        ax.set_title(self.title, weight='bold', fontsize=self.labelfontsize, **align)
        plt.tick_params(labelsize=10)
        handles,labels = ax.get_legend_handles_labels()
        df = pd.DataFrame({ 'handles': handles, 'labels': labels})
        df.drop_duplicates(subset='labels', keep='first', inplace=True)
        handles, labels = df['handles'].tolist(), df['labels'].tolist()
        if self.shadow == True:
            plt.legend(handles=handles,labels=labels,loc='upper right', prop={
                'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})
        else:
            plt.legend(handles=handles,labels=labels,loc='upper right', prop={
                'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)

================================================
FILE: wgdi/peaksfit.py
================================================
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.stats import gaussian_kde, linregress

import wgdi.base as base


class peaksfit():
    def __init__(self, options):
        self.figsize = 10, 6.18
        self.fontsize = 9
        self.area = 0, 3
        self.mode = 'median'
        self.histogram_only = False
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.bins_number = int(self.bins_number)
        self.peaks = 1
        self.histogram_only = base.str_to_bool(self.histogram_only)

    def ks_values(self, df):
        df.loc[df['ks'].str.startswith('_'),'ks']= df.loc[df['ks'].str.startswith('_'),'ks'].str[1:]
        ks = df['ks'].str.split('_')
        ks_total = []
        ks_average = []
        for v in ks.values:
            ks_total.extend([float(k) for k in v])
        ks_average = df['ks_average'].values
        ks_median = df['ks_median'].values
        return [ks_median, ks_average, ks_total]

    def gaussian_fuc(self, x, *params):
        y = np.zeros_like(x)
        for i in range(0, len(params), 3):
            amp = float(params[i])
            ctr = float(params[i+1])
            wid = float(params[i+2])
            y = y + amp * np.exp(-((x - ctr)/wid)**2)
        return y

    def kde_fit(self, data, x):
        kde = gaussian_kde(data)
        kde.set_bandwidth(bw_method=kde.factor/3.)
        p = kde(x)
        guess = [1,1, 1]*self.peaks
        popt, pcov = curve_fit(self.gaussian_fuc, x, p, guess, maxfev = 80000)
        popt = [abs(k) for k in popt]
        data = []
        y = self.gaussian_fuc(x, *popt)
        for i in range(0, len(popt), 3):
            array = [popt[i], popt[i+1], popt[i+2]]
            data.append(self.gaussian_fuc(x, *array))
        slope, intercept, r_value, p_value, std_err = linregress(p, y)
        print("\nR-square: "+str(r_value**2))
        print("The gaussian fitting curve parameters are :")
        print(' | '.join([str(k) for k in popt]))
        return y, data

    def run(self):
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        bkinfo = pd.read_csv(self.blockinfo)
        ks_median, ks_average, ks_total = self.ks_values(bkinfo)
        data = eval('ks_'+self.mode)
        data = [k for k in data if self.area[0] <= k <= self.area[1]]
        x = np.linspace(self.area[0], self.area[1], self.bins_number)
        n, bins, patches = ax.hist(data, int(
            self.bins_number), density=1, facecolor='blue', alpha=0.3, label='Histogram')
        if self.histogram_only == True:
            pass
        else:
            y, fit = self.kde_fit(data, x)
            ax.plot(x, y, color='black', linestyle='-', label='Gaussian fitting')
        ax.grid()
        align = dict(family='Arial', verticalalignment="center",
                     horizontalalignment="center")
        ax.set_xlabel(r'${K_{s}}$', fontsize=20)
        ax.set_ylabel('Frequency', fontsize=20)
        ax.tick_params(labelsize=18)
        ax.legend(fontsize=20)
        ax.set_xlim(self.area)
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)

================================================
FILE: wgdi/pindex.py
================================================
import os
import sys

import numpy as np
import pandas as pd

import wgdi.base as base


class pindex():
    def __init__(self, options):
        self.remove_delta = True
        self.position = 'order'
        self.retention = 0.05
        self.diff = 0.05
        self.gap = 50
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.gap = int(self.gap)
        self.retention = float(self.retention)
        self.diff = float(self.diff)

    def Pindex(self, sub1, sub2):
        r1 = self.retain(sub1)
        r2 = self.retain(sub2)
        r = []
        for i in range(len(r2)):
            if(r1[i] < self.retention or r2[i] < self.retention):
                r.append(0)
                continue
            d = (r1[i]-r2[i])/(r1[i]+r2[i])*0.5
            if d > self.diff:
                r.append(1)
            elif -d > self.diff:
                r.append(-1)
            else:
                r.append(0)
        a, b, c = len([i for i in r if i == 1]), len(
            [i for i in r if i == -1]), len([i for i in r if i == 0])
        return [a, -b, c, len(r)]

    def retain(self, arr):
        a = []
        for i in range(0, len(arr), 2*self.gap):
            start, end = i-self.gap, i+self.gap
            genenum, retainnum = 0, 0
            for j in range(start, end):
                if((j >= int(len(arr))) or (j < 0)):
                    continue
                else:
                    retainnum += arr[j]
                    genenum += 1
            a.append(float(retainnum/genenum))
        return a

    def run(self):
        alignment = pd.read_csv(self.alignment, header=None, index_col=0)
        alignment.replace(r'\w+', 1, regex=True, inplace=True)
        alignment.replace('.', 0, inplace=True)
        alignment.fillna(0, inplace=True)
        gff = base.newgff(self.gff)
        lens = base.newlens(self.lens, self.position)
        gff = gff[gff['chr'].isin(lens.index)]
        alignment = alignment.join(gff[['chr', self.position]], how='left')
        alignment.dropna(axis=0, how='any', inplace=True)
        p = self.cal_pindex(alignment)
        print('Polyploidy-index: ', p)
        sys.exit(0)

    def cal_pindex(self, alignment):
        data, df = [], []
        columns = alignment.columns[:-2].tolist()
        for i in range(len(columns)-1):
            for j in range(i+1, len(columns)):
                b = []
                for chr, group in alignment.groupby('chr'):
                    sub1 = group.loc[:, columns[i]].tolist()
                    sub2 = group.loc[:, columns[j]].tolist()
                    p = self.Pindex(sub1, sub2)
                    b.append(p)
                    df.append([i, j, chr]+p)
                sub_diver = sum([abs(k[0]+k[1]) for k in b])
                if self.remove_delta == True:
                    sub_total = sum([abs(k[1])+abs(k[0]) for k in b])
                    if sub_total == 0:
                        c = 0
                    else:
                        c = sub_diver/sub_total
                else:
                    sub_total = sum([abs(k[1])+abs(k[0])+abs(k[2]) for k in b])
                    c = sub_diver/sub_total
                data.append(c)
        df = pd.DataFrame(df, columns=[
                          'sub1', 'sub2', 'chr', 'sub1_high', 'sub2_high', 'No_diff', 'Total'])
        df['sub2_high'] = df['sub2_high'].abs()
        self.infomation(df)
        print('\nPolyploidy-index between subgenomes are ', data)
        return sum(data)/len(data)

    def turn_percentage(self, x):
        return '(%.2f%%)' % (x * 100)

    def infomation(self, df):
        data = []
        for names, group in df.groupby(['sub1', 'sub2']):
            newgroup = pd.concat([group.head(1), group], axis=0, ignore_index=True)
            cols = ['sub1_high', 'sub2_high', 'No_diff', 'Total']
            newgroup.loc[0, cols] = group.loc[:, cols].sum()
            group1 = newgroup.copy()
            group1[cols] = group1[cols].astype(str)
            newgroup['sub1_high'] = (
                newgroup['sub1_high'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['sub2_high'] = (
                newgroup['sub2_high'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['No_diff'] = (
                newgroup['No_diff'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['Total'] = (
                newgroup['Total'] / group['Total'].sum()).apply(self.turn_percentage)
            newgroup[cols] = group1[cols]+newgroup[cols]
            group_list = []
            a = newgroup[['chr']+cols].columns.to_numpy()
            a[0] = 'Chromosome'
            a[1], a[2] = 'Sub_'+str(names[0]+1), 'Sub_'+str(names[1]+1)
            group_list.append(a)
            b = newgroup[['chr']+cols].to_numpy()
            b[0][0] = 'Total'
            for k in b:
                group_list.append(k)
            group_list = np.array(group_list).T
            for k in group_list:
                data.append(k)
        data = pd.DataFrame(data)
        data.to_csv(self.savefile, header=None, index=None)

================================================
FILE: wgdi/polyploidy_classification.py
================================================
import pandas as pd
import wgdi.base as base

class polyploidy_classification:
    def __init__(self, options):
        self.same_protochromosome = False
        self.same_subgenome = False
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        self.same_protochromosome = base.str_to_bool(self.same_protochromosome)
        self.same_subgenome = base.str_to_bool(self.same_subgenome)
        # Initialize classid with a default value if not provided
        self.classid = [str(k) for k in getattr(self, 'classid', 'class1,class2').split(',')]

    def run(self):
        # Read input files
        ancestor_left = base.read_classification(self.ancestor_left)
        ancestor_top = base.read_classification(self.ancestor_top)
        bkinfo = pd.read_csv(self.blockinfo)

        # Ensure chr1 and chr2 are treated as strings
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)

        # Filter rows where chr1 and chr2 match ancestor values
        bkinfo = bkinfo[bkinfo['chr1'].isin(ancestor_left[0].values) & bkinfo['chr2'].isin(ancestor_top[0].values)]

        # Initialize additional columns
        bkinfo[self.classid[0]] = 0
        bkinfo[self.classid[1]] = 0
        bkinfo[self.classid[0] + '_color'] = ''
        bkinfo[self.classid[1] + '_color'] = ''
        bkinfo['diff'] = 0.0

        # Processing the first classification (ancestor_left vs chr1)
        for name, group in bkinfo.groupby('chr1'):
            d1 = ancestor_left[ancestor_left[0] == name]
            for index1, row1 in group.iterrows():
                a, b = sorted([row1['start1'], row1['end1']])
                a, b = int(a), int(b)
                for index2, row2 in d1.iterrows():
                    c, d = sorted([row2[1], row2[2]])
                    h = len([k for k in range(a, b) if k in range(c, d)]) / (b - a)
                    if h > bkinfo.loc[index1, 'diff']:
                        bkinfo.loc[index1, 'diff'] = float(h)
                        bkinfo.loc[index1, self.classid[0]] = row2[4]
                        bkinfo.loc[index1, self.classid[0] + '_color'] = row2[3]

        # Reset 'diff' and process the second classification (ancestor_top vs chr2)
        bkinfo['diff'] = 0.0
        for name, group in bkinfo.groupby('chr2'):
            d2 = ancestor_top[ancestor_top[0] == name]
            for index1, row1 in group.iterrows():
                a, b = sorted([row1['start2'], row1['end2']])
                a, b = int(a), int(b)
                for index2, row2 in d2.iterrows():
c, d = sorted([row2[1], row2[2]]) h = len([k for k in range(a, b) if k in range(c, d)]) / (b - a) if h > bkinfo.loc[index1, 'diff']: bkinfo.loc[index1, 'diff'] = float(h) bkinfo.loc[index1, self.classid[1]] = row2[4] bkinfo.loc[index1, self.classid[1] + '_color'] = row2[3] # Uncomment if you want to filter rows where both colors match if self.same_protochromosome == True: bkinfo = bkinfo[bkinfo[self.classid[1] + '_color'] == bkinfo[self.classid[0] + '_color']] if self.same_subgenome == True: bkinfo = bkinfo[bkinfo[self.classid[1]] == bkinfo[self.classid[0]]] # Save the result to a CSV file bkinfo.to_csv(self.savefile, index=False) ================================================ FILE: wgdi/retain.py ================================================ import matplotlib.pyplot as plt import pandas as pd import wgdi.base as base class retain: def __init__(self, options): self.position = 'order' # Initialize the options by setting attributes dynamically for k, v in options: setattr(self, str(k), v) print(f"{str(k)} = {v}") # Handle the ylim parameter, which defines the y-axis limits self.ylim = [float(k) for k in self.ylim.split(',')] if hasattr(self, 'ylim') else [0, 1] # Handle the colors and figsize parameters self.colors = [str(k) for k in self.colors.split(',')] self.figsize = [float(k) for k in self.figsize.split(',')] def run(self): # Load GFF and lens data gff = base.newgff(self.gff) lens = base.newlens(self.lens, self.position) # Filter GFF data based on lens chromosome index gff = gff[gff['chr'].isin(lens.index)] # Load alignment data and join with GFF alignment = pd.read_csv(self.alignment, header=None, index_col=0) alignment = alignment.join(gff[['chr', self.position]], how='left') # Perform alignment processing self.retain = self.align_chr(alignment) # Save the processed data to a file self.retain[self.retain.columns[:-2]].to_csv(self.savefile, sep='\t', header=None) # Create a figure for plotting fig, axs = plt.subplots(len(lens), 1, sharex=True, 
sharey=True, figsize=tuple(self.figsize)) fig.add_subplot(111, frameon=False) align = dict(family='DejaVu Sans', verticalalignment="center", horizontalalignment="center") # Hide all the spines and ticks on the plot for spine in plt.gca().spines.values(): spine.set_visible(False) plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False) # Group the retain data by chromosome and plot each chromosome's data groups = self.retain.groupby('chr') for i, chr_name in enumerate(lens.index): group = groups.get_group(chr_name) if len(lens) == 1: for j, col in enumerate(self.retain.columns[:-2]): axs.plot(group['order'].values, group[col].values, linestyle='-', color=self.colors[j], linewidth=1) axs.spines['right'].set_visible(False) axs.spines['top'].set_visible(False) axs.set_ylim(self.ylim) axs.tick_params(labelsize=12) else: # Plot each column's data for the current chromosome for j, col in enumerate(self.retain.columns[:-2]): axs[i].plot(group['order'].values, group[col].values, linestyle='-', color=self.colors[j], linewidth=1) # Hide the right and top spines for each subplot axs[i].spines['right'].set_visible(False) axs[i].spines['top'].set_visible(False) axs[i].set_ylim(self.ylim) axs[i].tick_params(labelsize=12) for i, chr_name in enumerate(lens.index): if len(lens) == 1: x, y = axs.get_xlim()[1] * 0.90, axs.get_ylim()[1] * 0.8 axs.text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align) else: # Add a label for the reference genome and chromosome x, y = axs[i].get_xlim()[1] * 0.90, axs[i].get_ylim()[1] * 0.8 axs[i].text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align) # Adjust layout and save the figure as an image plt.ylabel(f"{self.ylabel}\n\n\n\n", fontsize=18, **align) plt.subplots_adjust(left=0.1, right=0.95, top=0.95, bottom=0.05) plt.savefig(self.savefig, dpi=500) plt.show() def align_chr(self, alignment): """ Perform the alignment processing for each chromosome by updating the values. 
""" for i in alignment.columns[:-2]: # Update values: set '1' for valid values, '0' for invalid, and fill NaN with 0 alignment.loc[alignment[i].str.contains(r'\w', na=False), i] = 1 alignment.loc[alignment[i] == '.', i] = 0 alignment.loc[alignment[i] == ' ', i] = 0 alignment[i] = alignment[i].astype('float64').fillna(0) # Apply the moving average function to each group by chromosome for chr_name, group in alignment.groupby(['chr']): a = self.moving_average(group[i].values.tolist()) alignment.loc[group.index, i] = a return alignment def moving_average(self, arr): """ Calculate a moving average over a specified window size. This function smooths the input array using a sliding window. """ a = [] for i in range(len(arr)): # Define the window range start, end = max(0, i - int(self.step)), min(len(arr), i + int(self.step)) ave = sum(arr[start:end]) / (end - start) a.append(ave) return a ================================================ FILE: wgdi/run.py ================================================ import argparse import os import shutil import sys import wgdi import wgdi.base as base from wgdi.align_dotplot import align_dotplot from wgdi.block_correspondence import block_correspondence from wgdi.block_info import block_info from wgdi.block_ks import block_ks from wgdi.circos import circos from wgdi.dotplot import dotplot from wgdi.karyotype import karyotype from wgdi.karyotype_mapping import karyotype_mapping from wgdi.ks import ks from wgdi.ks_peaks import kspeaks from wgdi.ksfigure import ksfigure from wgdi.peaksfit import peaksfit from wgdi.pindex import pindex from wgdi.polyploidy_classification import polyploidy_classification from wgdi.retain import retain from wgdi.run_colliearity import mycollinearity from wgdi.trees import trees from wgdi.ancestral_karyotype import ancestral_karyotype from wgdi.ancestral_karyotype_repertoire import ancestral_karyotype_repertoire from wgdi.shared_fusion import shared_fusion from wgdi.fusion_positions_database import 
fusion_positions_database from wgdi.fusions_detection import fusions_detection # Argument parser setup parser = argparse.ArgumentParser( prog='wgdi', usage='%(prog)s [options]', epilog="", formatter_class=argparse.RawDescriptionHelpFormatter ) parser.description = '''\ WGDI(Whole-Genome Duplication Integrated): A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. https://wgdi.readthedocs.io/en/latest/ -------------------------------------- ''' parser.add_argument("-v", "--version", action='version', version='0.75') parser.add_argument("-d", dest="dotplot", help="Show homologous gene dotplot") parser.add_argument("-icl", dest="improvedcollinearity", help="Improved version of ColinearScan ") parser.add_argument("-ks", dest="calks", help="Calculate Ka/Ks for homologous gene pairs by YN00") parser.add_argument("-bk", dest="blockks", help="Show Ks of blocks in a dotplot") parser.add_argument("-bi", dest="blockinfo", help="Collinearity and Ks speculate whole genome duplication") parser.add_argument("-c", dest="correspondence", help="Extract event-related genomic alignment") parser.add_argument("-kp", dest="kspeaks", help="A simple way to get ks peaks") parser.add_argument("-kf", dest="ksfigure", help="A simple way to draw ks distribution map") parser.add_argument("-pf", dest="peaksfit", help="Gaussian fitting of ks distribution") parser.add_argument("-pc", dest="polyploidy_classification", help="Polyploid distinguish among subgenomes") parser.add_argument("-a", dest="alignment", help="Show event-related genomic alignment in a dotplot") parser.add_argument("-k", dest="karyotype", help="Show genome evolution from reconstructed ancestors") parser.add_argument("-ak", dest="ancestral_karyotype", help="Generation of ancestral karyotypes from chromosomes that retain same structures in genomes") parser.add_argument("-akr", dest="ancestral_karyotype_repertoire", help="Incorporate genes from collinearity blocks into the ancestral 
karyotype repertoire") parser.add_argument("-km", dest="karyotype_mapping", help="Mapping from the known karyotype result to this species") parser.add_argument("-fpd", dest="fusion_positions_database", help="Extract the fusion positions dataset") parser.add_argument("-fd", dest="fusions_detection", help="Determine whether these fusion events occur in other genomes") parser.add_argument("-sf", dest="shared_fusion", help="Quickly find shared fusions between species") parser.add_argument("-at", dest="alignmenttrees", help="Collinear genes construct phylogenetic trees") parser.add_argument("-p", dest="pindex", help="Polyploidy-index characterize the degree of divergence between subgenomes of a polyploidy") parser.add_argument("-r", dest="retain", help="Show subgenomes in gene retention or genome fractionation") parser.add_argument("-ci", dest="circos", help="A simple way to run circos") parser.add_argument("-conf", dest="configure", help="Display and modify the environment variable") args = parser.parse_args() # Function to run subprograms based on options def run_subprogram(program, conf, name): options = base.load_conf(conf, name) r = program(options) r.run() # Function to configure environment def run_configure(): base.rewrite(args.configure, 'ini') # Main function to decide which module to run based on input arguments def module_to_run(argument, conf): switcher = { 'dotplot': (dotplot, conf, 'dotplot'), 'correspondence': (block_correspondence, conf, 'correspondence'), 'alignment': (align_dotplot, conf, 'alignment'), 'retain': (retain, conf, 'retain'), 'blockks': (block_ks, conf, 'blockks'), 'blockinfo': (block_info, conf, 'blockinfo'), 'calks': (ks, conf, 'ks'), 'circos': (circos, conf, 'circos'), 'kspeaks': (kspeaks, conf, 'kspeaks'), 'peaksfit': (peaksfit, conf, 'peaksfit'), 'ksfigure': (ksfigure, conf, 'ksfigure'), 'pindex': (pindex, conf, 'pindex'), 'alignmenttrees': (trees, conf, 'alignmenttrees'), 'improvedcollinearity': (mycollinearity, conf, 
'collinearity'), 'configure': run_configure, 'polyploidy_classification': (polyploidy_classification, conf, 'polyploidy classification'), 'karyotype': (karyotype, conf, 'karyotype'), 'ancestral_karyotype': (ancestral_karyotype, conf, 'ancestral_karyotype'), 'karyotype_mapping': (karyotype_mapping, conf, 'karyotype_mapping'), 'ancestral_karyotype_repertoire': (ancestral_karyotype_repertoire, conf, 'ancestral_karyotype_repertoire'), 'shared_fusion': (shared_fusion, conf, 'shared_fusion'), 'fusion_positions_database': (fusion_positions_database, conf, 'fusion_positions_database'), 'fusions_detection': (fusions_detection, conf, 'fusions_detection'), } if argument == 'configure': run_configure() else: program, conf, name = switcher.get(argument) if program: run_subprogram(program, conf, name) # Main entry point def main(): path = wgdi.__path__[0] options = { 'dotplot': 'dotplot.conf', 'correspondence': 'corr.conf', 'alignment': 'align.conf', 'retain': 'retain.conf', 'blockks': 'blockks.conf', 'blockinfo': 'blockinfo.conf', 'calks': 'ks.conf', 'circos': 'circos.conf', 'kspeaks': 'kspeaks.conf', 'ksfigure': 'ksfigure.conf', 'pindex': 'pindex.conf', 'alignmenttrees': 'alignmenttrees.conf', 'peaksfit': 'peaksfit.conf', 'configure': 'conf.ini', 'improvedcollinearity': 'collinearity.conf', 'polyploidy_classification': 'polyploidy_classification.conf', 'karyotype': 'karyotype.conf', 'ancestral_karyotype': 'ancestral_karyotype.conf', 'ancestral_karyotype_repertoire': 'ancestral_karyotype_repertoire.conf', 'karyotype_mapping': 'karyotype_mapping.conf', 'shared_fusion': 'shared_fusion.conf', 'fusion_positions_database': 'fusion_positions_database.conf', 'fusions_detection': 'fusions_detection.conf', } for arg in vars(args): value = getattr(args, arg) if value is not None: if value in ['?', 'help', 'example']: with open(os.path.join(path, 'example', options[arg])) as f: print(f.read()) if arg == 'ksfigure' and not os.path.exists('ks_fit_result.csv'): 
                    shutil.copy2(os.path.join(wgdi.__path__[0], 'example/ks_fit_result.csv'), os.getcwd())
            elif not os.path.exists(value):
                print(f'{value} does not exist')
                sys.exit(0)
            else:
                module_to_run(arg, value)


if __name__ == "__main__":
    main()

================================================
FILE: wgdi/run_colliearity.py
================================================
import gc
import re
import sys
from multiprocessing import Pool

import numpy as np
import pandas as pd

import wgdi.base as base
import wgdi.collinearity as improvedcollinearity


class mycollinearity():
    def __init__(self, options):
        # Initialize parameters with default values
        self.repeat_number = 10
        self.multiple = 1
        self.score = 100
        self.evalue = 1e-5
        self.blast_reverse = False
        self.over_gap = 5
        self.comparison = 'genomes'
        self.options = options
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{str(k)} = {v}")
        self.position = 'order'

        # Parse grading values
        if hasattr(self, 'grading'):
            self.grading = [int(k) for k in self.grading.split(',')]
        else:
            self.grading = [50, 40, 25]

        # Ensure process is an integer
        if hasattr(self, 'process'):
            self.process = int(self.process)
        else:
            self.process = 4

        self.over_gap = int(self.over_gap)
        # Store the converted flag; the result was previously discarded
        self.blast_reverse = base.str_to_bool(self.blast_reverse)

    def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):
        bluenum = rednum
        blast = blast.sort_values(by=[0, 11], ascending=[True, False])

        def assign_grading(group):
            group['cumcount'] = group.groupby(1).cumcount()
            group = group[group['cumcount'] <= repeat_number]
            group['grading'] = pd.cut(
                group['cumcount'],
                bins=[-1, 0, bluenum, repeat_number],
                labels=self.grading,
                right=True
            )
            return group

        newblast = blast.groupby(['chr1', 'chr2']).apply(assign_grading).reset_index(drop=True)
        newblast['grading'] = newblast['grading'].astype(int)
        return newblast[newblast['grading'] > 0]

    def deal_blast_for_genomes(self, blast, rednum, repeat_number):
        # Initialize the grading column
        blast['grading'] = 0

        # Define the blue number as the sum of rednum and the predefined
constant bluenum = 4 + rednum # Get the indices for each group by sorting the 11th column in descending order index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist() for name, group in blast.groupby([0])] # Split the indices into red, blue, and gray groups reddata = np.array([k[:rednum] for k in index], dtype=object) bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object) graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object) # Concatenate the results into flat lists redindex = np.concatenate(reddata) if reddata.size else [] blueindex = np.concatenate(bluedata) if bluedata.size else [] grayindex = np.concatenate(graydata) if graydata.size else [] # Update the grading column based on the group indices blast.loc[redindex, 'grading'] = self.grading[0] blast.loc[blueindex, 'grading'] = self.grading[1] blast.loc[grayindex, 'grading'] = self.grading[2] # Return only the rows with non-zero grading return blast[blast['grading'] > 0] def run(self): # Read and process lens files lens1 = base.newlens(self.lens1, 'order') lens2 = base.newlens(self.lens2, 'order') # Read and process gff files gff1 = base.newgff(self.gff1) gff2 = base.newgff(self.gff2) # Filter gff data based on lens indices gff1 = gff1[gff1['chr'].isin(lens1.index)] gff2 = gff2[gff2['chr'].isin(lens2.index)] # Process blast data blast = base.newblast(self.blast, int(self.score), float(self.evalue),gff1, gff2, self.blast_reverse) # Map positions and chromosome information blast['loc1'] = blast[0].map(gff1[self.position]) blast['loc2'] = blast[1].map(gff2[self.position]) blast['chr1'] = blast[0].map(gff1['chr']) blast['chr2'] = blast[1].map(gff2['chr']) # Apply blast filtering and grading if self.comparison.lower() == 'genomes': blast = self.deal_blast_for_genomes(blast, int(self.multiple), int(self.repeat_number)) if self.comparison.lower() == 'chromosomes': blast = self.deal_blast_for_chromosomes(blast, int(self.multiple), 
int(self.repeat_number)) print(f"The filtered homologous gene pairs are {len(blast)}.\n") if len(blast) < 1: print("Stopped!\n\nIt may be that the id1 and id2 in the BLAST file do not match with (gff1, lens1) and (gff2, lens2).") sys.exit(1) # Group blast data by 'chr1' and 'chr2' total = [] for (chr1, chr2), group in blast.groupby(['chr1', 'chr2']): total.append([chr1, chr2, group]) del blast, group gc.collect() # Determine chunk size for multiprocessing n = int(np.ceil(len(total) / float(self.process))) result, data = '', [] try: # Initialize multiprocessing Pool pool = Pool(self.process) for i in range(0, len(total), n): # Apply single_pool function asynchronously data.append(pool.apply_async( self.single_pool, args=(total[i:i + n], gff1, gff2, lens1, lens2) )) pool.close() pool.join() except: pool.terminate() for k in data: # Collect results from async tasks text = k.get() if text: result += text # Write final output to file result = re.split('\n', result) fout = open(self.savefile, 'w') num = 1 for line in result: if re.match(r"# Alignment", line): # Replace alignment number s = f'# Alignment {num}:' fout.write(s + line.split(':')[1] + '\n') num += 1 continue if len(line) > 0: fout.write(line + '\n') fout.close() sys.exit(0) def single_pool(self, group, gff1, gff2, lens1, lens2): text = '' for bk in group: chr1, chr2 = str(bk[0]), str(bk[1]) print(f'Running {chr1} vs {chr2}') # Extract and sort points points = bk[2][['loc1', 'loc2', 'grading']].sort_values( by=['loc1', 'loc2'], ascending=[True, True] ) # Initialize collinearity analysis collinearity = improvedcollinearity.collinearity( self.options, points) data = collinearity.run() if not data: continue # Extract gene information gf1 = gff1[gff1['chr'] == chr1].reset_index().set_index('order')[[1, 'strand']] gf2 = gff2[gff2['chr'] == chr2].reset_index().set_index('order')[[1, 'strand']] n = 1 for block, evalue, score in data: if len(block) < self.over_gap: continue # Map gene names and strands block['name1'] 
= block['loc1'].map(gf1[1]) block['name2'] = block['loc2'].map(gf2[1]) block['strand1'] = block['loc1'].map(gf1['strand']) block['strand2'] = block['loc2'].map(gf2['strand']) block['strand'] = np.where( block['strand1'] == block['strand2'], '1', '-1' ) # Prepare text output block['text'] = block.apply( lambda x: f"{x['name1']} {x['loc1']} {x['name2']} {x['loc2']} {x['strand']}\n", axis=1 ) # Determine alignment mark a, b = block['loc2'].head(2).values mark = 'plus' if a < b else 'minus' # Append alignment information text += f'# Alignment {n}: score={score} pvalue={evalue} N={len(block)} {chr1}&{chr2} {mark}\n' text += ''.join(block['text'].values) n += 1 return text ================================================ FILE: wgdi/shared_fusion.py ================================================ import pandas as pd import wgdi.base as base class shared_fusion: def __init__(self, options): for k, v in options: setattr(self, str(k), v) print(f"{k} = {v}") # Handle classid and limit_length options self.classid = [str(k) for k in self.classid.split(',')] if hasattr(self, 'classid') else ['class1', 'class2'] self.limit_length = int(self.limit_length) if hasattr(self, 'limit_length') else 20 # Clean and split lens files self.lens1 = self.lens1.replace(' ', '').split(',') self.lens2 = self.lens2.replace(' ', '').split(',') def run(self): # Read classification files and block information ancestor_left = base.read_classification(self.ancestor_left) ancestor_top = base.read_classification(self.ancestor_top) bkinfo = pd.read_csv(self.blockinfo) # Preprocess blockinfo columns bkinfo['chr1'] = bkinfo['chr1'].astype(str) bkinfo['chr2'] = bkinfo['chr2'].astype(str) bkinfo['start1'] = bkinfo['start1'].astype(int) bkinfo['end1'] = bkinfo['end1'].astype(int) bkinfo['start2'] = bkinfo['start2'].astype(int) bkinfo['end2'] = bkinfo['end2'].astype(int) # Filter based on ancestor chromosomes bkinfo = bkinfo[(bkinfo['chr1'].isin(ancestor_left[0].values)) & 
(bkinfo['chr2'].isin(ancestor_top[0].values))] # Read lens files lens1 = pd.read_csv(self.lens1[0], sep='\t', header=None) lens2 = pd.read_csv(self.lens2[0], sep='\t', header=None) lens1[0] = lens1[0].astype(str) lens2[0] = lens2[0].astype(str) # Perform block fusion analysis blockinfoout = self.block_fusions(bkinfo, ancestor_left, ancestor_top) # Apply filters based on breakpoints and length blockinfoout = blockinfoout[(blockinfoout['breakpoints1'] == 1) & (blockinfoout['breakpoints2'] == 1)] blockinfoout = blockinfoout[(blockinfoout['break_length1'] >= self.limit_length) & (blockinfoout['break_length2'] >= self.limit_length)] # Save the filtered block info blockinfoout.to_csv(self.filtered_blockinfo, index=False) # Filter lens data based on the blockinfoout lens1 = lens1[lens1[0].isin(blockinfoout['chr1'].values)] lens2 = lens2[lens2[0].isin(blockinfoout['chr2'].values)] # Save filtered lens data lens1.to_csv(self.lens1[1], sep='\t', index=False, header=False) lens2.to_csv(self.lens2[1], sep='\t', index=False, header=False) def block_fusions(self, bkinfo, ancestor_left, ancestor_top): # Initialize new columns in the bkinfo dataframe bkinfo['breakpoints1'] = 0 bkinfo['breakpoints2'] = 0 bkinfo['break_length1'] = 0 bkinfo['break_length2'] = 0 for index, row in bkinfo.iterrows(): # Process species 1 (chr1) a, b = sorted([row['start1'], row['end1']]) d1 = ancestor_left[(ancestor_left[0] == row['chr1']) & (ancestor_left[2] >= a) & (ancestor_left[1] <= b)] if len(d1) > 1: bkinfo.loc[index, 'breakpoints1'] = 1 breaklength_max = 0 for _, row2 in d1.iterrows(): length_in = len([k for k in range(a, b) if k in range(row2[1], row2[2])]) length_out = (b - a) - length_in breaklength_max = max(breaklength_max, min(length_in, length_out) + 1) bkinfo.loc[index, 'break_length1'] = breaklength_max # Process species 2 (chr2) c, d = sorted([row['start2'], row['end2']]) d2 = ancestor_top[(ancestor_top[0] == row['chr2']) & (ancestor_top[2] >= c) & (ancestor_top[1] <= d)] if len(d2) > 
1: bkinfo.loc[index, 'breakpoints2'] = 1 breaklength_max = 0 for _, row2 in d2.iterrows(): length_in = len([k for k in range(c, d) if k in range(row2[1], row2[2])]) length_out = (d - c) - length_in breaklength_max = max(breaklength_max, min(length_in, length_out) + 1) bkinfo.loc[index, 'break_length2'] = breaklength_max return bkinfo ================================================ FILE: wgdi/trees.py ================================================ import os import shutil from io import StringIO import numpy as np import pandas as pd from Bio import AlignIO, Seq, SeqIO, SeqRecord import subprocess import wgdi.base as base class trees(): def __init__(self, options): base_conf = base.config() self.position = 'order' self.alignfile = '' self.align_trimming = '' self.trimming = 'trimal' self.threads = '1' self.minimum = 4 self.tree_software = 'iqtree' self.delete_detail = True for k, v in base_conf: setattr(self, str(k), v) for k, v in options: setattr(self, str(k), v) print(str(k), ' = ', v) if hasattr(self, 'codon_position'): self.codon_position = [ int(k)-1 for k in self.codon_position.split(',')] else: self.codon_position = [0, 1, 2] self.delete_detail = base.str_to_bool(self.delete_detail) def grouping(self, alignment): data = [] indexs = [] if not os.path.exists(self.dir): os.makedirs(self.dir) sequence = SeqIO.to_dict(SeqIO.parse(self.sequence_file, "fasta")) if hasattr(self, 'cds_file'): seq_cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta")) for index, row in alignment.iterrows(): file = base.gen_md5_id(str(row.values)) self.sequencefile = os.path.join(self.dir, file+'.fasta') self.alignfile = os.path.join(self.dir, file+'.aln') self.align_trimming = self.alignfile+'.trimming' self.treefile = os.path.join(self.dir, file+'.aln.treefile') if os.path.isfile(self.treefile) and os.path.isfile(self.alignfile): data.append(self.treefile) indexs.append(index) continue ids = [] ids_cds = [] for i in range(len(row)): if type(row[i]) == float and np.isnan(row[i]): 
continue gene_sequence = sequence[row[i]] gene_sequence.id = str(int(i)+1) gene_sequence.description = '' ids.append(gene_sequence) SeqIO.write(ids, self.sequencefile, "fasta") self.align() if hasattr(self, 'cds_file'): self.seqcdsfile = os.path.join(self.dir, file+'.cds.fasta') for i in range(len(row)): if type(row[i]) == float and np.isnan(row[i]): continue gene_cds = seq_cds[row[i]] gene_cds.id = str(int(i)+1) ids_cds.append(gene_cds) SeqIO.write(ids_cds, self.seqcdsfile, "fasta") self.pal2nal() self.codon() if self.trimming.upper() == 'TRIMAL': self.trimal() if self.trimming.upper() == 'DIVVIER': self.divvier() self.buildtrees() if os.path.isfile(self.treefile): data.append(self.treefile) return data def codon(self): if self.codon_position == [0, 1, 2]: shutil.move(self.alignfile+'.mrtrans', self.alignfile) return True records = list(SeqIO.parse(self.alignfile+'.mrtrans', 'fasta')) if len(records) == 0: return False newrecords = [] def final_list(test_list, x, y): return [ test_list[i+j] for i in range(0, len(test_list), x) for j in y] for k in records: if len(k.seq) % 3 > 0: return False seq = final_list(k.seq, 3, self.codon_position) k.seq = ''.join(seq) newrecords.append(SeqRecord.SeqRecord( Seq.Seq(k.seq), id=k.id, description='')) SeqIO.write(newrecords, self.alignfile, 'fasta') return True def pal2nal(self): args = ['perl', self.pal2nal_path, self.alignfile, self.seqcdsfile, '-output fasta', '>'+self.alignfile+'.mrtrans'] command = ' '.join(args) try: os.system(command) except: return False return True def align(self): if self.align_software == 'mafft': try: command = [self.mafft_path,'--quiet', self.sequencefile, '>', self.alignfile] subprocess.run(" ".join(command), shell=True, check=True) except subprocess.CalledProcessError as e: print(f"Error while running MAFFT: {e}") if self.align_software == 'muscle': try: command = [self.muscle_path,'-align', self.sequencefile, '-output', self.alignfile, '-quiet'] subprocess.run(" ".join(command), shell=True, 
check=True) except subprocess.CalledProcessError as e: print(f"Error while running Muscle: {e}") def trimal(self): args = [self.trimal_path, '-in', self.alignfile, '-out', self.align_trimming, '-automated1'] command = ' '.join(args) try: os.system(command) except: return False return True def divvier(self): args = [self.divvier_path, '-mincol', '4', '-divvygap', self.alignfile] command = ' '.join(args) try: os.system(command) os.rename(self.alignfile+'.divvy.fas', self.align_trimming) except: return False return True def buildtrees(self): try: if self.tree_software.upper() == 'IQTREE': args = [self.iqtree_path, '-s', self.align_trimming, '-m', self.model, '-T', self.threads, '--quiet'] command = ' '.join(args) os.system(command) os.rename(self.align_trimming+'.treefile', self.treefile) elif self.tree_software.upper() == 'FASTTREE': args = [self.fasttree_path, self.align_trimming, '>', self.treefile] command = ' '.join(args) os.system(command) except: return False if self.delete_detail == True: for file in (self.sequencefile, self.align_trimming+'.bionj', self.align_trimming+'.iqtree', self.align_trimming+'.ckp.gz', self.align_trimming+'.log', self.align_trimming+'.mldist', self.align_trimming+'.model.gz'): try: os.remove(file) except OSError: pass return True def run(self): alignment = pd.read_csv(self.alignment, header=None) alignment.replace('.', np.nan, inplace=True) alignment.dropna(thresh=int(self.minimum), inplace=True) if hasattr(self, 'gff') and hasattr(self, 'lens'): gff = base.newgff(self.gff) lens = base.newlens(self.lens, self.position) alignment = pd.merge( alignment, gff[['chr', self.position]], left_on=0, right_on=gff.index, how='left') alignment.dropna(subset=['chr', 'order'], inplace=True) alignment['order'] = alignment['order'].astype(int) alignment = alignment[alignment['chr'].isin(lens.index)] alignment.drop(alignment.columns[-2:], axis=1, inplace=True) data = self.grouping(alignment) fout = open(self.trees_file, 'w') fout.close() for i in 
range(0, len(data), 100):
            trees = ' '.join([str(k) for k in data[i:i+100]])
            args = ['cat', trees, '>>', self.trees_file]
            command = ' '.join([str(k) for k in args])
            os.system(command)
        df = pd.read_csv(self.trees_file, header=None, sep='\t')
        df[0].to_csv(self.trees_file, index=None, sep='\t', header=False)
        print("done")

================================================
FILE: wgdi.egg-info/PKG-INFO
================================================
Metadata-Version: 2.1
Name: wgdi
Version: 0.75
Summary: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes
Home-page: https://github.com/SunPengChuan/wgdi
Author: Pengchuan Sun
Author-email: sunpengchuan@gmail.com
License: BSD License
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.1.0
Requires-Dist: numpy
Requires-Dist: biopython
Requires-Dist: matplotlib
Requires-Dist: scipy
Requires-Dist: tabulate

# WGDI

![Latest PyPI version](https://img.shields.io/pypi/v/wgdi.svg)
[![Downloads](https://pepy.tech/badge/wgdi/month)](https://pepy.tech/project/wgdi)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/wgdi/README.html)

| | |
| --- | --- |
| Author | Pengchuan Sun ([sunpengchuan](https://github.com/sunpengchuan)) |
| Email | |
| License | [BSD](http://creativecommons.org/licenses/BSD/) |

## Description

**WGDI (Whole-Genome Duplication Integrated analysis)** is a Python-based command-line tool designed to simplify the analysis of whole-genome duplications (WGD) and cross-species genome alignments. It offers three main workflows that enhance the detection and study of WGD events:

## Key Features

### 1.
Polyploid Inference

- Identifies and confirms polyploid events with high accuracy.

### 2. Genomic Homology Inference

- Traces the evolutionary history of duplicated regions across species, with a focus on distinguishing subgenomes.

### 3. Ancestral Karyotyping

- Reconstructs protochromosomes and traces common chromosomal rearrangements to understand chromosome evolution.

## Installation

WGDI is a Python package and command-line interface for the analysis of whole-genome duplications. It can be deployed on Windows, Linux, and macOS, and can be installed via pip or conda.

#### Bioconda

```
conda install -c bioconda wgdi
```

#### PyPI

```
pip3 install wgdi
```

Documentation for installation, along with a user tutorial, a default parameter file, and test data, is provided. Please consult the docs at .

## Tips

Here are some videos with simple examples of WGDI.

###### [Simple usage of WGDI (Part 1)](https://www.bilibili.com/video/BV1qK4y1U7eK) or https://youtu.be/k-S6FVcBIQw

###### [Simple usage of WGDI (Part 2)](https://www.bilibili.com/video/BV195411P7L1) or https://youtu.be/QiZYFYGclyE

QQ chat group: 966612552

## Citing WGDI

If you use wgdi in your work, please cite:

> Sun P., Jiao B., Yang Y., Shan L., Li T., Li X., Xi Z., Wang X., and Liu J. (2022). WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant. doi: https://doi.org/10.1016/j.molp.2022.10.018.

## News

## 0.75

* Fixed some issues (-fpd).
* Introduced a threads parameter for the iqtree command within alignmenttrees (-at).

## 0.74

* Improved the fusion positions dataset (-fpd).
* Fixed some issues (-pc).

## 0.7.1

* Added extraction of the fusion positions dataset (-fpd).
* Added detection of whether these fusion events occur in other genomes (-fd).
* Improved the karyotype_mapping (-km) effect.
* Fixed a problem caused by the Python version; WGDI is now compatible with Python 3.12.

## 0.6.5

* Fixed some issues (-sf).
* Added new tips to avoid some errors.

## 0.6.4
* Fixed a problem caused by the Python version; WGDI is now compatible with Python 3.11.3.

## 0.6.3
* Fixed some issues (-ks, -sf).

## 0.6.2
* Added detection of shared fusions between species (-sf).

## 0.6.1
* Fixed an issue with alignment (-a). Only version 0.6.0 has this bug.

## 0.6.0
* Fixed an issue with improved collinearity (-icl).
* Added a parameter 'tandem_ratio' to blockinfo (-bi).

## 0.5.9
* Updated the improved collinearity (-icl). It is faster than before, but still slower than MCScanX and JCVI.
* Fixed an issue with ancestral karyotype repertoire (-akr).

## 0.5.8
* Fixed an issue with gene names (-ks).

## 0.5.7
- Fixed an issue with chromosome order (-ak).
- Fixed an issue with gene names (-ks). The issue is not fixed in this version; please install the latest version.

## 0.5.5 and 0.5.6
* Added ancestral karyotype (-ak).
* Added ancestral karyotype repertoire (-akr).

## 0.5.4
* Improved the karyotype_mapping (-km) effect.
* Minor changes (-at).

## 0.5.3
* Fixed a legend issue (-kf).
* Fixed a Ks calculation issue (-ks).
* Improved the karyotype_mapping (-km) effect.
* Improved the alignmenttrees (-at) effect.

## 0.5.2
* Fixed some bugs.

## 0.5.1
* Fixed an error in the command (-conf).
* Improved the karyotype_mapping (-km) effect.
* Added a supported data set type for alignmenttrees (-at): low-copy data sets (for example, single-copy_groups.tsv from the sonicparanoid2 software).

## 0.4.9
* Added karyotype_mapping (-km) and karyotype (-k) display.
* Changed how the pvalue is calculated when extracting from collinearity (-icl), making this parameter more sensitive. It is therefore recommended to set it to 0.2 instead of 0.05.
* Changed the plot rendering of ksfigure (-kf) to improve its appearance.

================================================
FILE: wgdi.egg-info/SOURCES.txt
================================================
LICENSE
README.md
setup.py
wgdi/__init__.py
wgdi/align_dotplot.py
wgdi/ancestral_karyotype.py
wgdi/ancestral_karyotype_repertoire.py
wgdi/base.py
wgdi/block_correspondence.py
wgdi/block_info.py
wgdi/block_ks.py
wgdi/circos.py
wgdi/collinearity.py
wgdi/dotplot.py
wgdi/fusion_positions_database.py
wgdi/fusions_detection.py
wgdi/karyotype.py
wgdi/karyotype_mapping.py
wgdi/ks.py
wgdi/ks_peaks.py
wgdi/ksfigure.py
wgdi/peaksfit.py
wgdi/pindex.py
wgdi/polyploidy_classification.py
wgdi/retain.py
wgdi/run.py
wgdi/run_colliearity.py
wgdi/shared_fusion.py
wgdi/trees.py
wgdi.egg-info/PKG-INFO
wgdi.egg-info/SOURCES.txt
wgdi.egg-info/dependency_links.txt
wgdi.egg-info/entry_points.txt
wgdi.egg-info/requires.txt
wgdi.egg-info/top_level.txt
wgdi.egg-info/zip-safe
wgdi/example/__init__.py
wgdi/example/align.conf
wgdi/example/alignmenttrees.conf
wgdi/example/ancestral_karyotype.conf
wgdi/example/ancestral_karyotype_repertoire.conf
wgdi/example/blockinfo.conf
wgdi/example/blockks.conf
wgdi/example/circos.conf
wgdi/example/collinearity.conf
wgdi/example/conf.ini
wgdi/example/corr.conf
wgdi/example/dotplot.conf
wgdi/example/fusion_positions_database.conf
wgdi/example/fusions_detection.conf
wgdi/example/karyotype.conf
wgdi/example/karyotype_mapping.conf
wgdi/example/ks.conf
wgdi/example/ks_fit_result.csv
wgdi/example/ksfigure.conf
wgdi/example/kspeaks.conf
wgdi/example/peaksfit.conf
wgdi/example/pindex.conf
wgdi/example/polyploidy_classification.conf
wgdi/example/retain.conf
wgdi/example/shared_fusion.conf

================================================
FILE: wgdi.egg-info/dependency_links.txt
================================================


================================================
FILE: wgdi.egg-info/entry_points.txt
================================================
[console_scripts]
wgdi = wgdi.run:main

================================================
FILE: wgdi.egg-info/requires.txt
================================================
pandas>=1.1.0
numpy
biopython
matplotlib
scipy
tabulate

================================================
FILE: wgdi.egg-info/top_level.txt
================================================
wgdi

================================================
FILE: wgdi.egg-info/zip-safe
================================================