Repository: SunPengChuan/wgdi
Branch: master
Commit: 00375818da64
Files: 115
Total size: 311.6 KB

Directory structure:
gitextract_p42u6yxa/

├── LICENSE
├── README.md
├── __init__.py
├── build/
│   └── lib/
│       └── wgdi/
│           ├── __init__.py
│           ├── align_dotplot.py
│           ├── ancestral_karyotype.py
│           ├── ancestral_karyotype_repertoire.py
│           ├── base.py
│           ├── block_correspondence.py
│           ├── block_info.py
│           ├── block_ks.py
│           ├── circos.py
│           ├── collinearity.py
│           ├── dotplot.py
│           ├── example/
│           │   ├── __init__.py
│           │   ├── align.conf
│           │   ├── alignmenttrees.conf
│           │   ├── ancestral_karyotype.conf
│           │   ├── ancestral_karyotype_repertoire.conf
│           │   ├── blockinfo.conf
│           │   ├── blockks.conf
│           │   ├── circos.conf
│           │   ├── collinearity.conf
│           │   ├── conf.ini
│           │   ├── corr.conf
│           │   ├── dotplot.conf
│           │   ├── fusion_positions_database.conf
│           │   ├── fusions_detection.conf
│           │   ├── karyotype.conf
│           │   ├── karyotype_mapping.conf
│           │   ├── ks.conf
│           │   ├── ks_fit_result.csv
│           │   ├── ksfigure.conf
│           │   ├── kspeaks.conf
│           │   ├── peaksfit.conf
│           │   ├── pindex.conf
│           │   ├── polyploidy_classification.conf
│           │   ├── retain.conf
│           │   └── shared_fusion.conf
│           ├── fusion_positions_database.py
│           ├── fusions_detection.py
│           ├── karyotype.py
│           ├── karyotype_mapping.py
│           ├── ks.py
│           ├── ks_peaks.py
│           ├── ksfigure.py
│           ├── peaksfit.py
│           ├── pindex.py
│           ├── polyploidy_classification.py
│           ├── retain.py
│           ├── run.py
│           ├── run_colliearity.py
│           ├── shared_fusion.py
│           └── trees.py
├── command.txt
├── dist/
│   └── wgdi-0.75-py3-none-any.whl
├── setup.py
├── wgdi/
│   ├── __init__.py
│   ├── align_dotplot.py
│   ├── ancestral_karyotype.py
│   ├── ancestral_karyotype_repertoire.py
│   ├── base.py
│   ├── block_correspondence.py
│   ├── block_info.py
│   ├── block_ks.py
│   ├── circos.py
│   ├── collinearity.py
│   ├── dotplot.py
│   ├── example/
│   │   ├── __init__.py
│   │   ├── align.conf
│   │   ├── alignmenttrees.conf
│   │   ├── ancestral_karyotype.conf
│   │   ├── ancestral_karyotype_repertoire.conf
│   │   ├── blockinfo.conf
│   │   ├── blockks.conf
│   │   ├── circos.conf
│   │   ├── collinearity.conf
│   │   ├── conf.ini
│   │   ├── corr.conf
│   │   ├── dotplot.conf
│   │   ├── fusion_positions_database.conf
│   │   ├── fusions_detection.conf
│   │   ├── karyotype.conf
│   │   ├── karyotype_mapping.conf
│   │   ├── ks.conf
│   │   ├── ks_fit_result.csv
│   │   ├── ksfigure.conf
│   │   ├── kspeaks.conf
│   │   ├── peaksfit.conf
│   │   ├── pindex.conf
│   │   ├── polyploidy_classification.conf
│   │   ├── retain.conf
│   │   └── shared_fusion.conf
│   ├── fusion_positions_database.py
│   ├── fusions_detection.py
│   ├── karyotype.py
│   ├── karyotype_mapping.py
│   ├── ks.py
│   ├── ks_peaks.py
│   ├── ksfigure.py
│   ├── peaksfit.py
│   ├── pindex.py
│   ├── polyploidy_classification.py
│   ├── retain.py
│   ├── run.py
│   ├── run_colliearity.py
│   ├── shared_fusion.py
│   └── trees.py
└── wgdi.egg-info/
    ├── PKG-INFO
    ├── SOURCES.txt
    ├── dependency_links.txt
    ├── entry_points.txt
    ├── requires.txt
    ├── top_level.txt
    └── zip-safe

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
Copyright (c) 2018-2018, Pengchuan Sun

All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list
of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

================================================
FILE: README.md
================================================
# WGDI

![Latest PyPI version](https://img.shields.io/pypi/v/wgdi.svg) [![Downloads](https://pepy.tech/badge/wgdi/month)](https://pepy.tech/project/wgdi) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/wgdi/README.html)

| | |
| --- | --- |
| Author  | Pengchuan Sun ([sunpengchuan](https://github.com/sunpengchuan)) |
| Email   | <sunpengchuan@gmail.com> |
| License | [BSD](http://creativecommons.org/licenses/BSD/) |

## Description

**WGDI (Whole-Genome Duplication Integrated analysis)** is a Python-based command-line tool designed to simplify the analysis of whole-genome duplications (WGD) and cross-species genome alignments. It offers three main workflows that enhance the detection and study of WGD events:

## Key Features

### 1. Polyploid Inference
- Identifies and confirms polyploid events with high accuracy.

### 2. Genomic Homology Inference
- Traces the evolutionary history of duplicated regions across species, with a focus on distinguishing subgenomes. 

### 3. Ancestral Karyotyping
- Reconstructs protochromosomes and traces common chromosomal rearrangements to understand chromosome evolution. 


## Installation

WGDI is a Python package and command-line interface (CLI) for the analysis of whole-genome duplications. It can be deployed on Windows, Linux, and macOS, and can be installed via pip or conda.

#### Bioconda

```
conda install -c bioconda  wgdi
```

#### PyPI

```
pip3 install wgdi
```

Installation documentation, a user tutorial, a default parameter file, and test data are provided. Please consult the docs at <http://wgdi.readthedocs.io/en/latest/>.
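
After either install command, one quick way to confirm that the package is visible to the current Python environment is to query its installed version. The sketch below uses only the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is hypothetical and is not part of WGDI.

```python
from importlib import metadata


def installed_version(package="wgdi"):
    """Return the installed version string of `package`, or None if it is absent.

    Hypothetical helper for illustration only; it is not part of WGDI.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"wgdi {version}" if version else "wgdi is not installed")
```

If the script reports that wgdi is not installed, the install command was probably run in a different environment from the one executing the script.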

## Tips

Here are some videos with simple examples of WGDI.

###### [Basic usage of WGDI (Part 1)](https://www.bilibili.com/video/BV1qK4y1U7eK) or https://youtu.be/k-S6FVcBIQw

###### [Basic usage of WGDI (Part 2)](https://www.bilibili.com/video/BV195411P7L1) or https://youtu.be/QiZYFYGclyE

QQ chat group: 966612552

## Citing WGDI

If you use wgdi in your work, please cite:

> Sun P., Jiao B., Yang Y., Shan L., Li T., Li X., Xi Z., Wang X., and Liu J. (2022). WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant. doi: https://doi.org/10.1016/j.molp.2022.10.018.

## News

## 0.75
* Fixed some issues (-fpd, -km).
* Introduced a threads parameter for the iqtree command within alignmenttrees (-at).

## 0.74
* Improved the fusion positions dataset (-fpd).
* Fixed some issues (-pc).

## 0.7.1
* Added extraction of the fusion positions dataset (-fpd).
* Added detection of whether these fusion events occur in other genomes (-fd).
* Improved the karyotype_mapping (-km) effect.
* Fixed a problem caused by the Python version; WGDI is now compatible with Python 3.12.


## 0.6.5
* Fixed some issues (-sf).
* Added new tips to avoid some errors.

## 0.6.4
* Fixed a problem caused by the Python version; WGDI is now compatible with Python 3.11.3.

## 0.6.3
* Fixed some issues (-ks, -sf).

## 0.6.2
* Added detection of shared fusions between species (-sf).

## 0.6.1

* Fixed issue with alignment (-a). Only version 0.6.0 has this bug.

## 0.6.0

* Fixed issue with improved collinearity (-icl).
* Added a parameter 'tandem_ratio' to blockinfo (-bi).

## 0.5.9

* Updated the improved collinearity (-icl). It is faster than before, but still slower than MCScanX and JCVI.
* Fixed issue with ancestral karyotype repertoire (-akr).

## 0.5.8

* Fixed issue with gene names (-ks).

## 0.5.7
* Fixed issue with chromosome order (-ak).
* Fixed issue with gene names (-ks). The fix in this version is incomplete; please install the latest version.

## 0.5.5 and 0.5.6
* Added ancestral karyotype (-ak).
* Added ancestral karyotype repertoire (-akr).

## 0.5.4
* Improved the karyotype_mapping (-km) effect.
* Minor changes (-at).

## 0.5.3
* Fixed a legend issue (-kf).
* Fixed a Ks calculation issue (-ks).
* Improved the karyotype_mapping (-km) effect.
* Improved the alignmenttrees (-at) effect.

## 0.5.2
* Fixed some bugs.

## 0.5.1
* Fixed the error of the command (-conf).
* Improved the karyotype_mapping (-km) effect.
* Added support for low-copy data sets as input to alignmenttrees (-at), for example single-copy_groups.tsv from the SonicParanoid2 software.

## 0.4.9
* Added karyotype_mapping (-km) and karyotype (-k) display.
* Changed how the p-value is extracted from collinearity (-icl), making this parameter more sensitive. It is therefore recommended to set it to 0.2 instead of 0.05.
* Changed the plot rendering of ksfigure (-kf) to improve its appearance.


================================================
FILE: __init__.py
================================================


================================================
FILE: build/lib/wgdi/__init__.py
================================================


================================================
FILE: build/lib/wgdi/align_dotplot.py
================================================
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base

class align_dotplot:
    def __init__(self, options):
        # Default values
        self.position = 'order'
        self.figsize = 'default'
        self.classid = 'class1'

        # Initialize from options
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{k} = {v}')
        
        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
        self.colors = [str(k) for k in getattr(self, 'colors', 'red,blue,green,black,orange').split(',')]
        self.ancestor_top = None if getattr(self, 'ancestor_top', 'none') == 'none' else self.ancestor_top
        self.ancestor_left = None if getattr(self, 'ancestor_left', 'none') == 'none' else self.ancestor_left

        # Fall back to False when the option is absent from the configuration
        self.blockinfo_reverse = base.str_to_bool(getattr(self, 'blockinfo_reverse', False))

    def pair_position(self, alignment, loc1, loc2, colors):
        alignment.index = alignment.index.map(loc1)
        data = []
        for i, k in enumerate(alignment.columns):
            df = alignment[k].map(loc2).dropna()
            for idx, row in df.items():
                data.append([idx, row, colors[i]])
        return pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])

    def run(self):
        axis = [0, 1, 1, 0]

        # Lens generation and figure size
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10
            
        plt.rcParams['ytick.major.pad'] = 0

        # Create plot
        fig, ax = plt.subplots(figsize=self.figsize)
        ax.xaxis.set_ticks_position('top')
        step1, step2 = 1 / float(lens1.sum()), 1 / float(lens2.sum())

        # Process Ancestor Data
        if self.ancestor_left:
            axis[0] = -0.02
            lens_ancestor_left = self.process_ancestor(self.ancestor_left, lens1.index)

        if self.ancestor_top:
            axis[3] = -0.02
            lens_ancestor_top = self.process_ancestor(self.ancestor_top, lens2.index)

        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, 
                           self.genome1_name, self.genome2_name, [0, 1])

        # Process GFF files
        gff1, gff2 = base.newgff(self.gff1), base.newgff(self.gff2)
        gff1 = base.gene_location(gff1, lens1, step1, self.position)
        gff2 = base.gene_location(gff2, lens2, step2, self.position)

        if self.ancestor_top:
            self.ancestor_position(ax, gff2, lens_ancestor_top, 'top')

        if self.ancestor_left:
            self.ancestor_position(ax, gff1, lens_ancestor_left, 'left')

        # Process block info and alignment
        bkinfo = self.process_blockinfo(lens1,lens2)
        align = self.alignment(gff1, gff2, bkinfo)
        alignment = align[gff1.columns[-len(bkinfo[self.classid].drop_duplicates()):]]
        alignment.to_csv(self.savefile, header=False)

        # Create scatter plot
        df = self.pair_position(alignment, gff1['loc'], gff2['loc'], self.colors)
        plt.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'], 
                    alpha=0.5, edgecolors=None, linewidths=0, marker='o')

        ax.axis(axis)
        plt.subplots_adjust(left=0.07, right=0.97, top=0.93, bottom=0.03)
        plt.savefig(self.savefig, dpi=500)
        plt.show()

    def process_ancestor(self, ancestor_file, lens_index):
        df = pd.read_csv(ancestor_file, sep="\t", header=None)
        df[0] = df[0].astype(str)
        df[3] = df[3].astype(str)
        df[4] = df[4].astype(int)
        df[4] = df[4] / df[4].max()
        return df[df[0].isin(lens_index)]

    def process_blockinfo(self, lens1, lens2):
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        if self.blockinfo_reverse:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo[self.classid] = bkinfo[self.classid].astype(str)
        return bkinfo[bkinfo['chr1'].isin(lens1.index) & (bkinfo['chr2'].isin(lens2.index))]

    def alignment(self, gff1, gff2, bkinfo):
        gff1['uid'] = gff1['chr'] + 'g' + gff1['order'].astype(str)
        gff2['uid'] = gff2['chr'] + 'g' + gff2['order'].astype(str)
        gff1['id'] = gff1.index
        gff2['id'] = gff2.index
        
        for cl, group in bkinfo.groupby(self.classid):
            name = f'l{cl}'
            gff1[name] = ''
            group = group.sort_values(by=['length'], ascending=True)

            for _, row in group.iterrows():
                block = self.create_block_dataframe(row)
                if block.empty:
                    continue
                block1_min, block1_max = block['block1'].agg(['min', 'max'])
                area = gff1[(gff1['chr'] == row['chr1']) & 
                            (gff1['order'] >= block1_min) & 
                            (gff1['order'] <= block1_max)].index
                
                block['id1'] = (row['chr1'] + 'g' + block['block1'].astype(str)).map(
                    dict(zip(gff1['uid'], gff1.index)))
                block['id2'] = (row['chr2'] + 'g' + block['block2'].astype(str)).map(
                    dict(zip(gff2['uid'], gff2.index)))

                gff1.loc[block['id1'].values, name] = block['id2'].values
                gff1.loc[gff1.index.isin(area) & gff1[name].eq(''), name] = '.'
        return gff1

    def create_block_dataframe(self, row):
        b1, b2, ks = row['block1'].split('_'), row['block2'].split('_'), row['ks'].split('_')
        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))
        block = pd.DataFrame(np.array([b1, b2, ks]).T, columns=['block1', 'block2', 'ks'])
        block['block1'] = block['block1'].astype(int)
        block['block2'] = block['block2'].astype(int)
        block['ks'] = block['ks'].astype(float)
        return block[(block['ks'] <= self.ks_area[1]) & 
                     (block['ks'] >= self.ks_area[0])].drop_duplicates(subset=['block1'], keep='first')

    def ancestor_position(self, ax, gff, lens, mark):
        for _, row in lens.iterrows():
            loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index
            loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2]))].index
            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']
            if mark == 'top':
                width = abs(loc1-loc2)
                loc = [min(loc1, loc2), 0]
                height = -0.02
            elif mark == 'left':
                height = abs(loc1-loc2)
                loc = [-0.02, min(loc1, loc2)]
                width = 0.02
            base.Rectangle(ax, loc, height, width, row[3], row[4])

================================================
FILE: build/lib/wgdi/ancestral_karyotype.py
================================================
import pandas as pd
from Bio import SeqIO
import wgdi.base as base


class ancestral_karyotype:
    def __init__(self, options):
        self.mark = 'aak'
        
        # Set attributes from options
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")

    def run(self):
        # Load and filter data
        gff = base.newgff(self.gff)
        ancestor = base.read_classification(self.ancestor)
        gff = gff[gff['chr'].isin(ancestor[0].values.tolist())]

        # Create new gff copy and initialize required variables
        newgff = gff.copy()
        data, num = [], 1

        # Create dictionary mapping chromosome to order
        chr_arr = ancestor[3].drop_duplicates().to_list()
        chr_dict = {name: idx + 1 for idx, name in enumerate(chr_arr)}
        ancestor['order'] = ancestor[3].map(chr_dict)

        dict1, dict2 = {}, {}

        # Process ancestor and gff information
        for (cla, order), group in ancestor.groupby([4, 'order'], sort=[False, False]):
            for index, row in group.iterrows():
                index1 = gff[(gff['chr'] == row[0]) & (gff['order'] >= row[1]) & (gff['order'] <= row[2])].index
                newgff.loc[index1, 'chr'] = str(num)
                
                # Store results in data
                for k in index1:
                    data.append(newgff.loc[k, :].values.tolist() + [k])

            dict1[str(num)] = cla
            dict2[str(num)] = group[3].values[0]
            num += 1

        # Create dataframe from the data collected
        df = pd.DataFrame(data)

        # Filter based on peptide file
        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta"))
        df = df[df[6].isin(pep.keys())]

        # Assign new names and order
        for name, group in df.groupby(0):
            df.loc[group.index, 'order'] = range(1, len(group) + 1)
            df.loc[group.index, 'newname'] = [f"{self.mark}{name}g{i:05d}" for i in range(1, len(group) + 1)]

        # Set data types and sort
        df['order'] = df['order'].astype(int)
        df = df[[0, 'newname', 1, 2, 3, 'order', 6]].sort_values(by=[0, 'order'])

        # Save output files
        df.to_csv(self.ancestor_gff, sep="\t", index=False, header=None)
        lens = df.groupby(0).max()[[2, 'order']]
        lens.to_csv(self.ancestor_lens, sep="\t", header=None)

        # Add extra columns and save final results
        lens[1] = 1
        lens['color'] = lens.index.map(dict2)
        lens['class'] = lens.index.map(dict1)
        lens[[1, 'order', 'color', 'class']].to_csv(self.ancestor_file, sep="\t", header=None)

        # Update peptide sequences with new IDs and save
        id_dict = df.set_index(6).to_dict()['newname']
        seqs = []

        for seq_record in SeqIO.parse(self.pep_file, "fasta"):
            if seq_record.id in id_dict:
                seq_record.id = id_dict[seq_record.id]
                seqs.append(seq_record)

        SeqIO.write(seqs, self.ancestor_pep, "fasta")


================================================
FILE: build/lib/wgdi/ancestral_karyotype_repertoire.py
================================================

import numpy as np
import pandas as pd
from Bio import SeqIO

import wgdi.base as base

class ancestral_karyotype_repertoire():
    def __init__(self, options):
        self.gap = 5
        self.direction = 0.01
        self.mark = 'aak1s'
        self.blockinfo_reverse = False
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.blockinfo_reverse =  base.str_to_bool(self.blockinfo_reverse)

    def run(self):
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        if self.blockinfo_reverse:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        for index, row in bkinfo.iterrows():
            block1, block2 = row['block1'].split('_'), row['block2'].split('_')
            block1, block2 = [int(k) for k in block1], [int(k) for k in block2]
            if int(block1[1])-int(block1[0]) < 0:
                self.direction = -0.01
            for i in range(1, len(block2)):
                if abs(block1[i]-block1[i-1]) == 1 and abs(block2[i]-block2[i-1]) < int(self.gap):
                    gff1_id = gff1[(gff1['chr'] == str(row['chr1'])) & (
                        gff1['order'] == block1[i])].index[0]
                    order = gff1.loc[gff1_id, 'order']
                    gff1_row = gff1.loc[gff1_id, :].copy()
                    for num in range(block2[i-1], block2[i]):
                        order = order + self.direction
                        id = gff2[(gff2['chr'] == str(row['chr2']))
                                  & (gff2['order'] == num)].index[0]
                        gff1_row['order'] = order
                        gff1.loc[id, :] = gff1_row
        df = gff1.copy()
        df = df.sort_values(by=['chr', 'order'])
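        # Group by a scalar key so that `name` is a plain string; a one-element
        # list key yields tuple names in pandas >= 2.0 and would corrupt the new IDs.
        for name, group in df.groupby('chr'):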
            df.loc[group.index, 'order'] = list(range(1, len(group)+1))
            df.loc[group.index, 'newname'] = list(
                [str(self.mark)+str(name)+'g'+str(i).zfill(5) for i in range(1, len(group)+1)])
        df['order'] = df['order'].astype(int)
        df['oldname'] = df.index
        columns = ['chr', 'newname', 'start',
                   'end', 'strand', 'order', 'oldname']
        df[columns].to_csv(self.ancestor_gff, sep="\t",
                           index=False, header=None)
        lens = df.groupby('chr').max()[['end', 'order']]
        lens['end'] = lens['end'].astype(np.int64)
        lens.to_csv(self.ancestor_lens, sep="\t", header=None)
        ancestor = base.read_classification(self.ancestor)
        for index, row in ancestor.iterrows():
            ancestor.at[index, 1] = 1
            ancestor.at[index, 2] = lens.at[str(row[0]),'order']
        ancestor.to_csv(self.ancestor_new, sep="\t", index=False, header=None)
        id_dict = df['newname'].to_dict()
        seqs = []
        for seq_record in SeqIO.parse(self.ancestor_pep, "fasta"):
            if seq_record.id in id_dict:
                seq_record.id = id_dict[seq_record.id]
            else:
                continue
            seq_record.description = ''
            seqs.append(seq_record)
        SeqIO.write(seqs, self.ancestor_pep_new, "fasta")


================================================
FILE: build/lib/wgdi/base.py
================================================
import configparser
import hashlib
import os
import re

import matplotlib
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
from Bio import SeqIO

import wgdi


def gen_md5_id(item):
    """Generate MD5 hash for the given item."""
    return hashlib.md5(item.encode('utf-8')).hexdigest()


def config():
    """Read configuration from the example conf.ini file."""
    conf = configparser.ConfigParser()
    conf.read(os.path.join(wgdi.__path__[0], 'example/conf.ini'))
    return conf.items('ini')


def load_conf(file, section):
    """Load configuration items from the specified section."""
    conf = configparser.ConfigParser()
    conf.read(file)
    return conf.items(section)


def rewrite(file, section):
    """Rewrite the configuration file to keep only the specified section."""
    conf = configparser.ConfigParser()
    conf.read(file)
    if conf.has_section(section):
        for k in conf.sections():
            if k != section:
                conf.remove_section(k)
        # Use a context manager so the file handle is flushed and closed
        with open(os.path.join(wgdi.__path__[0], 'example/conf.ini'), 'w') as handle:
            conf.write(handle)
        print('Option ini has been modified')
    else:
        print('Option ini no change')


def read_colinearscan(file):
    """Read colinearscan output and parse into data structure."""
    data, b, flag, num = [], [], 0, 1
    with open(file) as f:
        for line in f:
            line = line.strip()
            if re.match(r"the", line):
                num = re.search(r'\d+', line).group()
                b = []
                flag = 1
                continue
            if re.match(r"\>LOCALE", line):
                flag = 0
                p = re.split(':', line)
                if b:
                    data.append([num, b, p[1]])
                b = []
                continue
            if flag == 1:
                a = re.split(r"\s", line)
                b.append(a)
    if b:
        data.append([num, b, p[1]])
    return data


def read_mcscanx(fn):
    """Read mcscanx output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        flag, num = 0, 0
        for line in f1:
            line = line.strip()
            if re.match(r"## Alignment", line):
                flag = 1
                if not b:
                    arr = re.findall(r"[\d+\.]+", line)[0]
                    continue
                data.append([num, b, 0])
                b = []
                num = re.findall(r"\d+", line)[0]
                continue
            if flag == 0:
                continue
            a = re.split(r"\:", line)
            c = re.split(r"\s+", a[1])
            b.append([c[1], c[1], c[2], c[2]])
        if b:
            data.append([num, b, 0])
    return data


def read_jcvi(fn):
    """Read jcvi output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        num = 1
        for line in f1:
            line = line.strip()
            if re.match(r"###", line):
                if b:
                    data.append([num, b, 0])
                    b = []
                num += 1
                continue
            a = re.split(r"\t", line)
            b.append([a[0], a[0], a[1], a[1]])
        if b:
            data.append([num, b, 0])
    return data


def read_collinearity(fn):
    """Read collinearity output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        flag, arr = 0, []
        for line in f1:
            line = line.strip()
            if re.match(r"# Alignment", line):
                flag = 1
                if not b:
                    arr = re.findall(r'[\.\d+]+', line)
                    continue
                data.append([arr[0], b, arr[2]])
                b = []
                arr = re.findall(r'[\.\d+]+', line)
                continue
            if flag == 0:
                continue
            b.append(re.split(r"\s", line))
        if b:
            data.append([arr[0], b, arr[2]])
    return data


def read_ks(file, col):
    """Read KS values from file and select specified column."""
    ks = pd.read_csv(file, sep='\t')
    ks.drop_duplicates(subset=['id1', 'id2'], keep='first', inplace=True)
    ks[col] = ks[col].astype(float)
    ks = ks[ks[col] >= 0]
    ks.index = ks['id1'] + ',' + ks['id2']
    return ks[col]


def get_median(data):
    """Calculate the median of the data list."""
    if not data:
        return 0
    data_sorted = sorted(data)
    half = len(data_sorted) // 2
    return (data_sorted[half] + data_sorted[-(half + 1)]) / 2


def cds_to_pep(cds_file, pep_file, fmt='fasta'):
    """Translate CDS sequences to peptide sequences and write to file."""
    records = list(SeqIO.parse(cds_file, fmt))
    for rec in records:
        rec.seq = rec.seq.translate()
    SeqIO.write(records, pep_file, 'fasta')
    return True


def newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):
    """Filter BLAST results based on score, evalue, and gene locations."""
    blast = pd.read_csv(file, sep="\t", header=None)
    
    # Accept booleans and any capitalization of 'true', not only the exact string
    if str_to_bool(reverse):
        blast[[0, 1]] = blast[[1, 0]]
    blast = blast[(blast[11] >= score) & (blast[10] < evalue) & (blast[1] != blast[0])]
    blast = blast[(blast[0].isin(gene_loc1.index)) & (blast[1].isin(gene_loc2.index))]
    blast.drop_duplicates(subset=[0, 1], keep='first', inplace=True)
    blast[0] = blast[0].astype(str)
    blast[1] = blast[1].astype(str)
    return blast


def newgff(file):
    """Read GFF file and rename columns with appropriate data types."""
    gff = pd.read_csv(file, sep="\t", header=None, index_col=1)
    gff.rename(columns={0: 'chr', 2: 'start', 3: 'end', 4: 'strand', 5: 'order'}, inplace=True)
    gff['chr'] = gff['chr'].astype(str)
    gff['start'] = gff['start'].astype(np.int64)
    gff['end'] = gff['end'].astype(np.int64)
    gff['strand'] = gff['strand'].astype(str)
    gff['order'] = gff['order'].astype(int)
    return gff


def newlens(file, position):
    """Read lens file and select position based on 'order' or 'end'."""
    lens = pd.read_csv(file, sep="\t", header=None, index_col=0)
    lens.index = lens.index.astype(str)
    if position == 'order':
        lens = lens[2]
    elif position == 'end':
        lens = lens[1]
    return lens


def read_classification(file):
    """Read classification data and convert columns to appropriate types."""
    classification = pd.read_csv(file, sep="\t", header=None)
    classification[0] = classification[0].astype(str)
    classification[1] = classification[1].astype(int)
    classification[2] = classification[2].astype(int)
    classification[3] = classification[3].astype(str)
    classification[4] = classification[4].astype(int)
    return classification


def gene_location(gff, lens, step, position):
    """Calculate gene locations based on lens and step."""
    gff = gff[gff['chr'].isin(lens.index)].copy()
    if gff.empty:
        print('Stopped! \n\nChromosomes in gff file and lens file do not correspond.')
        exit(0)
    dict_chr = dict(zip(lens.index, np.append(np.array([0]), lens.cumsum()[:-1].values)))
    gff['loc'] = ''
    for name, group in gff.groupby('chr'):
        gff.loc[group.index, 'loc'] = (dict_chr[name] + group[position]) * step
    return gff


def dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, genome2_name, arr, pad = 0):
    """Set up the dotplot frame with grid lines and labels."""
    for k in lens1.cumsum()[:-1] * step1:
        ax.axhline(y=k, alpha=0.8, color='black', lw=0.5)
    for k in lens2.cumsum()[:-1] * step2:
        ax.axvline(x=k, alpha=0.8, color='black', lw=0.5)
    align = dict(family='DejaVu Sans', style='italic', horizontalalignment="center", verticalalignment="center")
    yticks = lens1.cumsum() * step1 - 0.5 * lens1 * step1
    ax.set_yticks(yticks)
    ax.set_yticklabels(lens1.index, fontsize = 13, family='DejaVu Sans', style='normal')
    ax.tick_params(axis='y', which='major', pad = pad)
    ax.tick_params(axis='x', which='major', pad = pad)
    xticks = lens2.cumsum() * step2 - 0.5 * lens2 * step2
    ax.set_xticks(xticks)
    ax.set_xticklabels(lens2.index, fontsize = 13, family='DejaVu Sans', style='normal')
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    # Axis labels do not depend on `arr`; the parameter is kept for call compatibility.
    ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)
    ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)

def Bezier3(plist, t):
    """Evaluate a quadratic Bezier curve (three control points) at parameter t."""
    p0, p1, p2 = plist
    return p0 * (1 - t) ** 2 + 2 * p1 * t * (1 - t) + p2 * t ** 2


def Bezier4(plist, t):
    """Evaluate a quartic Bezier curve (degree 4, five control points) at parameter t."""
    p0, p1, p2, p3, p4 = plist
    return p0 * (1 - t) ** 4 + 4 * p1 * t * (1 - t) ** 3 + 6 * p2 * t ** 2 * (1 - t) ** 2 + 4 * p3 * (1 - t) * t ** 3 + p4 * t ** 4


def Rectangle(ax, loc, height, width, color, alpha):
    """Draw a rectangle on the axes with specified properties."""
    p = mpatches.Rectangle(loc, width, height, edgecolor=None, facecolor=color, alpha=alpha)
    ax.add_patch(p)

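def str_to_bool(s):
    """Return s unchanged if it is a bool; otherwise True only for the string 'true' (case-insensitive)."""
    if isinstance(s, bool):
        return s
    return str(s).strip().lower() == 'true'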

================================================
FILE: build/lib/wgdi/block_correspondence.py
================================================
import re
import numpy as np
import pandas as pd
import wgdi.base as base

class block_correspondence():
    def __init__(self, options):
        # Default values
        self.tandem = True
        self.pvalue = 0.2
        self.position = 'order'
        self.block_length = 5
        self.tandem_length = 200
        self.tandem_ratio = 1
        self.ks_hit = 0.5

        # Set user-defined options
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)

        # Parse ks_area and homo if present
        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
        self.homo = [float(k) for k in self.homo.split(',')]
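        self.tandem_ratio = float(self.tandem_ratio)
        # Options arrive as strings, so convert before the numeric comparison in remove_ks_hit
        self.ks_hit = float(self.ks_hit)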
        self.tandem = base.str_to_bool(self.tandem)

    def run(self):
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        
        # Load block information from CSV
        bkinfo = pd.read_csv(self.blockinfo)
        bkinfo = self.preprocess_blockinfo(bkinfo, lens1, lens2)
        
        # Initialize correspondence DataFrame
        cor = self.initialize_correspondence(lens1, lens2)
        
        # If no tandem allowed, remove tandem regions
        if not self.tandem:
            bkinfo = self.remove_tandem(bkinfo)
        
        # Remove low KS hits
        bkinfo = self.remove_ks_hit(bkinfo)

        # Find collinearity regions and save results
        collinear_indices = self.collinearity_region(cor, bkinfo, lens1)
        bkinfo.loc[bkinfo.index.isin(collinear_indices), :].to_csv(self.savefile, index=False)

    def preprocess_blockinfo(self, bkinfo, lens1, lens2):
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        
        # Filter by length, chromosome indices, and p-value
        bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & 
                        (bkinfo['chr1'].isin(lens1.index)) & 
                        (bkinfo['chr2'].isin(lens2.index)) & 
                        (bkinfo['pvalue'] <= float(self.pvalue))]
        
        # Filter by tandem ratio if the column exists
        if 'tandem_ratio' in bkinfo.columns:
            bkinfo = bkinfo[bkinfo['tandem_ratio'] <= self.tandem_ratio]
        
        return bkinfo

    def initialize_correspondence(self, lens1, lens2):
        # Create correspondence DataFrame with initial values
        cor = [[k, i, 0, lens1[i], j, 0, lens2[j], float(self.homo[0]), float(self.homo[1])] 
               for k in range(1, int(self.multiple) + 1) 
               for i in lens1.index 
               for j in lens2.index]
        
        cor = pd.DataFrame(cor, columns=['sub', 'chr1', 'start1', 'end1', 'chr2', 'start2', 'end2', 'homo1', 'homo2'])
        cor['chr1'] = cor['chr1'].astype(str)
        cor['chr2'] = cor['chr2'].astype(str)
        
        return cor

    def remove_tandem(self, bkinfo):
        # Remove tandem regions from the DataFrame
        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
        group['start'] = group['start1'] - group['start2']
        group['end'] = group['end1'] - group['end2']
        tandem_condition = (group['start'].abs() <= int(self.tandem_length)) | (group['end'].abs() <= int(self.tandem_length))
        index_to_remove = group[tandem_condition].index
        return bkinfo.drop(index_to_remove)

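    def remove_ks_hit(self, bkinfo):
        # Keep blocks whose fraction of KS values inside ks_area reaches ks_hit.
        # Indices are collected first so the DataFrame is not modified while iterating.
        keep = []
        for index, row in bkinfo.iterrows():
            ks = self.get_ks_value(row['ks'])
            in_area = [k for k in ks if self.ks_area[0] <= k <= self.ks_area[1]]
            # A block without any KS values gets a ratio of 0 instead of a ZeroDivisionError
            ks_ratio = len(in_area) / len(ks) if ks else 0
            if ks_ratio >= self.ks_hit:
                keep.append(index)
        return bkinfo.loc[keep]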

    def get_ks_value(self, ks_str):
        # Extract and return KS values as floats
        ks = ks_str.split('_')
        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))
        return ks

    def collinearity_region(self, cor, bkinfo, lens):
        collinear_indices = []
        for (chr1, chr2), group in bkinfo.groupby(['chr1', 'chr2']):
            group = group.sort_values(by=['length'], ascending=False)
            df = pd.Series(0, index=range(1, int(lens[str(chr1)]) + 1))
            for index, row in group.iterrows():
                # Check homology conditions
                if not self.is_valid_homo(row):
                    continue
                # Update the block series and compute ratio
                b1 = [int(k) for k in row['block1'].split('_')]
                df1 = df.copy()
                df1[b1] += 1
                ratio = (len(df1[df1 > 0]) - len(df[df > 0])) / len(b1)
                if ratio < 0.5:
                    continue
                df[b1] += 1
                collinear_indices.append(index)
        
        return collinear_indices

    def is_valid_homo(self, row):
        # Check if the homology values are within the specified range.
        # self.multiple may be an int or a string, so build the column name with str().
        return self.homo[0] <= row['homo' + str(self.multiple)] <= self.homo[1]


================================================
FILE: build/lib/wgdi/block_info.py
================================================
import numpy as np
import pandas as pd
import wgdi.base as base


class block_info:
    def __init__(self, options):
        self.repeat_number = 20
        self.ks_col = 'ks_NG86'
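        self.blast_reverse = False
        # run() reads self.position, so provide the same default the other modules use
        self.position = 'order'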
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        
        self.repeat_number = int(self.repeat_number)
        self.blast_reverse = base.str_to_bool(self.blast_reverse)

    def block_position(self, collinearity, blast, gff1, gff2, ks):
        data = []
        for block in collinearity:
            blk_homo, blk_ks = [], []

            # Skip blocks with missing gene coordinates in GFF files
            if block[1][0][0] not in gff1.index or block[1][0][2] not in gff2.index:
                continue
            
            # Extract chromosome info
            chr1, chr2 = gff1.at[block[1][0][0], 'chr'], gff2.at[block[1][0][2], 'chr']
            
            # Extract start and end positions
            array1, array2 = [float(i[1]) for i in block[1]], [float(i[3]) for i in block[1]]
            start1, end1 = array1[0], array1[-1]
            start2, end2 = array2[0], array2[-1]
            
            block1, block2 = [], []
            for k in block[1]:
                block1.append(int(float(k[1])))
                block2.append(int(float(k[3])))
                
                # Check for KS values
                pair_ks = self.get_ks_value(ks, k)
                blk_ks.append(pair_ks)

                # Retrieve blast homo data
                if k[0]+","+k[2] in blast.index:
                    blk_homo.append(blast.loc[k[0]+","+k[2], [f'homo{i}' for i in range(1, 6)]].values.tolist())
            
            ks_median, ks_average = self.calculate_ks_statistics(blk_ks)
            homo = self.calculate_homo_statistics(blk_homo)

            blkks = '_'.join([str(k) for k in blk_ks])
            block1 = '_'.join([str(k) for k in block1])
            block2 = '_'.join([str(k) for k in block2])
            
            # Calculate tandem ratio
            tandem_ratio = self.tandem_ratio(blast, gff2, block[1])
            
            # Store the results
            data.append([
                block[0], chr1, chr2, start1, end1, start2, end2, block[2], len(block[1]), 
                ks_median, ks_average, *homo, block1, block2, blkks, tandem_ratio
            ])
        
        # Create a DataFrame with the results
        data_df = pd.DataFrame(data, columns=[
            'id', 'chr1', 'chr2', 'start1', 'end1', 'start2', 'end2', 'pvalue', 'length', 
            'ks_median', 'ks_average', 'homo1', 'homo2', 'homo3', 'homo4', 'homo5', 
            'block1', 'block2', 'ks', 'tandem_ratio'
        ])

        # Calculate density
        data_df['density1'] = data_df['length'] / ((data_df['end1'] - data_df['start1']).abs() + 1)
        data_df['density2'] = data_df['length'] / ((data_df['end2'] - data_df['start2']).abs() + 1)

        return data_df

    def get_ks_value(self, ks, k):
        """Return KS value for the given pair of genes."""
        pair = f"{k[0]},{k[2]}"
        if pair in ks.index:
            return ks[pair]
        pair_rev = f"{k[2]},{k[0]}"
        if pair_rev in ks.index:
            return ks[pair_rev]
        return -1

    def calculate_ks_statistics(self, blk_ks):
        """Calculate KS statistics: median and average."""
        ks_arr = [k for k in blk_ks if k >= 0]
        if len(ks_arr) == 0:
            return -1, -1
        ks_median = base.get_median(ks_arr)
        ks_average = sum(ks_arr) / len(ks_arr)
        return ks_median, ks_average

    def calculate_homo_statistics(self, blk_homo):
        """Calculate homo statistics by averaging across all blocks."""
        df = pd.DataFrame(blk_homo)
        homo = df.mean().values if len(df) > 0 else [-1, -1, -1, -1, -1]
        return homo

    def blast_homo(self, blast, gff1, gff2, repeat_number):
        """Assign homo values based on blast data."""
        index = [group.sort_values(by=11, ascending=False)[:repeat_number].index.tolist() for _, group in blast.groupby(0)]
        blast = blast.loc[np.concatenate([k[:repeat_number] for k in index], dtype=object), [0, 1]]
        blast = blast.assign(homo1=np.nan, homo2=np.nan, homo3=np.nan, homo4=np.nan, homo5=np.nan)

        # Assign homo values
        for i in range(1, 6):
            bluenum = i + 5
            redindex = np.concatenate([k[:i] for k in index], dtype=object)
            blueindex = np.concatenate([k[i:bluenum] for k in index], dtype=object)
            grayindex = np.concatenate([k[bluenum:repeat_number] for k in index], dtype=object)
            blast.loc[redindex, f'homo{i}'] = 1
            blast.loc[blueindex, f'homo{i}'] = 0
            blast.loc[grayindex, f'homo{i}'] = -1
        
        blast['chr1_order'] = blast[0].map(gff1['order'])
        blast['chr2_order'] = blast[1].map(gff2['order'])
        return blast

    def tandem_ratio(self, blast, gff2, block):
        """Calculate tandem ratio for a block."""
        block = pd.DataFrame(block)[[0, 2]].rename(columns={0: 'id1', 2: 'id2'})
        block['order2'] = block['id2'].map(gff2['order'])

        # Filter block_blast data
        block_blast = blast[(blast[0].isin(block['id1'].values)) & (blast[1].isin(block['id2'].values))].copy()
        block_blast = pd.merge(block_blast, block, left_on=0, right_on='id1', how='left')
        block_blast['difference'] = (block_blast['chr2_order'] - block_blast['order2']).abs()

        # Filter based on difference and calculate ratio
        block_blast = block_blast[(block_blast['difference'] <= self.repeat_number) & (block_blast['difference'] > 0)]
        return len(block_blast[0].unique()) / len(block) * len(block_blast) / (len(block) + len(block_blast))

    def run(self):
        """Main function to run the analysis."""
        # Initialize required datasets
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)

        # Filter GFF files based on chromosome indices
        gff1 = gff1[gff1['chr'].isin(lens1.index)]
        gff2 = gff2[gff2['chr'].isin(lens2.index)]

        # Load blast data
        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)
        blast = self.blast_homo(blast, gff1, gff2, self.repeat_number)
        blast.index = blast[0] + ',' + blast[1]

        # Get collinearity data
        collinearity = self.auto_file(gff1, gff2)

        # Load ks data if provided; getattr avoids an AttributeError when the option is missing
        ks_file = getattr(self, 'ks', 'none')
        ks = pd.Series([], dtype=float) if ks_file in ('none', '') else base.read_ks(ks_file, self.ks_col)

        # Get the block position data
        data = self.block_position(collinearity, blast, gff1, gff2, ks)
        data['class1'] = 0
        data['class2'] = 0

        # Save results
        data.to_csv(self.savefile, index=False)

    def auto_file(self, gff1, gff2):
        """Auto-detect and read collinearity file."""
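        with open(self.collinearity) as f:
            # Only the first 30 lines are needed to recognise the format
            p = ' '.join(f.readline() for _ in range(30))

        # Handle different file formats; '## Alignment' must be tested before '# Alignment'
        if 'path length' in p or 'MAXIMUM GAP' in p:
            return base.read_colinearscan(self.collinearity)
        elif 'MATCH_SIZE' in p or '## Alignment' in p:
            return self.process_mcscanx(gff1, gff2)
        elif '# Alignment' in p:
            return base.read_collinearity(self.collinearity)
        elif '###' in p:
            return self.process_jcvi(gff1, gff2)
        # Fail with a clear message instead of returning None
        raise ValueError(f'Unrecognized collinearity file format: {self.collinearity}')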

    def process_mcscanx(self, gff1, gff2):
        """Process MCScanX format collinearity data."""
        col = base.read_mcscanx(self.collinearity)
        collinearity = []
        for block in col:
            newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]
            if newblock:
                for k in newblock:
                    k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']
                collinearity.append([block[0], newblock, block[2]])
        return collinearity

    def process_jcvi(self, gff1, gff2):
        """Process JCVI format collinearity data."""
        col = base.read_jcvi(self.collinearity)
        collinearity = []
        for block in col:
            newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]
            if newblock:
                for k in newblock:
                    k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']
                collinearity.append([block[0], newblock, block[2]])
        return collinearity


================================================
FILE: build/lib/wgdi/block_ks.py
================================================
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base


class block_ks:
    def __init__(self, options):
        # Default parameters
        self.markersize = 0.8
        self.figsize = 'default'
        self.tandem_length = 200
        self.blockinfo_reverse = False
        self.tandem = False
        # Kept as a comma-separated string because it is parsed with split(',') below
        self.area = '0,3'
        self.position = 'order'
        self.ks_col = 'ks_NG86'
        self.pvalue = 0.01
        
        # Overriding default parameters with options
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        
        # Parsing area as a float list
        self.area = [float(k) for k in str(self.area).split(',')]
        self.markersize = float(self.markersize)
        self.tandem_length = int(self.tandem_length)

        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
        # Convert the 'tandem' option. Assigning to self.remove_tandem would overwrite the method of the same name.
        self.tandem = base.str_to_bool(self.tandem)

    def block_position(self, bkinfo, lens1, lens2, step1, step2):
        pos, pairs = [], []
        
        # Create mappings for chromosome positions
        dict_y_chr = dict(zip(lens1.index, np.append([0], lens1.cumsum()[:-1].values)))
        dict_x_chr = dict(zip(lens2.index, np.append([0], lens2.cumsum()[:-1].values)))
        
        # Iterate through block information
        for _, row in bkinfo.iterrows():
            block1 = row['block1'].split('_')
            block2 = row['block2'].split('_')
            ks = row['ks'].split('_')
            
            locy_median = (dict_y_chr[row['chr1']] + 0.5 * (row['end1'] + row['start1'])) * step1
            locx_median = (dict_x_chr[row['chr2']] + 0.5 * (row['end2'] + row['start2'])) * step2
            pos.append([locx_median, locy_median, row['ks_median']])
            
            # Ensure ks length matches block length
            if len(block1) != len(ks):
                ks = ks[1:]
                
            for i in range(len(block1)):
                locy = (dict_y_chr[row['chr1']] + float(block1[i])) * step1
                locx = (dict_x_chr[row['chr2']] + float(block2[i])) * step2
                pairs.append([locx, locy, float(ks[i])])
        
        return pos, pairs

    def remove_tandem(self, bkinfo):
        # Filter for same-chromosome blocks
        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
        
        # Calculate block start and end differences
        group['start'] = group['start1'] - group['start2']
        group['end'] = group['end1'] - group['end2']
        
        # Remove tandems based on threshold
        index = group[(group['start'].abs() <= self.tandem_length) |
                      (group['end'].abs() <= self.tandem_length)].index
        return bkinfo.drop(index)

    def run(self):
        # Initialize axis and chromosome lens
        axis = [0, 1, 1, 0]
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        
        # Parse figsize
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10
        
        # Calculate step sizes
        step1 = 1 / float(lens1.sum())
        step2 = 1 / float(lens2.sum())
        
        # Create figure and axes
        fig, ax = plt.subplots(figsize=self.figsize)
        plt.rcParams['ytick.major.pad'] = 0
        ax.xaxis.set_ticks_position('top')
        
        # Plot dotplot frame
        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,
                           self.genome1_name, self.genome2_name, [0, 1])
        
        # Load block information
        bkinfo = pd.read_csv(self.blockinfo)
        
        # Handle reverse block information (.values makes the swap positional, not label-aligned)
        if self.blockinfo_reverse:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']].values
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']].values
        
        # Filter block information
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & 
                        (bkinfo['chr1'].isin(lens1.index)) & 
                        (bkinfo['chr2'].isin(lens2.index)) & 
                        (bkinfo['pvalue'] < float(self.pvalue))]
        
        # Remove tandem duplicates if required
        if not self.tandem:
            bkinfo = self.remove_tandem(bkinfo)
        
        # Calculate positions and pairs
        pos, pairs = self.block_position(bkinfo, lens1, lens2, step1, step2)
        
        # Filter pairs by ks value
        df = pd.DataFrame(pairs, columns=['loc1', 'loc2', 'ks'])
        df = df[(df['ks'] >= self.area[0]) & (df['ks'] <= self.area[1])]
        df = df.drop_duplicates()

        # Plot scatter (matplotlib.cm.get_cmap was removed in Matplotlib 3.9; pyplot.get_cmap remains available)
        cm = plt.get_cmap('gist_rainbow')
        sc = plt.scatter(df['loc1'], df['loc2'], s=self.markersize, c=df['ks'],
                         alpha=0.9, edgecolors=None, linewidths=0, marker='o', 
                         vmin=self.area[0], vmax=self.area[1], cmap=cm)
        
        # Add colorbar
        cbar = fig.colorbar(sc, shrink=0.5, pad=0.03, fraction=0.1)
        align = dict(family='DejaVu Sans', style='normal',
                     horizontalalignment="center", verticalalignment="center")
        cbar.set_label('Ks', labelpad=12.5, fontsize=16, **align)
        
        # Set axis and save figure
        ax.axis(axis)
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.03)
        plt.savefig(self.savefig, dpi=500)
        plt.show()


================================================
FILE: build/lib/wgdi/circos.py
================================================
import re
import sys

import matplotlib as mpl
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import wgdi.base as base


class circos():
    def __init__(self, options):
        self.figsize = '10,10'
        self.position = 'order'
        self.label_size = 9
        self.label_radius = 0.015
        self.column_names = [None]*100
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.ring_width = float(self.ring_width)
        if hasattr(self, 'legend_square'):
            self.legend_square = [float(k)
                                  for k in self.legend_square.split(',')]
        else:
            self.legend_square = 0.04, 0.04

    def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, linestyle='-'):
        for k in loc_chr:
            start, end = loc_chr[k]
            t = np.arange(start, end, 0.005)
            x, y = (radius) * np.cos(t), (radius) * np.sin(t)
            plt.plot(x, y, linestyle=linestyle,
                     color=color, lw=lw, alpha=alpha)

    def plot_labels(self, root, labels, loc_chr, radius, horizontalalignment="center", verticalalignment="center", fontsize=6,
                    color='black'):
        for k in loc_chr:
            loc = sum(loc_chr[k]) * 0.5
            x, y = radius * np.cos(loc), radius * np.sin(loc)
            self.Wedge(root, (x, y), self.label_radius, 0,
                       360, self.label_radius, 'white', 1)
            if 1 * np.pi < loc < 2 * np.pi:
                loc += np.pi
            plt.text(x, y, labels[k], horizontalalignment=horizontalalignment, verticalalignment=verticalalignment,
                     fontsize=fontsize, color=color, rotation=0)

    def Wedge(self, ax, loc, radius, start, end, width, color, alpha):
        p = mpatches.Wedge(loc, radius, start, end, width=width,
                           edgecolor=None, facecolor=color, alpha=alpha)
        ax.add_patch(p)

    def plot_bar(self, df, radius, length, lw, color, alpha):
        for k in df[df.columns[0]].drop_duplicates().values:
            if str(k) not in color.keys():
                color[str(k)] = 'black'
            if k in ['', np.nan]:
                continue
            df_chr = df.groupby(df.columns[0]).get_group(k)
            x1, y1 = radius * \
                np.cos(df_chr['rad']), radius * np.sin(df_chr['rad'])
            x2, y2 = (radius + length) * \
                np.cos(df_chr['rad']), (radius + length) * \
                np.sin(df_chr['rad'])
            x = np.array(
                [x1.values, x2.values, [np.nan] * x1.size]).flatten('F')
            y = np.array(
                [y1.values, y2.values, [np.nan] * x1.size]).flatten('F')
            plt.plot(x, y, linestyle='-',
                     color=color[str(k)], lw=lw, alpha=alpha)

    def chr_location(self, lens, angle_gap, angle):
        start, end, loc_chr = 0, 0.2*angle_gap, {}
        for k in lens.index:
            end += angle_gap + angle * (float(lens[k]))
            start = end - angle * (float(lens[k]))
            loc_chr[k] = [float(start), float(end)]
        return loc_chr

    def deal_alignment(self, alignment, gff, lens, loc_chr, angle):
        # Strip whitespace inside cells (regex), then blank out '.' placeholders (exact match)
        alignment.replace(r'\s+', '', regex=True, inplace=True)
        alignment.replace('.', '', inplace=True)
        newalignment = alignment.copy()
        for i in range(len(alignment.columns)):
            alignment[i] = alignment[i].astype(str)
            newalignment[i] = alignment[i].map(gff['chr'].to_dict())
        newalignment['loc'] = alignment[0].map(gff[self.position].to_dict())
        newalignment[0] = newalignment[0].astype('str')
        newalignment['loc'] = newalignment['loc'].astype('float')
        # Copy the filtered frame so the 'rad' column is not written to a view
        newalignment = newalignment[newalignment[0].isin(lens.index)].copy()
        newalignment['rad'] = np.nan
        for name, group in newalignment.groupby(0):
            if str(name) not in loc_chr:
                continue
            newalignment.loc[group.index, 'rad'] = loc_chr[str(
                name)][0]+angle * group['loc']
        return newalignment

    def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al):
        alignment.replace(r'\s+', '', regex=True, inplace=True)
        alignment.replace('.', np.nan, inplace=True)
        newalignment = pd.merge(alignment, gff, left_on=0, right_on=gff.index)
        newalignment['rad'] = np.nan
        for name, group in newalignment.groupby('chr'):
            if str(name) not in loc_chr:
                continue
            newalignment.loc[group.index, 'rad'] = loc_chr[str(
                name)][0]+angle * group[self.position]
        newalignment.index = newalignment[0]
        newalignment[0] = newalignment[0].map(newalignment['rad'].to_dict())
        data = []
        for index_al, row_al in al.iterrows():
            for k in alignment.columns[1:]:
                alignment[k] = alignment[k].astype(str)
                group = newalignment[(newalignment['chr'] == row_al['chr']) & (
                    newalignment['order'] >= row_al['start']) & (newalignment['order'] <= row_al['end'])].copy()
                group.loc[:, k] = group.loc[:, k].map(
                    newalignment['rad']).values
                group.dropna(subset=[k], inplace=True)
                group.index = group.index.map(newalignment['rad'].to_dict())
                group['color'] = row_al['color']
                group = group[group[k].notnull()]
                data += group[[0, k, 'color']].values.tolist()
        df = pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])
        return df

    def plot_collinearity(self, data, radius, lw=0.02, alpha=1):
        for name, group in data.groupby('color'):
            x, y = np.array([]), np.array([])
            for index, row in group.iterrows():
                ex1x, ex1y = radius * \
                    np.cos(row['loc1']), radius*np.sin(row['loc1'])
                ex2x, ex2y = radius * \
                    np.cos(row['loc2']), radius*np.sin(row['loc2'])
                ex3x, ex3y = radius * (1-abs(row['loc1']-row['loc2'])/np.pi) * np.cos((row['loc1']+row['loc2'])*0.5), radius * (
                    1-abs(row['loc1']-row['loc2'])/np.pi) * np.sin((row['loc1']+row['loc2'])*0.5)
                x1 = [ex1x, 0.5*ex3x, ex2x]
                y1 = [ex1y, 0.5*ex3y, ex2y]
                step = .002
                t = np.arange(0, 1+step, step)
                xt = base.Bezier3(x1, t)
                yt = base.Bezier3(y1, t)
                x = np.hstack((x, xt, np.nan))
                y = np.hstack((y, yt, np.nan))
            plt.plot(x, y, color=name, lw=lw, alpha=alpha)

    def plot_legend(self, ax, chr_color, width, height):
        (x1, x2) = ax.get_xlim()
        (y1, y2) = ax.get_ylim()
        a = 1000
        for k, v in enumerate(chr_color.keys(), 0):
            h = y1-k//a*height*2
            k = k % a
            if x1 + width * k > x2-width:
                a = k
                h = y1-k//a*height*2
                k = k % a
            loc = [x1 + width * k, h]
            base.Rectangle(ax, loc, height, width, chr_color[v], 1)
            plt.text(loc[0] + width*0.382, h-0.618*height, v, fontsize=12)
        ax.set_ylim(h-2*height, y2)

    def run(self):
        fig, ax = plt.subplots(figsize=self.figsize)
        mpl.rcParams['agg.path.chunksize'] = 100000000
        lens = base.newlens(self.lens, self.position)
        radius, angle_gap = float(self.radius), float(self.angle_gap)
        angle = (2 * np.pi - (int(len(lens))+1.5)
                 * angle_gap) / (int(lens.sum()))
        loc_chr = self.chr_location(lens, angle_gap, angle)
        list_colors = [str(k).strip() for k in re.split(',|:', self.colors)]
        chr_color = dict(zip(list_colors[::2], list_colors[1::2]))
        gff = base.newgff(self.gff)
        if hasattr(self, 'ancestor'):
            ancestor = pd.read_csv(self.ancestor, header=None)
            al = pd.read_csv(self.ancestor_location, sep='\t', header=None)
            al.rename(columns={0: 'chr', 1: 'start',
                               2: 'end', 3: 'color'}, inplace=True)
            al['chr'] = al['chr'].astype(str)
            data = self.deal_ancestor(ancestor, gff, lens, loc_chr, angle, al)
            self.plot_collinearity(data, radius, lw=0.1, alpha=0.8)

        if hasattr(self, 'alignment'):
            alignment = pd.read_csv(self.alignment, header=None)
            print(alignment)
            newalignment = self.deal_alignment(
                alignment, gff, lens, loc_chr, angle)
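            if isinstance(self.column_names, str):
                # Accept a single name as well as a comma-separated list
                names = [str(k) for k in self.column_names.split(',')]
            else:
                names = []
            # Pad with None so every ring has an entry
            names += [None] * max(0, len(newalignment.columns) - len(names))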
            n = 0
            align = dict(family='Arial', verticalalignment="center",
                         horizontalalignment="center")
            print(newalignment)
            for k, v in enumerate(newalignment.columns[1:-2]):
                r = radius + self.ring_width*(k+1)
                print(k,v,r)
                self.plot_circle(loc_chr, r, lw=0.5, alpha=1, color='grey')
                self.plot_bar(newalignment[[v, 'rad']], r + self.ring_width *
                              0.15, self.ring_width*0.7, 0.15, chr_color, 1)
                if n % 2 == 0:
                    loc = 0.05
                    x, y = (r+self.ring_width*0.5) * \
                        np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)
                    plt.text(x, y, names[n], rotation=loc *
                             180 / np.pi, fontsize=self.label_size, **align)
                else:
                    loc = -0.08
                    x, y = (r+self.ring_width*0.5) * \
                        np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)
                    plt.text(x, y, names[n], fontsize=self.label_size,
                             rotation=loc * 180 / np.pi, **align)
                n += 1
        if hasattr(self, 'ancestor'):
            colors = al['color'].drop_duplicates().values.tolist()
            ancestor_chr_color = dict(zip(range(1, len(colors)+1), colors))
            self.plot_legend(ax, ancestor_chr_color,
                             self.legend_square[0], self.legend_square[1])
        if hasattr(self, 'alignment'):
            # Drop the placeholder colour if present; avoids a KeyError when no 'nan' entry exists
            chr_color.pop('nan', None)
            self.plot_legend(
                ax, chr_color, self.legend_square[0], self.legend_square[1])
        labels = self.chr_label + lens.index
        labels = dict(zip(lens.index, labels))
        self.plot_labels(ax, labels, loc_chr, radius +
                         self.ring_width*0.3, fontsize=self.label_size)

        plt.axis('off')
        a = (ax.get_ylim()[1]-ax.get_ylim()[0]) / \
            (ax.get_xlim()[1]-ax.get_xlim()[0])
        fig.set_size_inches(self.figsize[0], self.figsize[0]*a, forward=True)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)


================================================
FILE: build/lib/wgdi/collinearity.py
================================================
import numpy as np
import pandas as pd


class collinearity:
    def __init__(self, options, points):
        # Default values
        self.gap_penalty = -1
        self.over_length = 0
        self.mg1 = 40
        self.mg2 = 40
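        self.pvalue = 1  # user threshold: blocks with a larger p-value are discarded
        self.over_gap = 3  # minimum number of points in a reported block
        self.points = points
        self.p_value = 0  # p-value of the most recently extracted block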
        self.coverage_ratio = 0.8
        
        # Set user-defined options
        for k, v in options:
            setattr(self, str(k), v)

        # Initialize grading and mg values
        self.grading = [50, 40, 25] if not hasattr(self, 'grading') else [int(k) for k in self.grading.split(',')]
        self.mg1, self.mg2 = [40, 40] if not hasattr(self, 'mg') else [int(k) for k in self.mg.split(',')]

        # Convert string values to floats
        self.pvalue = float(self.pvalue)
        self.coverage_ratio = float(self.coverage_ratio)

    def get_matrix(self):
        """Initialize the matrix for the collinearity points."""
        self.points['usedtimes1'] = 0
        self.points['usedtimes2'] = 0
        self.points['times'] = 1
        self.points['score1'] = self.points['grading']
        self.points['score2'] = self.points['grading']
        self.points['path1'] = self.points.index.to_numpy().reshape(len(self.points), 1).tolist()
        self.points['path2'] = self.points['path1']
        self.points_init = self.points.copy()
        self.mat_points = self.points

    def run(self):
        """Run the main collinearity processing."""
        self.get_matrix()
        self.score_matrix()
        data = []

        # Process points for maxPath in the positive direction
        points1 = self.points[['loc1', 'loc2', 'score1', 'path1', 'usedtimes1']].sort_values(by=['score1'], ascending=False)
        points1.drop(index=points1[points1['usedtimes1'] < 1].index, inplace=True)
        points1.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']
        
        while (self.over_length >= self.over_gap or len(points1) >= self.over_gap):
            if self.max_path(points1):
                if self.p_value > self.pvalue:
                    continue
                data.append([self.path, self.p_value, self.score])

        # Process points for maxPath in the negative direction
        points2 = self.points[['loc1', 'loc2', 'score2', 'path2', 'usedtimes2']].sort_values(by=['score2'], ascending=False)
        points2.drop(index=points2[points2['usedtimes2'] < 1].index, inplace=True)
        points2.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']

        while (self.over_length >= self.over_gap) or (len(points2) >= self.over_gap):
            if self.max_path(points2):
                if self.p_value > self.pvalue:
                    continue
                data.append([self.path, self.p_value, self.score])

        return data

    def score_matrix(self):
        """Calculate the scoring matrix for the points."""
        for index, row, col in self.points[['loc1', 'loc2']].itertuples():
            # Get points within a certain range
            points = self.points[(self.points['loc1'] > row) & 
                                 (self.points['loc2'] > col) & 
                                 (self.points['loc1'] < row + self.mg1) & 
                                 (self.points['loc2'] < col + self.mg2)]
            
            row_i_old, gap = row, self.mg2
            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():
                if col_j - col > gap and row_i > row_i_old:
                    break
                score = grading + (row_i - row + col_j - col) * self.gap_penalty
                score1 = score + self.points.at[index, 'score1']
                if score > 0 and self.points.at[index_ij, 'score1'] < score1:
                    self.points.at[index_ij, 'score1'] = score1
                    self.points.at[index, 'usedtimes1'] += 1
                    self.points.at[index_ij, 'usedtimes1'] += 1
                    self.points.at[index_ij, 'path1'] = self.points.at[index, 'path1'] + [index_ij]
                    gap = min(col_j - col, gap)
                    row_i_old = row_i

        # Reverse processing to handle negative direction
        points_reverse = self.points.sort_values(by=['loc1', 'loc2'], ascending=[False, True])
        for index, row, col in points_reverse[['loc1', 'loc2']].itertuples():
            points = points_reverse[(points_reverse['loc1'] < row) & 
                                    (points_reverse['loc2'] > col) & 
                                    (points_reverse['loc1'] > row - self.mg1) & 
                                    (points_reverse['loc2'] < col + self.mg2)]
            
            row_i_old, gap = row, self.mg2
            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():
                if col_j - col > gap and row_i < row_i_old:
                    break
                score = grading + (row - row_i + col_j - col) * self.gap_penalty
                score2 = score + self.points.at[index, 'score2']
                if score > 0 and self.points.at[index_ij, 'score2'] < score2:
                    self.points.at[index_ij, 'score2'] = score2
                    self.points.at[index, 'usedtimes2'] += 1
                    self.points.at[index_ij, 'usedtimes2'] += 1
                    self.points.at[index_ij, 'path2'] = self.points.at[index, 'path2'] + [index_ij]
                    gap = min(col_j - col, gap)
                    row_i_old = row_i

    def max_path(self, points):
        """Find the maximum path for the given points."""
        if len(points) == 0:
            self.over_length = 0
            return False
        
        # Initialize path score and index
        self.score, self.path_index = points.loc[points.index[0], ['score', 'path']]
        self.path = points[points.index.isin(self.path_index)]
        self.over_length = len(self.path_index)
        
        # Accept the block only if it is long enough and enough of its points
        # are still unused by previously extracted blocks
        if self.over_length >= self.over_gap and len(self.path) / self.over_length > self.coverage_ratio:
            points.drop(index=self.path.index, inplace=True)
            [loc1_min, loc2_min], [loc1_max, loc2_max] = self.path[['loc1', 'loc2']].agg(['min', 'max']).to_numpy()

            # Calculate p-value
            gap_init = self.points_init[(loc1_min <= self.points_init['loc1']) & 
                                        (self.points_init['loc1'] <= loc1_max) & 
                                        (loc2_min <= self.points_init['loc2']) & 
                                        (self.points_init['loc2'] <= loc2_max)].copy()
            
            self.p_value = self.p_value_estimated(gap_init, loc1_max - loc1_min + 1, loc2_max - loc2_min + 1)
            self.path = self.path.sort_values(by=['loc1'], ascending=[True])[['loc1', 'loc2']]
            return True
        else:
            points.drop(index=points.index[0], inplace=True)
        return False

    def p_value_estimated(self, gap, L1, L2):
        """Estimate p-value based on the given gap and lengths."""
        N1 = gap['times'].sum()
        N = len(gap)
        self.points_init.loc[gap.index, 'times'] += 1
        m = len(self.path)
        a = (1 - self.score / m / self.grading[0]) * (N1 - m + 1) / N * (L1 - m + 1) * (L2 - m + 1) / L1 / L2
        return round(a, 4)


================================================
FILE: build/lib/wgdi/dotplot.py
================================================
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import wgdi.base as base


class dotplot():
    def __init__(self, options):
        self.multiple = 1
        self.score = 100
        self.evalue = 1e-5
        self.repeat_number = 20
        self.markersize = 0.5
        self.figsize = 'default'
        self.position = 'order'
        self.ancestor_top = None
        self.ancestor_left = None
        self.blast_reverse = False
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        if self.ancestor_top == 'none' or self.ancestor_top == '':
            self.ancestor_top = None
        if self.ancestor_left == 'none' or self.ancestor_left == '':
            self.ancestor_left = None
        base.str_to_bool(self.blast_reverse)

    def pair_positon(self, blast, gff1, gff2, rednum, repeat_number):
        blast['color'] = ''
        blast['loc1'] = blast[0].map(gff1['loc'])
        blast['loc2'] = blast[1].map(gff2['loc'])
        bluenum = 5+rednum
        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()
                 for name, group in blast.groupby([0])]
        reddata = np.array([k[:rednum] for k in index], dtype=object)
        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)
        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)
        if len(reddata):
            redindex = np.concatenate(reddata)
        else:
            redindex = []
        if len(bluedata):
            blueindex = np.concatenate(bluedata)
        else:
            blueindex = []
        if len(graydata):
            grayindex = np.concatenate(graydata)
        else:
            grayindex = []
        blast.loc[redindex, 'color'] = 'red'
        blast.loc[blueindex, 'color'] = 'blue'
        blast.loc[grayindex, 'color'] = 'gray'
        return blast[blast['color'].str.contains(r'\w')]

    def run(self):
        axis = [0, 1, 1, 0]
        left, right, top, bottom = 0.07, 0.97, 0.93, 0.03
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        step1 = 1 / float(lens1.sum())
        step2 = 1 / float(lens2.sum())
        if self.ancestor_left is not None:
            axis[0] = -0.02
            lens_ancestor_left = pd.read_csv(
                self.ancestor_left, sep="\t", header=None)
            lens_ancestor_left[0] = lens_ancestor_left[0].astype(str)
            lens_ancestor_left[3] = lens_ancestor_left[3].astype(str)
            lens_ancestor_left[4] = lens_ancestor_left[4].astype(int)
            lens_ancestor_left[4] = lens_ancestor_left[4] / lens_ancestor_left[4].max()
            lens_ancestor_left = lens_ancestor_left[lens_ancestor_left[0].isin(
                lens1.index)]
        if self.ancestor_top is not None:
            axis[3] = -0.02
            lens_ancestor_top = pd.read_csv(
                self.ancestor_top, sep="\t", header=None)
            lens_ancestor_top[0] = lens_ancestor_top[0].astype(str)
            lens_ancestor_top[3] = lens_ancestor_top[3].astype(str)
            lens_ancestor_top[4] = lens_ancestor_top[4].astype(int)
            lens_ancestor_top[4] = lens_ancestor_top[4] / lens_ancestor_top[4].max()
            lens_ancestor_top = lens_ancestor_top[lens_ancestor_top[0].isin(
                lens2.index)]
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array(
                [1, float(lens1.sum())/float(lens2.sum())])*10
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        ax.xaxis.set_ticks_position('top')
        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,
                           self.genome1_name, self.genome2_name, [axis[0], axis[3]])
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        gff1 = base.gene_location(gff1, lens1, step1, self.position)
        gff2 = base.gene_location(gff2, lens2, step2, self.position)
        if self.ancestor_top is not None:
            self.aree_left = self.ancestor_posion(ax, gff2, lens_ancestor_top, 'top')
        if self.ancestor_left is not None:
            self.aree_top = self.ancestor_posion(ax, gff1, lens_ancestor_left, 'left')
        print('read gffs')
        blast = base.newblast(self.blast, int(self.score),
                              float(self.evalue), gff1, gff2, self.blast_reverse)
        if len(blast) == 0:
            print('Stopped! \n\nThe gene ids in the blast file do not correspond to gff1 and gff2.')
            # SystemExit works without relying on the interactive-only exit() helper
            raise SystemExit(0)
        print('read blast')
        df = self.pair_positon(blast, gff1, gff2,
                               int(self.multiple), int(self.repeat_number))
        print('deal blast')
        ax.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'],
                   alpha=0.5, edgecolors=None, linewidths=0, marker='o')
        ax.axis(axis)
        plt.subplots_adjust(left=left, right=right, top=top, bottom=bottom)
        plt.savefig(self.savefig, dpi=300)
        plt.show()

    def ancestor_posion(self, ax, gff, lens, mark):
        data = []
        for index, row in lens.iterrows():
            loc1 = gff[(gff['chr'] == row[0]) & (
                gff['order'] == int(row[1]))].index
            loc2 = gff[(gff['chr'] == row[0]) & (
                gff['order'] == int(row[2])-1)].index
            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']
            if mark == 'top':
                width = abs(loc1-loc2)
                loc = [min(loc1, loc2), 0]
                height = -0.02
                base.Rectangle(ax, loc, height, width, row[3], row[4])
            if mark == 'left':
                height = abs(loc1-loc2)
                loc = [-0.02, min(loc1, loc2), ]
                width = 0.02
                base.Rectangle(ax, loc, height, width, row[3], row[4])
            data.append([loc, height, width, row[3], row[4]])
        return data


================================================
FILE: build/lib/wgdi/example/__init__.py
================================================


================================================
FILE: build/lib/wgdi/example/align.conf
================================================
[alignment]
blockinfo = block information file (.csv)
blockinfo_reverse = false
classid =  class1
gff1 =  gff1 file
gff2 =  gff2 file
lens1 = lens1 file
lens2 = lens2 file
genome1_name =  Genome1 name
genome2_name =  Genome2 name
markersize = 0.5
ks_area = -1,3
position = order
colors = red,blue,green
figsize = 10,10
savefile = savefile(.csv)
savefig = save image(.png, .pdf, .svg)

================================================
FILE: build/lib/wgdi/example/alignmenttrees.conf
================================================
[alignmenttrees]
alignment = alignment file (.csv)
gff = gff file (reference genome; delete this line if the alignment has no reference species)
lens = lens file (delete this line if the alignment has no reference species)
dir = output folder
sequence_file = sequence file (.fa)
cds_file = cds file (.fa)
codon_positon = 1,2,3  (1,2 keeps codon positions 1 and 2; 1,2,3 keeps all codon positions)
trees_file =  trees (.nwk)
align_software = (mafft,muscle)
tree_software =  (iqtree,fasttree)
threads = 1 (Number,AUTO)
model = MFP
trimming =  (trimal,divvier)
minimum = 4
delete_detail = true


================================================
FILE: build/lib/wgdi/example/ancestral_karyotype.conf
================================================
[ancestral_karyotype]
gff = gff file (cat the relevant 'gff' files into a file)
pep_file = pep file (cat the relevant 'pep.fa' files into a file)
ancestor = ancestor file  (you must provide this file)
mark = aak 
ancestor_gff =  result file
ancestor_lens =  result file
ancestor_pep =  result file
ancestor_file =  result file

================================================
FILE: build/lib/wgdi/example/ancestral_karyotype_repertoire.conf
================================================
[ancestral_karyotype_repertoire]
blockinfo =  block information (*.csv)
# blockinfo: processed *.csv
blockinfo_reverse =  False
gff1 =  gff1 file (ancestor's gff)
gff2 =  gff2 file (the other species' gff)
gap = 5
mark = aak1s
ancestor = ancestor file 
#current ancestor file
ancestor_new =  result file
ancestor_pep =  ancestor pep file 
#cat all pep files together
ancestor_pep_new =  result file
ancestor_gff =  result file
ancestor_lens =  result file


================================================
FILE: build/lib/wgdi/example/blockinfo.conf
================================================
[blockinfo]
blast = blast file
gff1 =  gff1 file
gff2 =  gff2 file
lens1 = lens1 file
lens2 = lens2 file
collinearity = collinearity file
score = 100
evalue = 1e-5
repeat_number = 20
position = order
ks = ks file
ks_col = ks_NG86
savefile = block information (*.csv)


================================================
FILE: build/lib/wgdi/example/blockks.conf
================================================
[blockks]
lens1 = lens1 file
lens2 = lens2 file
genome1_name =  Genome1 name
genome2_name =  Genome2 name
blockinfo = block information (*.csv)
pvalue = 0.2
tandem = true
tandem_length = 200
markersize = 1
area = 0,2
block_length =  minimum length
figsize = 8,8
savefig = save image(.png, .pdf, .svg)


================================================
FILE: build/lib/wgdi/example/circos.conf
================================================
[circos]
gff =  gff file
lens =  lens file
radius = 0.2
angle_gap = 0.05
ring_width = 0.015
colors  = 1:c,2:m,3:blue,4:gold,5:red,6:lawngreen,7:darkgreen,8:k,9:darkred,10:gray
alignment = alignment file 
chr_label = chr
ancestor = ancestor alignment file 
ancestor_location = ancestor file 
figsize = 10,10
label_size = 9
position = order
legend_square = 0.04, 0.04
column_names = 1,2,3,4,5
savefig = result(.png, .pdf, .svg)


================================================
FILE: build/lib/wgdi/example/collinearity.conf
================================================
[collinearity]
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
blast = blast file
blast_reverse = false
comparison = genomes
multiple  = 1
process = 8
evalue = 1e-5
score = 100
grading = 50,30,25
mg = 25,25
pvalue = 1
repeat_number = 20
position = order
savefile = collinearity file


================================================
FILE: build/lib/wgdi/example/conf.ini
================================================
[ini]
mafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft
pal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2nal.pl
yn00_path = /home/sunpc/micromamba/envs/wgdi/bin/yn00
muscle_path = /home/sunpc/micromamba/envs/wgdi/bin/muscle
iqtree_path =  /home/sunpc/micromamba/envs/wgdi/bin/iqtree
trimal_path = /home/sunpc/micromamba/envs/wgdi/bin/trimal
fasttree_path = /home/sunpc/micromamba/envs/wgdi/bin/fasttree
divvier_path = /home/sunpc/micromamba/envs/wgdi/bin/divvier


================================================
FILE: build/lib/wgdi/example/corr.conf
================================================
[correspondence]
blockinfo =  blockinfo file(.csv) 
lens1 = lens1 file
lens2 = lens2 file
tandem = true
tandem_length = 200
pvalue = 0.2
block_length = 5
tandem_ratio = 0.5
multiple  = 1
homo = -1,1
savefile = savefile(.csv)


================================================
FILE: build/lib/wgdi/example/dotplot.conf
================================================
[dotplot]
blast = blast file
gff1 =  gff1 file
gff2 =  gff2 file
lens1 = lens1 file
lens2 = lens2 file
genome1_name =  Genome1 name
genome2_name =  Genome2 name
multiple  = 1
score = 100
evalue = 1e-5
repeat_number = 10
position = order
blast_reverse = false
ancestor_left = ancestor file or none
ancestor_top = ancestor file or none
markersize = 0.5
figsize = 10,10
savefig = savefile(.png, .pdf, .svg)


================================================
FILE: build/lib/wgdi/example/fusion_positions_database.conf
================================================
[fusion_positions_database]
pep = pep file
gff = gff file
fusion_positions = fusion_positions file
# Number of gene sets on each side of the breakpoint
ancestor_gff =  result file
ancestor_lens =  result file
ancestor_pep =  result file
ancestor_file =  result file


================================================
FILE: build/lib/wgdi/example/fusions_detection.conf
================================================
[fusions_detection]
blockinfo = block information (*.csv)
ancestor = ancestor file
#The number of genes spanned by a synteny block on both sides of a breakpoint.
min_genes_per_side = 5
density = 0.3
filtered_blockinfo = result blockinfo (.csv)


================================================
FILE: build/lib/wgdi/example/karyotype.conf
================================================
[karyotype]
ancestor = ancestor chromosome file
width = 0.5
figsize = 10,6.18
savefig = save image(.png, .pdf, .svg)

================================================
FILE: build/lib/wgdi/example/karyotype_mapping.conf
================================================
[karyotype_mapping]
blast = blast file
blast_reverse = false
gff1 = gff1 file
gff2 = gff2 file 
score = 100
evalue = 1e-5
repeat_number = 5
ancestor_left = ancestor location file (Only one of ('left', 'top') can be reserved)
ancestor_top = ancestor location file
the_other_lens = the other lens file
blockinfo = block information (*.csv)
blockinfo_reverse = false
limit_length = 5
the_other_ancestor_file =  result file 

================================================
FILE: build/lib/wgdi/example/ks.conf
================================================
[ks]
cds_file = cds file
#cat all cds files together
pep_file = pep file
#cat all pep files together
align_software = muscle
pairs_file = gene pairs file
ks_file = ks result

================================================
FILE: build/lib/wgdi/example/ks_fit_result.csv
================================================
,color,linewidth,linestyle,,,,,,
csa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862
vvi_vvi,blue,2,-,3.00367275,1.288717936,0.177816426,,,
vvi_oin_gamma,orange,2,-,1.910418336,1.328469514,0.262257112,,,
vvi_oin,orange,2,--,4.948194212,0.882608858,0.10426873,,,
vvi_csa,green,2,--,2.470770292464022,1.4131842495219498,0.21391959288821544,,,


================================================
FILE: build/lib/wgdi/example/ksfigure.conf
================================================
[ksfigure]
ksfit = ksfit result(*.csv)
labelfontsize = 15
legendfontsize = 15
xlabel = none            
ylabel = none            
title = none
area = 0,2
figsize = 10,6.18
shadow = true (true/false)
savefig =  save image(.png, .pdf, .svg)


================================================
FILE: build/lib/wgdi/example/kspeaks.conf
================================================
[kspeaks]
blockinfo = block information (*.csv)
pvalue = 0.2
tandem = true
block_length = int number
ks_area = 0,10
multiple  = 1
homo = 0,1
fontsize = 9
area = 0,3
figsize = 10,6.18
savefig = saving image(.png,.pdf)
savefile = ks median savefile


================================================
FILE: build/lib/wgdi/example/peaksfit.conf
================================================
[peaksfit]
blockinfo = block information (*.csv)
mode = median
bins_number = 200
ks_area = 0,10
fontsize = 9
area = 0,3
figsize = 10,6.18
shadow = true 
savefig = saving image(.png,.pdf,.svg)

================================================
FILE: build/lib/wgdi/example/pindex.conf
================================================
[pindex]
alignment = alignment file (.csv)
gff = gff file
lens = lens file
gap = 50
retention = 0.05
diff = 0.05
remove_delta = (true/false)
savefile = result file(.csv)


================================================
FILE: build/lib/wgdi/example/polyploidy_classification.conf
================================================
[polyploidy classification]
blockinfo = block information (*.csv)
ancestor_left = ancestor file
ancestor_top = ancestor file
classid = class1,class2
same_protochromosome =  False
same_subgenome =  False
savefile = result file(.csv)

================================================
FILE: build/lib/wgdi/example/retain.conf
================================================
[retain]
alignment = alignment file
gff = gff file
lens = lens file
colors = red,blue,green
refgenome = shorthand
figsize = 10,12
step = 50
ylabel = y label
savefile = retain file (result)
savefig = result(.png, .pdf, .svg)


================================================
FILE: build/lib/wgdi/example/shared_fusion.conf
================================================
[shared_fusion]
blockinfo = block information (*.csv)
# The new lens file is the output filtered by lens file.
lens1 = lens file, new lens file
lens2 =  lens file,  new lens file
ancestor_left = ancestor file
ancestor_top = ancestor file
classid = class1,class2
limit_length = 5
filtered_blockinfo = result blockinfo (.csv)

================================================
FILE: build/lib/wgdi/fusion_positions_database.py
================================================
import pandas as pd
import os
from Bio import SeqIO

class fusion_positions_database:
    def __init__(self, options):
        for k, v in options:
            setattr(self, k, v)
            print(f'{k} = {v}')

    def run(self):
        # Load and remove duplicates from data
        gff = pd.read_csv(self.gff, sep="\t", header=None, dtype={0: str, 5: int}).drop_duplicates()
        pep = SeqIO.to_dict(SeqIO.parse(self.pep, "fasta"))
        df = pd.read_csv(self.fusion_positions, sep="\t", header=None, dtype={0: str, 1: int, 2:int, 3:str}).drop_duplicates()
        
        # Load ancestral sequence file if it exists
        seqs = SeqIO.to_dict(SeqIO.parse(self.ancestor_pep, "fasta")) if os.path.exists(self.ancestor_pep) else {}

        sf_gff, sf_lens = [], []

        # Process fusion positions
        for _, row in df.iterrows():
            newchr = row[3]
            newgff = gff[(gff[0] == row[0]) & 
                         (gff[5] >= row[1] - row[2]) & 
                         (gff[5] < row[1] + row[2])].copy()
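            # Skip windows that contain no genes; newgff.iloc[0, 5] below would raise IndexError
            if newgff.empty:
                continue
            newgff['id'] = [f"{newchr}s{str(row[0]).zfill(2)}g{str(i).zfill(3)}" for i in range(1, len(newgff) + 1)]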

            sf_position = row[1] - newgff.iloc[0, 5]
            sf_lens.append([newchr, sf_position, len(newgff)])
            
            # For each gene in the filtered GFF region
            for _, gff_row in newgff.iterrows():
                if gff_row[1] in pep and gff_row['id'] not in seqs:
                    gene = pep[gff_row[1]][:]
                    gene.id, gene.description = gff_row['id'], ''
                    seqs[gff_row['id']] = gene
                    # Collect data for the final GFF output
                    sf_gff.append([gff_row['id'], newchr, sf_position, gff_row[2], gff_row[3], gff_row[4], gff_row[1]])

        # Write sequences to FASTA file
        SeqIO.write(seqs.values(), self.ancestor_pep, 'fasta')

        # Save filtered GFF data
        if sf_gff:
            sf_gff = pd.DataFrame(sf_gff)
            sf_gff.rename(columns={3: 'start', 4: 'end', 5: 'strand'}, inplace=True)
            sf_gff['order'] = sf_gff[0].str[-3:].astype(int)
            sf_gff[[1, 0, 'start', 'end', 'strand', 'order', 6]].to_csv(self.ancestor_gff, sep="\t", mode='a', index=False, header=None)
            sf_lens = pd.DataFrame(sf_lens).drop_duplicates()
            sf_lens.to_csv(self.ancestor_lens, sep="\t", mode='a', index=False, header=None)

            # Build the ancestor karyotype entries: genes up to the fusion position are red, the rest blue
            ancestor = []
            for _, row in sf_lens.iterrows():
                ancestor.append([row[0], 1, row[1], 'red', 1])
                ancestor.append([row[0], row[1] + 1, row[2], 'blue', 1])
            pd.DataFrame(ancestor).to_csv(self.ancestor_file, sep="\t", mode='a', index=False, header=None)

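        # Remove duplicate lines from the output files. Each tab-separated line is read
        # as a single field (default ',' separator), so whole lines are compared and
        # written back unchanged.
        for file in [self.ancestor_gff, self.ancestor_lens, self.ancestor_file]:
            pd.read_csv(file, header=None).drop_duplicates().to_csv(file, index=False, header=None)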


================================================
FILE: build/lib/wgdi/fusions_detection.py
================================================
import pandas as pd
from tabulate import tabulate

class fusions_detection:
    def __init__(self, options):
        self.min_genes_per_side = 5
        self.density = 0.3
        for k, v in options:
            setattr(self, k, v)
            print(f"{k} = {v}")
        self.min_genes_per_side = int(self.min_genes_per_side)
        self.density = float(self.density)

    def run(self):
        # Load the ancestor file and process the positions
        ancestor = pd.read_csv(self.ancestor, sep='\t', header=None)
        position = ancestor.groupby(0)[2].unique().apply(pd.Series)
        bkinfo = pd.read_csv(self.blockinfo)
        newbkinfo = bkinfo.head(0)
        
        # Iterate over each row in the position dataframe
        for index, row in position.iterrows():
            # Filter the bkinfo dataframe based on chr2 and density
            filtered_group = bkinfo[(bkinfo['chr2'] == index) & (bkinfo['density2'] >= self.density)].copy()
            # Split the block2 column and stack the resulting series
            df = filtered_group['block2'].str.split('_', expand=True).stack().astype(int)
            # Count the number of genes greater and less than the current position
            filtered_group['greater'] = (df > row[0]).groupby(level=0).sum()
            filtered_group['less'] = (df < row[0]).groupby(level=0).sum()
            # Filter the group based on the minimum number of genes per side
            filtered_group = filtered_group[(filtered_group['greater'] >= self.min_genes_per_side) & (filtered_group['less'] >= self.min_genes_per_side)]
            # Concatenate the filtered group with the newbkinfo dataframe
            newbkinfo = pd.concat([newbkinfo, filtered_group])
        if len(newbkinfo) == 0:
            print("\nNo shared fusion breakpoints detected")
            # SystemExit works without relying on the interactive-only exit() helper
            raise SystemExit(0)

        # Get and print the shared fusion positions
        newbkinfo.to_csv(self.filtered_blockinfo, header=True, index=False)
        non_overlap_counts = newbkinfo.groupby('chr2').apply(self.count_non_overlapping)
        data = [(chr2, count) for chr2, count in non_overlap_counts.items()]
        print("\nThe following are the shared fusion breakpoints and counts:")
        print(tabulate(data, headers=["Fusion Breakpoint", "Count"], tablefmt="github"))

    def count_non_overlapping(self, group):
        if len(group) == 1:
            return 1
        grouped = group.groupby('chr1')
        total_count = 0
        for chr1, chr_group in grouped:
            chr_group = chr_group.sort_values(by='start1').reset_index(drop=True)
            count = 0
            current_end = -1 
            for _, row in chr_group.iterrows():
                start1, end1 = row['start1'], row['end1']
                if start1 > current_end:
                    count += 1
                    current_end = end1 
            total_count += count
        return total_count

================================================
FILE: build/lib/wgdi/karyotype.py
================================================
import matplotlib.pyplot as plt
import pandas as pd

import wgdi.base as base


class karyotype():
    def __init__(self, options):
        # Defaults; any key present in options overrides them
        self.width = 0.5
        self.figsize = '10,6.18'
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        self.figsize = [float(k) for k in str(self.figsize).split(',')]
        self.width = float(self.width)

    def run(self):
        fig, ax = plt.subplots(figsize=self.figsize)
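        # Tab-separated ancestor file without header; columns 0-4 are used as
        # chromosome, start, end, color and class
        ancestor_lens = pd.read_csv(
            self.ancestor, sep="\t", header=None)
        ancestor_lens[0] = ancestor_lens[0].astype(str)
        ancestor_lens[3] = ancestor_lens[3].astype(str)
        ancestor_lens[4] = ancestor_lens[4].astype(int)
        # Scale the class column by its maximum before it is passed to base.Rectangle
        ancestor_lens[4] = ancestor_lens[4] / ancestor_lens[4].max()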
        chrs = ancestor_lens[0].drop_duplicates().to_list()
        ax.bar(chrs, 10, color='white', alpha=0)
        for index, row in ancestor_lens.iterrows():
            base.Rectangle(ax, [chrs.index(row[0])-self.width*0.5,
                                row[1]], row[2]-row[1], self.width, row[3], row[4])
        ax.tick_params(labelsize=15)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_visible(False)
        ax.spines['bottom'].set_visible(False)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.savefig(self.savefig, dpi=500)
        plt.show()


================================================
FILE: build/lib/wgdi/karyotype_mapping.py
================================================
import numpy as np
import pandas as pd

import wgdi.base as base


class karyotype_mapping:
    def __init__(self, options):
        # Initialize default attributes
        self.blast_reverse = False
        self.blockinfo_reverse = False
        self.position = 'order'
        self.block_length = 5
        self.limit_length = 5
        self.repeat_number = 20
        self.score = 100
        self.evalue = 1e-5

        # Override the defaults with the provided (key, value) options and print them
        for k, v in options:
            setattr(self, k, v)
            print(f"{k} = {v}")
        
        self.blast_reverse = base.str_to_bool(self.blast_reverse)
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
        self.limit_length = int(self.limit_length)

    def karyotype_left(self, pairs, ancestor, gff1, gff2):
        # Loop through each row in ancestor to set color and classification in gff1
        for _, row in ancestor.iterrows():
            loc_min, loc_max = sorted([row[1], row[2]])
            index1 = gff1[(gff1['chr'] == row[0]) &
                          (gff1['order'] >= loc_min) &
                          (gff1['order'] <= loc_max)].index
            gff1.loc[index1, ['color', 'classification']] = row[3], row[4]

        # Merge pairs with gff1 and update gff2 with color and classification
        data = pd.merge(pairs, gff1, left_on=0, right_index=True, how='left')
        data.drop_duplicates(subset=[1], inplace=True)
        data.set_index(1, inplace=True)
        gff2.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]
        return gff2

    def karyotype_top(self, pairs, ancestor, gff1, gff2):
        # Loop through each row in ancestor to set color and classification in gff2
        for _, row in ancestor.iterrows():
            loc_min, loc_max = sorted([row[1], row[2]])
            index1 = gff2[(gff2['chr'] == row[0]) &
                          (gff2['order'] >= loc_min) &
                          (gff2['order'] <= loc_max)].index
            gff2.loc[index1, ['color', 'classification']] = row[3], row[4]

        # Merge pairs with gff2 and update gff1 with color and classification
        data = pd.merge(pairs, gff2, left_on=1, right_index=True, how='left')
        data.drop_duplicates(subset=[0], inplace=True)
        data.set_index(0, inplace=True)
        gff1.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]
        return gff1

    def karyotype_map(self, gff, lens):
        # Filter gff based on lens index and non-null color
        gff = gff[gff['chr'].isin(lens.index) & gff['color'].notnull()]
        ancestor = []
        # Group by chromosome and process each group to create ancestor records
        for chr, group in gff.groupby('chr'):
            color, class_id, arr = '', 1, []
            for _, row in group.iterrows():
                if color == row['color'] and class_id == row['classification']:
                    arr.append(row['order'])
                else:
                    # A run ends here; keep it only if it has at least limit_length genes
                    if len(arr) >= self.limit_length:
                        ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])
                    color, class_id = row['color'], row['classification']
                    arr = []
                    # If the new run matches the last kept segment (a shorter run in between was
                    # dropped), reopen that segment: re-add its start plus random placeholder
                    # orders inside it so that its gene count carries over
                    if len(ancestor) >= 1 and color == ancestor[-1][3] and class_id == ancestor[-1][4] and chr == ancestor[-1][0]:
                        arr.append(ancestor[-1][1])
                        arr += np.random.randint(ancestor[-1][1], ancestor[-1][2], size=ancestor[-1][5]-1).tolist()
                        ancestor.pop()
                    arr.append(row['order'])
            if len(arr) >= self.limit_length:
                ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])

        ancestor = pd.DataFrame(ancestor)
        # Adjust min and max positions for each chromosome group
        for chr, group in ancestor.groupby(0):
            ancestor.loc[group.index[0], 1] = 1
            ancestor.loc[group.index[-1], 2] = lens[chr]
        ancestor[4] = ancestor[4].astype(int)
        return ancestor[[0, 1, 2, 3, 4, 5]]

    def colinear_gene_pairs(self, bkinfo, gff1, gff2):
        gff1 = gff1.reset_index()
        gff2 = gff2.reset_index()
        
        gff1_indexed = gff1.set_index(['chr', 'order'])
        gff2_indexed = gff2.set_index(['chr', 'order'])
        
        data = []
        for _, row in bkinfo.iterrows():
            b1 = list(map(int, row['block1'].split('_')))
            b2 = list(map(int, row['block2'].split('_')))

            for order1, order2 in zip(b1, b2):
                a = gff1_indexed.loc[(row['chr1'], order1), 1]
                b = gff2_indexed.loc[(row['chr2'], order2), 1]
                data.append([a, b])
        return pd.DataFrame(data)
    
    def new_ancestor(self, ancestor, gff1, gff2, blast):
        # Iterate through ancestor rows to adjust positions based on neighboring rows
        for i in range(1, len(ancestor)):
            if ancestor.iloc[i, 0] == ancestor.iloc[i-1, 0]:
                area = ancestor.iloc[i, 1] - ancestor.iloc[i-1, 2]
                if area <= 5:
                    ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1
                else:
                    index1 = gff1[(gff1['chr'] == ancestor.iloc[i, 0]) &
                                (gff1['order'] >= ancestor.iloc[i-1, 2]+1) &
                                (gff1['order'] <= ancestor.iloc[i, 1]-1)].index
                    index2 = gff2[gff2['color'] == ancestor.iloc[i-1, 3]].index
                    index3 = gff2[gff2['color'] == ancestor.iloc[i, 3]].index

                    newblast1 = blast[(blast[0].isin(index1)) & (blast[1].isin(index2))]
                    newblast2 = blast[(blast[0].isin(index1)) & (blast[1].isin(index3))]

                    if len(newblast1) >= len(newblast2):
                        ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1
                    else:
                        ancestor.iloc[i, 1] = ancestor.iloc[i-1, 2] + 1
        for chr, group in ancestor.groupby(0):
            if len(group) == 1:
                continue
            newgff1 = gff1[gff1['chr'] == chr]
            for i in range(1, len(group)):
                if group.iloc[i, 5] > 200:
                    continue

                index_left = newgff1[(newgff1['order'] >= group.iloc[i, 1]) &
                                (newgff1['order'] <= group.iloc[i, 2])].index
                blast_left = blast[blast[0].isin(index_left)]

                index_prev = gff2[gff2['color'] == group.iloc[i-1, 3]].index
                blast_prev = blast_left[blast_left[1].isin(index_prev)]

                index_curr = gff2[gff2['color'] == group.iloc[i, 3]].index
                blast_curr = blast_left[blast_left[1].isin(index_curr)]

                if len(blast_curr) <= len(blast_prev):
                    ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]-1,3]

                if i < len(group)-1:
                    index_next = gff2[gff2['color'] == group.iloc[i+1, 3]].index
                    blast_next = blast_left[blast_left[1].isin(index_next)]
                    if len(blast_next) > max(len(blast_prev),len(blast_curr)):
                        ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]+1,3]
        
        ancestor['group'] = (ancestor[0].shift(1) != ancestor[0]) | (ancestor[3].shift(1) != ancestor[3]) | (ancestor[4].shift(1) != ancestor[4])
        ancestor['group'] = ancestor['group'].cumsum()
        result = ancestor.groupby('group').agg({
            0: 'first',
            1: 'min',
            2: 'max',
            3: 'first',
            4: 'first',
        }).reset_index(drop=True)

        return result

    def run(self):
        # Read and process block information
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        if self.blockinfo_reverse:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        bkinfo = bkinfo[bkinfo['length'] > int(self.block_length)]

        # Read GFF and lens data
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        lens = base.newlens(self.the_other_lens, self.position)
        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)
        # blast.drop_duplicates(subset=[0], keep='first', inplace=True)

        # Find colinear gene pairs
        pairs = self.colinear_gene_pairs(bkinfo, gff1, gff2)

        # Depending on available attributes, call either karyotype_top or karyotype_left
        if hasattr(self, 'ancestor_top'):
            ancestor = base.read_classification(self.ancestor_top)
            data = self.karyotype_top(pairs, ancestor, gff1, gff2)
        elif hasattr(self, 'ancestor_left'):
            ancestor = base.read_classification(self.ancestor_left)
            data = self.karyotype_left(pairs, ancestor, gff1, gff2)
            gff1, gff2 = gff2, gff1
            blast.iloc[:, :2] = blast.iloc[:, [1, 0]].to_numpy()
        else:
            print('Missing ancestor file.')
            exit(0)

        # Map the data and create the final ancestor file
        the_other_ancestor_file = self.karyotype_map(data, lens)
        the_other_ancestor_file = self.new_ancestor(the_other_ancestor_file, gff1, gff2, blast)
        the_other_ancestor_file.to_csv(self.the_other_ancestor_file, sep='\t', header=False, index=False)
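

# --- Illustrative sketch, not part of WGDI ---
# new_ancestor ends by collapsing consecutive rows that share chromosome, color
# and class into one segment (minimum start, maximum end). A pandas-free version
# of that step is sketched here; the helper name and any rows passed to it are
# hypothetical.
from itertools import groupby


def _example_merge_adjacent_segments(rows):
    # rows: list of (chr, start, end, color, class_id), already in output order
    merged = []
    for key, run in groupby(rows, key=lambda r: (r[0], r[3], r[4])):
        run = list(run)
        start = min(r[1] for r in run)
        end = max(r[2] for r in run)
        merged.append((key[0], start, end, key[1], key[2]))
    return merged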

================================================
FILE: build/lib/wgdi/ks.py
================================================
import os
import sys
import numpy as np
import pandas as pd
from Bio import SeqIO
import subprocess
from Bio.Phylo.PAML import yn00
import wgdi.base as base


class ks:
    def __init__(self, options):
        base_conf = base.config()
        self.pair_pep_file = 'pair.pep'
        self.pair_cds_file = 'pair.cds'
        self.prot_align_file = 'prot.aln'
        self.mrtrans = 'pair.mrtrans'
        self.pair_yn = 'pair.yn'

        for k, v in base_conf:
            setattr(self, str(k), v)
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{str(k)} = {v}')

    def auto_file(self):
        pairs = []
        with open(self.pairs_file) as f:
            p = ' '.join(f.readlines()[:30])

        # Detect file format and process accordingly
        if 'path length' in p or 'MAXIMUM GAP' in p:
            collinearity = base.read_colinearscan(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif 'MATCH_SIZE' in p or '## Alignment' in p:
            collinearity = base.read_mcscanx(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif '# Alignment' in p:
            collinearity = base.read_collinearity(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif '###' in p:
            collinearity = base.read_jcvi(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif ',' in p:
            collinearity = pd.read_csv(self.pairs_file, header=None)
            pairs = collinearity.values.tolist()
        else:
            collinearity = pd.read_csv(self.pairs_file, header=None, sep='\t')
            pairs = collinearity.values.tolist()

        df = pd.DataFrame(pairs).drop_duplicates()
        df[0] = df[0].astype(str)
        df[1] = df[1].astype(str)
        df.index = df[0] + ',' + df[1]
        return df

    def run(self):
        # Load sequence data
        cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta"))
        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta"))
        df_pairs = self.auto_file()

        # Check if ks file exists and load it, otherwise create a new one
        if os.path.exists(self.ks_file):
            ks = pd.read_csv(self.ks_file, sep='\t').drop_duplicates()
            kscopy = ks.copy()
            names = ks.columns.tolist()
            names[0], names[1] = names[1], names[0]
            kscopy.columns = names
            ks = pd.concat([ks, kscopy])
            ks['id'] = ks['id1'] + ',' + ks['id2']
            df_pairs.drop(np.intersect1d(df_pairs.index, ks['id'].to_numpy()), inplace=True)
            ks_file = open(self.ks_file, 'a+')
        else:
            ks_file = open(self.ks_file, 'w')
            ks_file.write('\t'.join(['id1', 'id2', 'ka_NG86', 'ks_NG86', 'ka_YN00', 'ks_YN00']) + '\n')

        # Filter valid pairs based on sequence data
        df_pairs = df_pairs[
            (df_pairs[0].isin(cds.keys())) & (df_pairs[1].isin(cds.keys())) &
            (df_pairs[0].isin(pep.keys())) & (df_pairs[1].isin(pep.keys()))
        ]

        pairs = df_pairs[[0, 1]].to_numpy()

        if len(pairs) > 0 and pairs[0][0][:3] == pairs[0][1][:3]:
            allpairs = []
            pair_hash = {}
            for k in pairs:
                if k[0] + ',' + k[1] in pair_hash or k[1] + ',' + k[0] in pair_hash:
                    continue
                else:
                    pair_hash[k[0] + ',' + k[1]] = 1
                    pair_hash[k[1] + ',' + k[0]] = 1
                    allpairs.append(k)
            pairs = allpairs

        for k in pairs:
            cds_gene1, cds_gene2 = cds[k[0]], cds[k[1]]
            cds_gene1.id, cds_gene2.id = 'gene1', 'gene2'
            pep_gene1, pep_gene2 = pep[k[0]], pep[k[1]]
            pep_gene1.id, pep_gene2.id = 'gene1', 'gene2'

            # Write sequences to files
            SeqIO.write([cds[k[0]], cds[k[1]]], self.pair_cds_file, "fasta")
            SeqIO.write([pep[k[0]], pep[k[1]]], self.pair_pep_file, "fasta")

            # Compute Ka/Ks values; pair_kaks returns an empty list when a step fails
            kaks = self.pair_kaks(['gene1', 'gene2'])
            if not kaks:
                continue

            ks_file.write('\t'.join([str(i) for i in list(k) + list(kaks)]) + '\n')

        ks_file.close()

        # Clean up temporary files
        for file in [
            self.pair_pep_file, self.pair_cds_file, self.mrtrans, self.pair_yn,
            self.prot_align_file, '2YN.dN', '2YN.dS', '2YN.t', 'rst', 'rst1', 'yn00.ctl', 'rub'
        ]:
            try:
                os.remove(file)
            except OSError:
                pass

    def pair_kaks(self, k):
        self.align()
        pal = self.pal2nal()
        if not pal:
            return []

        kaks = self.run_yn00()
        if kaks is None:
            return []

        kaks_new = [
            kaks[k[0]][k[1]]['NG86']['dN'], kaks[k[0]][k[1]]['NG86']['dS'],
            kaks[k[0]][k[1]]['YN00']['dN'], kaks[k[0]][k[1]]['YN00']['dS']
        ]
        return kaks_new

    def align(self):
        if self.align_software == 'mafft':
            try:
                command = [self.mafft_path, '--quiet', self.pair_pep_file, '>', self.prot_align_file]
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running MAFFT: {e}")

        elif self.align_software == 'muscle':
            try:
                command = [self.muscle_path, '-align', self.pair_pep_file, '-output', self.prot_align_file, '-quiet']
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running Muscle: {e}")

    def pal2nal(self):
        args = ['perl', self.pal2nal_path, self.prot_align_file, self.pair_cds_file, '-output paml -nogap', '>' + self.mrtrans]
        command = ' '.join(args)
        # os.system does not raise when the command fails, so check its exit status
        return os.system(command) == 0

    def run_yn00(self):
        yn = yn00.Yn00()
        yn.alignment = self.mrtrans
        yn.out_file = self.pair_yn
        yn.set_options(icode=0, commonf3x4=0, weighting=0, verbose=1)

        try:
            run_result = yn.run(command=self.yn00_path)
        except Exception:
            # Any yn00 failure is reported to the caller as None
            run_result = None
        return run_result
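

# --- Illustrative sketch, not part of WGDI ---
# When both genes of the first pair share a three-character prefix, ks.run keeps
# only one orientation of each pair (A,B and B,A count as the same pair). The
# same idea is sketched here with a set of frozensets; the helper name and the
# gene ids used to try it are hypothetical.
def _example_drop_mirrored_pairs(pairs):
    seen = set()
    kept = []
    for a, b in pairs:
        key = frozenset((a, b))
        if key in seen:
            continue
        seen.add(key)
        kept.append((a, b))
    return kept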


================================================
FILE: build/lib/wgdi/ks_peaks.py
================================================
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

import wgdi.base as base

class kspeaks:
    def __init__(self, options):
        # Default values; figsize and area are strings because they are parsed with split(',') below
        self.tandem_length = 200
        self.figsize = '10,6.18'
        self.fontsize = 9
        self.block_length = 3
        self.area = '0,3'
        self.tandem = True

        # Set options passed in
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{str(k)} = {v}')

        # Convert string values to lists of floats
        self.homo = [float(k) for k in self.homo.split(',')]
        self.ks_area = [float(k) for k in self.ks_area.split(',')]
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.pvalue = float(self.pvalue)
        self.block_length = int(self.block_length)
        self.tandem = base.str_to_bool(self.tandem)

    def remove_tandem(self, bkinfo):
        """
        Remove tandem duplications based on start and end position differences.
        """
        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
        group.loc[:, 'start'] = group.loc[:, 'start1'] - group.loc[:, 'start2']
        group.loc[:, 'end'] = group.loc[:, 'end1'] - group.loc[:, 'end2']
        
        # Drop rows where start or end difference is within tandem length
        index = group[(group['start'].abs() <= self.tandem_length) | 
                      (group['end'].abs() <= self.tandem_length)].index
        bkinfo = bkinfo.drop(index)
        return bkinfo

    def ks_kde(self, df):
        """
        Perform kernel density estimation (KDE) on Ks data.
        """
        # Clean up 'ks' column by removing leading underscores
        df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:]
        
        ks = df['ks'].str.split('_')
        arr = []
        ks_ave = []
        
        # Collect individual Ks values and calculate average Ks per row
        for v in ks.values:
            v = [float(k) for k in v if float(k) >= 0]
            if len(v) == 0:
                continue
            arr.extend(v)
            ks_ave.append(sum(v) / len(v))  # Mean of each row's Ks values
        
        # KDE for three distributions: median, average, total
        kdemedian = gaussian_kde(df['ks_median'].values)
        kdemedian.set_bandwidth(bw_method=kdemedian.factor / 3.)
        
        kdeaverage = gaussian_kde(ks_ave)
        kdeaverage.set_bandwidth(bw_method=kdeaverage.factor / 3.)
        
        kdetotal = gaussian_kde(arr)
        kdetotal.set_bandwidth(bw_method=kdetotal.factor / 3.)

        return [kdemedian, kdeaverage, kdetotal]

    def run(self):
        """
        Main method to process the data, perform KDE, and generate the plot.
        """
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)

        # Read the block info file
        bkinfo = pd.read_csv(self.blockinfo)
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo['length'] = bkinfo['length'].astype(int)

        # Filter based on block length and p-value
        bkinfo = bkinfo[(bkinfo['length'] > self.block_length) &
                        (bkinfo['pvalue'] < self.pvalue)]

        # Remove tandem duplications if needed
        if not self.tandem:
            bkinfo = self.remove_tandem(bkinfo)

        # Further filtering on the homo<multiple> column range and the Ks median range
        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] >= self.homo[0]]
        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] <= self.homo[1]]
        bkinfo = bkinfo[bkinfo['ks_median'] >= self.ks_area[0]]
        bkinfo = bkinfo[bkinfo['ks_median'] <= self.ks_area[1]]

        # Perform KDE on the Ks data
        kdemedian, kdeaverage, kdetotal = self.ks_kde(bkinfo)

        # Define the range for the x-axis (Ks values)
        dist_space = np.linspace(self.area[0], self.area[1], 500)

        # Plot the KDE results
        ax.plot(dist_space, kdemedian(dist_space), color='red', label='block median')
        ax.plot(dist_space, kdeaverage(dist_space), color='black', label='block average')
        ax.plot(dist_space, kdetotal(dist_space), color='blue', label='all pairs')

        # Set plot labels, grid, and limits
        ax.grid()
        ax.set_xlabel(r'${K_{s}}$', fontsize=20)
        ax.set_ylabel('Frequency', fontsize=20)
        ax.tick_params(labelsize=18)
        ax.set_xlim(self.area)
        ax.legend(fontsize=20)

        # Adjust layout for better display
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)

        # Save the figure
        plt.savefig(self.savefig, dpi=500)
        plt.show()

        # Save the filtered data to CSV
        bkinfo.to_csv(self.savefile, index=False)
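

# --- Illustrative sketch, not part of WGDI ---
# ks_kde reads each block's Ks values from an underscore-joined string, drops
# negative entries and averages the rest. The helper below mirrors that parsing
# for a single string; its name and any sample strings are hypothetical.
def _example_block_ks(ks_string):
    # A leading underscore would otherwise produce an empty first field
    if ks_string.startswith('_'):
        ks_string = ks_string[1:]
    values = [float(v) for v in ks_string.split('_') if float(v) >= 0]
    mean = sum(values) / len(values) if values else None
    return values, mean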

================================================
FILE: build/lib/wgdi/ksfigure.py
================================================
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base
from scipy import stats


class ksfigure():
    def __init__(self, options):
        # figsize and area are strings because they are parsed with split(',') below
        self.figsize = '10,6.18'
        self.legendfontsize = 30
        self.labelfontsize = 9
        self.area = '0,3'
        self.shadow = True
        self.mode = 'median'
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        if self.xlabel == 'none' or self.xlabel == '':
            self.xlabel = r'Synonymous nucleotide substitution (${K_{s}}$)'
        if self.ylabel == 'none' or self.ylabel == '':
            self.ylabel = 'kernel density of syntenic blocks'
        if self.title == 'none' or self.title == '':
            self.title = ''
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.shadow = base.str_to_bool(self.shadow)

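    def Gaussian_distribution(self, t, k):
        # k holds consecutive (amplitude, centre, width) triples, one per peak.
        # A peak amp * exp(-((x - ctr) / wid)**2) equals a normal pdf with
        # sigma = wid / sqrt(2), scaled by amp * sqrt(2 * pi) * sigma.
        y = np.zeros(len(t))
        for i in range(0, int((len(k) - 1) / 3)+1):
            if np.isnan(k[3 * i + 2]):
                continue
            # Work on local values so the caller's array is not modified
            sigma = float(k[3 * i + 2])/np.sqrt(2)
            scale = float(k[3 * i + 0]) * np.sqrt(2*np.pi)*sigma
            y = y + stats.norm.pdf(t, float(k[3 * i + 1]), sigma) * scale
        return y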

    def run(self):
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        ksfit = pd.read_csv(self.ksfit, index_col=0)
        t = np.arange(self.area[0], self.area[1], 0.0005)
        col = [k for k in ksfit.columns if re.match('Unnamed:', k)]
        for index, row in ksfit.iterrows():
            ax.plot(t, self.Gaussian_distribution(
                t, row[col].values), linestyle=row['linestyle'], color=row['color'],alpha=0.8, label=index, linewidth=row['linewidth'])
            if self.shadow:
                ax.fill_between(t, 0, self.Gaussian_distribution(t, row[col].values), color=row['color'], alpha=0.15, interpolate=True, edgecolor=None, label=index)
        align = dict(family='Arial', verticalalignment="center",
                     horizontalalignment="center")
        ax.set_xlabel(self.xlabel, fontsize=self.labelfontsize,
                      labelpad=20, **align)
        ax.set_ylabel(self.ylabel, fontsize=self.labelfontsize,
                      labelpad=20, **align)
        ax.set_title(self.title, weight='bold',
                     fontsize=self.labelfontsize, **align)
        plt.tick_params(labelsize=10)
        handles,labels = ax.get_legend_handles_labels()
        df = pd.DataFrame({  'handles': handles, 'labels': labels})
        df.drop_duplicates(subset='labels', keep='first', inplace=True)
        handles, labels = df['handles'].tolist(), df['labels'].tolist()
        # Both shadow settings use the same legend
        plt.legend(handles=handles, labels=labels, loc='upper right', prop={
                   'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)
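

# --- Illustrative sketch, not part of WGDI ---
# Gaussian_distribution rewrites a fitted peak a*exp(-((x-mu)/w)**2) as a scaled
# normal pdf with sigma = w/sqrt(2) and scale = a*sqrt(2*pi)*sigma. The two
# helpers below allow a standard-library check that both forms agree; their
# names and any numbers passed to them are hypothetical.
import math


def _example_peak(x, a, mu, w):
    # Parameterisation used when fitting the peaks
    return a * math.exp(-((x - mu) / w) ** 2)


def _example_scaled_pdf(x, a, mu, w):
    # The same curve expressed as scale * normal pdf
    sigma = w / math.sqrt(2)
    scale = a * math.sqrt(2 * math.pi) * sigma
    pdf = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return scale * pdf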


================================================
FILE: build/lib/wgdi/peaksfit.py
================================================
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.stats import gaussian_kde, linregress

import wgdi.base as base


class peaksfit():
    def __init__(self, options):
        # figsize and area are strings because they are parsed with split(',') below
        self.figsize = '10,6.18'
        self.fontsize = 9
        self.area = '0,3'
        self.mode = 'median'
        self.histogram_only = False
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.bins_number = int(self.bins_number)
        self.peaks = 1
        self.histogram_only = base.str_to_bool(self.histogram_only)

    def ks_values(self, df):
        # Strip a leading underscore so that splitting on '_' yields no empty first field
        df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:]
        ks = df['ks'].str.split('_')
        ks_total = []
        for v in ks.values:
            ks_total.extend([float(k) for k in v])
        ks_average = df['ks_average'].values
        ks_median = df['ks_median'].values
        return [ks_median, ks_average, ks_total]

    def gaussian_fuc(self, x, *params):
        y = np.zeros_like(x)
        for i in range(0, len(params), 3):
            amp = float(params[i])
            ctr = float(params[i+1])
            wid = float(params[i+2])
            y = y + amp * np.exp(-((x - ctr)/wid)**2)
        return y

    def kde_fit(self, data, x):
        kde = gaussian_kde(data)
        kde.set_bandwidth(bw_method=kde.factor/3.)
        p = kde(x)
        guess = [1,1, 1]*self.peaks
        popt, pcov = curve_fit(self.gaussian_fuc, x, p, guess, maxfev = 80000)
        popt = [abs(k) for k in popt]
        data = []
        y = self.gaussian_fuc(x, *popt)
        for i in range(0, len(popt), 3):
            array = [popt[i], popt[i+1], popt[i+2]]
            data.append(self.gaussian_fuc(x, *array))
        slope, intercept, r_value, p_value, std_err = linregress(p, y)
        print("\nR-square: "+str(r_value**2))
        print("The gaussian fitting curve parameters are :")
        print('  |  '.join([str(k) for k in popt]))
        return y, data

    def run(self):
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        bkinfo = pd.read_csv(self.blockinfo)
        ks_median, ks_average, ks_total = self.ks_values(bkinfo)
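        # Select the Ks series named by self.mode ('median', 'average' or 'total')
        data = {'median': ks_median, 'average': ks_average, 'total': ks_total}[self.mode]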
        data = [k for k in data if self.area[0] <= k <= self.area[1]]
        x = np.linspace(self.area[0], self.area[1], self.bins_number)
        n, bins, patches = ax.hist(data, int(
            self.bins_number), density=1, facecolor='blue', alpha=0.3, label='Histogram')
        if not self.histogram_only:
            y, fit = self.kde_fit(data, x)
            ax.plot(x, y, color='black', linestyle='-', label='Gaussian fitting')
        ax.grid()
        align = dict(family='Arial', verticalalignment="center",
                     horizontalalignment="center")
        ax.set_xlabel(r'${K_{s}}$', fontsize=20)
        ax.set_ylabel('Frequency', fontsize=20)
        ax.tick_params(labelsize=18)
        ax.legend(fontsize=20)
        ax.set_xlim(self.area)
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)
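

# --- Illustrative sketch, not part of WGDI ---
# kde_fit prints r_value**2 from linregress(p, y), i.e. the squared Pearson
# correlation between the KDE curve and the fitted curve. A standard-library
# version of that quantity is sketched here; the helper name and any data
# passed to it are hypothetical, and constant inputs are not handled.
def _example_r_squared(p, y):
    n = len(p)
    mean_p = sum(p) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_p) * (b - mean_y) for a, b in zip(p, y))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_p * var_y)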


================================================
FILE: build/lib/wgdi/pindex.py
================================================
import os
import sys

import numpy as np
import pandas as pd
import wgdi.base as base


class pindex():
    def __init__(self, options):
        self.remove_delta = True
        self.position = 'order'
        self.retention = 0.05
        self.diff = 0.05
        self.gap = 50
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
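        self.gap = int(self.gap)
        self.retention = float(self.retention)
        self.diff = float(self.diff)
        # Option values may arrive as strings, so normalise the flag before it is compared to True
        self.remove_delta = base.str_to_bool(self.remove_delta)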

    def Pindex(self, sub1, sub2):
        r1 = self.retain(sub1)
        r2 = self.retain(sub2)
        r = []
        for i in range(len(r2)):
            if(r1[i] < self.retention or r2[i] < self.retention):
                r.append(0)
                continue
            d = (r1[i]-r2[i])/(r1[i]+r2[i])*0.5
            if d > self.diff:
                r.append(1)
            elif -d > self.diff:
                r.append(-1)
            else:
                r.append(0)
        a, b, c = len([i for i in r if i == 1]), len(
            [i for i in r if i == -1]), len([i for i in r if i == 0])
        return [a, -b, c, len(r)]

    def retain(self, arr):
        """Mean of arr in windows [i-gap, i+gap) centred on every 2*gap-th position."""
        a = []
        for i in range(0, len(arr), 2*self.gap):
            start, end = i-self.gap, i+self.gap
            genenum, retainnum = 0, 0
            for j in range(start, end):
                # Skip positions that fall outside the chromosome
                if j < 0 or j >= len(arr):
                    continue
                retainnum += arr[j]
                genenum += 1
            a.append(float(retainnum/genenum))
        return a

    def run(self):
        alignment = pd.read_csv(self.alignment, header=None, index_col=0)
        alignment.replace(r'\w+', 1, regex=True, inplace=True)
        alignment.replace('.', 0, inplace=True)
        alignment.fillna(0, inplace=True)
        gff = base.newgff(self.gff)
        lens = base.newlens(self.lens, self.position)
        gff = gff[gff['chr'].isin(lens.index)]
        alignment = alignment.join(gff[['chr', self.position]], how='left')
        alignment.dropna(axis=0, how='any', inplace=True)
        p = self.cal_pindex(alignment)
        print('Polyploidy-index: ', p)
        sys.exit(0)

    def cal_pindex(self, alignment):
        data, df = [], []
        columns = alignment.columns[:-2].tolist()
        for i in range(len(columns)-1):
            for j in range(i+1, len(columns)):
                b = []
                for chr_name, group in alignment.groupby('chr'):
                    sub1 = group.loc[:, columns[i]].tolist()
                    sub2 = group.loc[:, columns[j]].tolist()
                    p = self.Pindex(sub1, sub2)
                    b.append(p)
                    df.append([i, j, chr_name]+p)
                sub_diver = sum([abs(k[0]+k[1]) for k in b])
                if self.remove_delta == True:
                    # Ignore windows with no clear difference.
                    sub_total = sum([abs(k[1])+abs(k[0]) for k in b])
                else:
                    sub_total = sum([abs(k[1])+abs(k[0])+abs(k[2]) for k in b])
                # Avoid dividing by zero when no window is informative.
                c = sub_diver/sub_total if sub_total != 0 else 0
                data.append(c)
        df = pd.DataFrame(df, columns=[
                          'sub1', 'sub2', 'chr', 'sub1_high', 'sub2_high', 'No_diff', 'Total'])
        df['sub2_high'] = df['sub2_high'].abs()
        self.infomation(df)
        print('\nPolyploidy-index between subgenomes are ', data)
        return sum(data)/len(data)

    def turn_percentage(self, x):
        return '(%.2f%%)' % (x * 100)

    def infomation(self, df):
        data = []
        for names, group in df.groupby(['sub1', 'sub2']):
            newgroup = pd.concat([group.head(1), group],
                                 axis=0, ignore_index=True)
            cols = ['sub1_high', 'sub2_high', 'No_diff', 'Total']
            newgroup.loc[0, cols] = group.loc[:, cols].sum()
            group1 = newgroup.copy()
            group1[cols] = group1[cols].astype(str)
            newgroup['sub1_high'] = (
                newgroup['sub1_high'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['sub2_high'] = (
                newgroup['sub2_high'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['No_diff'] = (
                newgroup['No_diff'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['Total'] = (
                newgroup['Total'] / group['Total'].sum()).apply(self.turn_percentage)
            newgroup[cols] = group1[cols]+newgroup[cols]
            group_list = []
            a = newgroup[['chr']+cols].columns.to_numpy()
            a[0] = 'Chromosome'
            a[1], a[2] = 'Sub_'+str(names[0]+1), 'Sub_'+str(names[1]+1)
            group_list.append(a)
            b = newgroup[['chr']+cols].to_numpy()
            b[0][0] = 'Total'
            for k in b:
                group_list.append(k)
            group_list = np.array(group_list).T
            for k in group_list:
                data.append(k)
        data = pd.DataFrame(data)
        data.to_csv(self.savefile, header=None, index=None)
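

# Illustrative sketch (not part of the original module): how a single pair of
# window retention rates is classified inside pindex.Pindex. The helper name
# `classify_window` is hypothetical, and its default thresholds are assumed
# to mirror the defaults set in pindex.__init__ (retention=0.05, diff=0.05).
def classify_window(r1, r2, retention=0.05, diff=0.05):
    # Windows where either subgenome is below the retention threshold
    # are treated as showing no difference.
    if r1 < retention or r2 < retention:
        return 0
    d = (r1 - r2) / (r1 + r2) * 0.5
    if d > diff:
        return 1    # first subgenome retains clearly more genes
    if -d > diff:
        return -1   # second subgenome retains clearly more genes
    return 0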


================================================
FILE: build/lib/wgdi/polyploidy_classification.py
================================================
import pandas as pd
import wgdi.base as base


class polyploidy_classification:
    def __init__(self, options):
        self.same_protochromosome = False
        self.same_subgenome = False
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        
        self.same_protochromosome = base.str_to_bool(self.same_protochromosome)
        self.same_subgenome = base.str_to_bool(self.same_subgenome)
        
        # Initialize classid with a default value if not provided
        self.classid = [str(k) for k in getattr(self, 'classid', 'class1,class2').split(',')]

    def run(self):
        # Read input files
        ancestor_left = base.read_classification(self.ancestor_left)
        ancestor_top = base.read_classification(self.ancestor_top)
        bkinfo = pd.read_csv(self.blockinfo)

        # Ensure chr1 and chr2 are treated as strings
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)

        # Filter rows where chr1 and chr2 match ancestor values
        # (copy so the column assignments below do not act on a view of the original frame)
        bkinfo = bkinfo[bkinfo['chr1'].isin(ancestor_left[0].values) & bkinfo['chr2'].isin(ancestor_top[0].values)].copy()

        # Initialize additional columns
        bkinfo[self.classid[0]] = 0
        bkinfo[self.classid[1]] = 0
        bkinfo[self.classid[0] + '_color'] = ''
        bkinfo[self.classid[1] + '_color'] = ''
        bkinfo['diff'] = 0.0

        # Processing the first classification (ancestor_left vs chr1)
        for name, group in bkinfo.groupby('chr1'):
            d1 = ancestor_left[ancestor_left[0] == name]
            for index1, row1 in group.iterrows():
                a, b = sorted([row1['start1'], row1['end1']])
                a, b = int(a), int(b)
                for index2, row2 in d1.iterrows():
                    c, d = sorted([row2[1], row2[2]])
                    # Fraction of the block [a, b) covered by the ancestral segment [c, d)
                    h = max(0, min(b, d) - max(a, c)) / (b - a)
                    if h > bkinfo.loc[index1, 'diff']:
                        bkinfo.loc[index1, 'diff'] = float(h)
                        bkinfo.loc[index1, self.classid[0]] = row2[4]
                        bkinfo.loc[index1, self.classid[0] + '_color'] = row2[3]

        # Reset 'diff' and process the second classification (ancestor_top vs chr2)
        bkinfo['diff'] = 0.0
        for name, group in bkinfo.groupby('chr2'):
            d2 = ancestor_top[ancestor_top[0] == name]
            for index1, row1 in group.iterrows():
                a, b = sorted([row1['start2'], row1['end2']])
                a, b = int(a), int(b)
                for index2, row2 in d2.iterrows():
                    c, d = sorted([row2[1], row2[2]])
                    # Fraction of the block [a, b) covered by the ancestral segment [c, d)
                    h = max(0, min(b, d) - max(a, c)) / (b - a)
                    if h > bkinfo.loc[index1, 'diff']:
                        bkinfo.loc[index1, 'diff'] = float(h)
                        bkinfo.loc[index1, self.classid[1]] = row2[4]
                        bkinfo.loc[index1, self.classid[1] + '_color'] = row2[3]

        # Optionally keep only blocks whose two sides share the same
        # protochromosome color and/or the same subgenome label
        if self.same_protochromosome:
            bkinfo = bkinfo[bkinfo[self.classid[1] + '_color'] == bkinfo[self.classid[0] + '_color']]
        if self.same_subgenome:
            bkinfo = bkinfo[bkinfo[self.classid[1]] == bkinfo[self.classid[0]]]

        # Save the result to a CSV file
        bkinfo.to_csv(self.savefile, index=False)
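

# Illustrative sketch (not part of the original module): the overlap fraction
# that run() uses to pick the best-matching ancestral segment for a block.
# The helper name `overlap_fraction` is hypothetical, and the guard for an
# empty block (a == b) is an added assumption; run() itself has no such guard.
def overlap_fraction(a, b, c, d):
    """Fraction of the half-open interval [a, b) that lies inside [c, d)."""
    a, b = sorted([a, b])
    c, d = sorted([c, d])
    if b == a:
        return 0.0
    return max(0, min(b, d) - max(a, c)) / (b - a)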


================================================
FILE: build/lib/wgdi/retain.py
================================================
import matplotlib.pyplot as plt
import pandas as pd
import wgdi.base as base

class retain:
    def __init__(self, options):
        self.position = 'order'
        
        # Initialize the options by setting attributes dynamically
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{str(k)} = {v}")

        # Handle the ylim parameter, which defines the y-axis limits
        self.ylim = [float(k) for k in self.ylim.split(',')] if hasattr(self, 'ylim') else [0, 1]
        
        # Handle the colors and figsize parameters
        self.colors = [str(k) for k in self.colors.split(',')]
        self.figsize = [float(k) for k in self.figsize.split(',')]

    def run(self):
        # Load GFF and lens data
        gff = base.newgff(self.gff)
        lens = base.newlens(self.lens, self.position)
        
        # Filter GFF data based on lens chromosome index
        gff = gff[gff['chr'].isin(lens.index)]
        
        # Load alignment data and join with GFF
        alignment = pd.read_csv(self.alignment, header=None, index_col=0)
        alignment = alignment.join(gff[['chr', self.position]], how='left')
        
        # Perform alignment processing
        self.retain = self.align_chr(alignment)
        
        # Save the processed data to a file
        self.retain[self.retain.columns[:-2]].to_csv(self.savefile, sep='\t', header=None)
        
        # Create a figure for plotting
        fig, axs = plt.subplots(len(lens), 1, sharex=True, sharey=True, figsize=tuple(self.figsize))
        fig.add_subplot(111, frameon=False)
        
        align = dict(family='DejaVu Sans', verticalalignment="center", horizontalalignment="center")

        
        # Hide all the spines and ticks on the plot
        for spine in plt.gca().spines.values():
            spine.set_visible(False)
        plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
        
        # Group the retain data by chromosome and plot each chromosome's data
        groups = self.retain.groupby('chr')
        for i, chr_name in enumerate(lens.index):
            group = groups.get_group(chr_name)

            if len(lens) == 1:
                for j, col in enumerate(self.retain.columns[:-2]):
                    # Use the configured position column ('order' by default)
                    axs.plot(group[self.position].values, group[col].values,
                                linestyle='-', color=self.colors[j], linewidth=1)
                axs.spines['right'].set_visible(False)
                axs.spines['top'].set_visible(False)
                axs.set_ylim(self.ylim)
                axs.tick_params(labelsize=12)                
            else:
                # Plot each column's data for the current chromosome
                for j, col in enumerate(self.retain.columns[:-2]):
                    axs[i].plot(group[self.position].values, group[col].values,
                                linestyle='-', color=self.colors[j], linewidth=1)
            
                # Hide the right and top spines for each subplot
                axs[i].spines['right'].set_visible(False)
                axs[i].spines['top'].set_visible(False)
                axs[i].set_ylim(self.ylim)
                axs[i].tick_params(labelsize=12)

        for i, chr_name in enumerate(lens.index):
            if len(lens) == 1:
                x, y = axs.get_xlim()[1] * 0.90, axs.get_ylim()[1] * 0.8
                axs.text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align)
            else:
                # Add a label for the reference genome and chromosome
                x, y = axs[i].get_xlim()[1] * 0.90, axs[i].get_ylim()[1] * 0.8
                axs[i].text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align)
        
        # Adjust layout and save the figure as an image
        plt.ylabel(f"{self.ylabel}\n\n\n\n", fontsize=18, **align)
        plt.subplots_adjust(left=0.1, right=0.95, top=0.95, bottom=0.05)
        plt.savefig(self.savefig, dpi=500)
        plt.show()

    def align_chr(self, alignment):
        """
        Perform the alignment processing for each chromosome by updating the values.
        """
        for i in alignment.columns[:-2]:
            # Update values: set '1' for valid values, '0' for invalid, and fill NaN with 0
            alignment.loc[alignment[i].str.contains(r'\w', na=False), i] = 1
            alignment.loc[alignment[i] == '.', i] = 0
            alignment.loc[alignment[i] == ' ', i] = 0
            alignment[i] = alignment[i].astype('float64').fillna(0)
            
            # Apply the moving average function to each group by chromosome
            for chr_name, group in alignment.groupby(['chr']):
                a = self.moving_average(group[i].values.tolist())
                alignment.loc[group.index, i] = a
        return alignment

    def moving_average(self, arr):
        """
        Calculate a moving average over a specified window size.
        This function smooths the input array using a sliding window.
        """
        a = []
        for i in range(len(arr)):
            # Define the window range
            start, end = max(0, i - int(self.step)), min(len(arr), i + int(self.step))
            ave = sum(arr[start:end]) / (end - start)
            a.append(ave)
        return a


================================================
FILE: build/lib/wgdi/run.py
================================================
import argparse
import os
import shutil
import sys

import wgdi
import wgdi.base as base
from wgdi.align_dotplot import align_dotplot
from wgdi.block_correspondence import block_correspondence
from wgdi.block_info import block_info
from wgdi.block_ks import block_ks
from wgdi.circos import circos
from wgdi.dotplot import dotplot
from wgdi.karyotype import karyotype
from wgdi.karyotype_mapping import karyotype_mapping
from wgdi.ks import ks
from wgdi.ks_peaks import kspeaks
from wgdi.ksfigure import ksfigure
from wgdi.peaksfit import peaksfit
from wgdi.pindex import pindex
from wgdi.polyploidy_classification import polyploidy_classification
from wgdi.retain import retain
from wgdi.run_colliearity import mycollinearity
from wgdi.trees import trees
from wgdi.ancestral_karyotype import ancestral_karyotype
from wgdi.ancestral_karyotype_repertoire import ancestral_karyotype_repertoire
from wgdi.shared_fusion import shared_fusion
from wgdi.fusion_positions_database import fusion_positions_database
from wgdi.fusions_detection import fusions_detection


# Argument parser setup
parser = argparse.ArgumentParser(
    prog='wgdi', usage='%(prog)s [options]', epilog="",
    formatter_class=argparse.RawDescriptionHelpFormatter
)

parser.description = '''\
WGDI(Whole-Genome Duplication Integrated): A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes.

    https://wgdi.readthedocs.io/en/latest/
    -------------------------------------- 
'''

parser.add_argument("-v", "--version", action='version', version='0.75')
parser.add_argument("-d", dest="dotplot", help="Show homologous gene dotplot")
parser.add_argument("-icl", dest="improvedcollinearity", help="Improved version of ColinearScan")
parser.add_argument("-ks", dest="calks", help="Calculate Ka/Ks for homologous gene pairs by YN00")
parser.add_argument("-bk", dest="blockks", help="Show Ks of blocks in a dotplot")
parser.add_argument("-bi", dest="blockinfo", help="Infer whole-genome duplication from collinearity and Ks")
parser.add_argument("-c", dest="correspondence", help="Extract event-related genomic alignment")
parser.add_argument("-kp", dest="kspeaks", help="A simple way to get ks peaks")
parser.add_argument("-kf", dest="ksfigure", help="A simple way to draw ks distribution map")
parser.add_argument("-pf", dest="peaksfit", help="Gaussian fitting of ks distribution")
parser.add_argument("-pc", dest="polyploidy_classification", help="Distinguish the subgenomes of a polyploid")
parser.add_argument("-a", dest="alignment", help="Show event-related genomic alignment in a dotplot")
parser.add_argument("-k", dest="karyotype", help="Show genome evolution from reconstructed ancestors")
parser.add_argument("-ak", dest="ancestral_karyotype", help="Generation of ancestral karyotypes from chromosomes that retain same structures in genomes")
parser.add_argument("-akr", dest="ancestral_karyotype_repertoire", help="Incorporate genes from collinearity blocks into the ancestral karyotype repertoire")
parser.add_argument("-km", dest="karyotype_mapping", help="Mapping from the known karyotype result to this species")
parser.add_argument("-fpd", dest="fusion_positions_database", help="Extract the fusion positions dataset")
parser.add_argument("-fd", dest="fusions_detection", help="Determine whether these fusion events occur in other genomes")
parser.add_argument("-sf", dest="shared_fusion", help="Quickly find shared fusions between species")
parser.add_argument("-at", dest="alignmenttrees", help="Construct phylogenetic trees from collinear genes")
parser.add_argument("-p", dest="pindex", help="Polyploidy-index: characterize the degree of divergence between subgenomes of a polyploid")
parser.add_argument("-r", dest="retain", help="Show subgenomes in gene retention or genome fractionation")
parser.add_argument("-ci", dest="circos", help="A simple way to run circos")
parser.add_argument("-conf", dest="configure", help="Display and modify the environment variable")

args = parser.parse_args()

# Function to run subprograms based on options
def run_subprogram(program, conf, name):
    options = base.load_conf(conf, name)
    r = program(options)
    r.run()

# Function to configure environment
def run_configure():
    base.rewrite(args.configure, 'ini')

# Main function to decide which module to run based on input arguments
def module_to_run(argument, conf):
    switcher = {
        'dotplot': (dotplot, conf, 'dotplot'),
        'correspondence': (block_correspondence, conf, 'correspondence'),
        'alignment': (align_dotplot, conf, 'alignment'),
        'retain': (retain, conf, 'retain'),
        'blockks': (block_ks, conf, 'blockks'),
        'blockinfo': (block_info, conf, 'blockinfo'),
        'calks': (ks, conf, 'ks'),
        'circos': (circos, conf, 'circos'),
        'kspeaks': (kspeaks, conf, 'kspeaks'),
        'peaksfit': (peaksfit, conf, 'peaksfit'),
        'ksfigure': (ksfigure, conf, 'ksfigure'),
        'pindex': (pindex, conf, 'pindex'),
        'alignmenttrees': (trees, conf, 'alignmenttrees'),
        'improvedcollinearity': (mycollinearity, conf, 'collinearity'),
        'configure': run_configure,
        'polyploidy_classification': (polyploidy_classification, conf, 'polyploidy classification'),
        'karyotype': (karyotype, conf, 'karyotype'),
        'ancestral_karyotype': (ancestral_karyotype, conf, 'ancestral_karyotype'),
        'karyotype_mapping': (karyotype_mapping, conf, 'karyotype_mapping'),
        'ancestral_karyotype_repertoire': (ancestral_karyotype_repertoire, conf, 'ancestral_karyotype_repertoire'),
        'shared_fusion': (shared_fusion, conf, 'shared_fusion'),
        'fusion_positions_database': (fusion_positions_database, conf, 'fusion_positions_database'),
        'fusions_detection': (fusions_detection, conf, 'fusions_detection'),
    }
    
    if argument == 'configure':
        run_configure()
    else:
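        entry = switcher.get(argument)
        # Unknown arguments have no entry; skip them instead of failing on unpacking
        if entry is not None:
            program, conf, name = entry
            run_subprogram(program, conf, name)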


# Main entry point
def main():
    path = wgdi.__path__[0]
    options = {
        'dotplot': 'dotplot.conf',
        'correspondence': 'corr.conf',
        'alignment': 'align.conf',
        'retain': 'retain.conf',
        'blockks': 'blockks.conf',
        'blockinfo': 'blockinfo.conf',
        'calks': 'ks.conf',
        'circos': 'circos.conf',
        'kspeaks': 'kspeaks.conf',
        'ksfigure': 'ksfigure.conf',
        'pindex': 'pindex.conf',
        'alignmenttrees': 'alignmenttrees.conf',
        'peaksfit': 'peaksfit.conf',
        'configure': 'conf.ini',
        'improvedcollinearity': 'collinearity.conf',
        'polyploidy_classification': 'polyploidy_classification.conf',
        'karyotype': 'karyotype.conf',
        'ancestral_karyotype': 'ancestral_karyotype.conf',
        'ancestral_karyotype_repertoire': 'ancestral_karyotype_repertoire.conf',
        'karyotype_mapping': 'karyotype_mapping.conf',
        'shared_fusion': 'shared_fusion.conf',
        'fusion_positions_database': 'fusion_positions_database.conf',
        'fusions_detection': 'fusions_detection.conf',
    }

    for arg in vars(args):
        value = getattr(args, arg)
        if value is not None:
            if value in ['?', 'help', 'example']:
                with open(os.path.join(path, 'example', options[arg])) as f:
                    print(f.read())
                
                if arg == 'ksfigure' and not os.path.exists('ks_fit_result.csv'):
                    shutil.copy2(os.path.join(wgdi.__path__[0], 'example/ks_fit_result.csv'), os.getcwd())
            elif not os.path.exists(value):
                print(f'{value} does not exist')
                sys.exit(1)
            else:
                module_to_run(arg, value)


if __name__ == "__main__":
    main()


================================================
FILE: build/lib/wgdi/run_colliearity.py
================================================
import gc
import re
import sys
from multiprocessing import Pool

import numpy as np
import pandas as pd

import wgdi.base as base
import wgdi.collinearity as improvedcollinearity


class mycollinearity():
    def __init__(self, options):
        # Initialize parameters with default values
        self.repeat_number = 10
        self.multiple = 1
        self.score = 100
        self.evalue = 1e-5
        self.blast_reverse = False
        self.over_gap  = 5
        self.comparison = 'genomes'
        self.options = options

        for k, v in options:
            setattr(self, str(k), v)
            print(f"{str(k)} = {v}")
        self.position = 'order'
        # Parse grading values
        if hasattr(self, 'grading'):
            self.grading = [int(k) for k in self.grading.split(',')]
        else:
            self.grading = [50, 40, 25]
        # Ensure process is an integer
        if hasattr(self, 'process'):
            self.process = int(self.process)
        else:
            self.process = 4
        self.over_gap = int(self.over_gap)
        # Keep the converted value; the option arrives as a string
        self.blast_reverse = base.str_to_bool(self.blast_reverse)

    def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):
        bluenum = rednum
        blast = blast.sort_values(by=[0, 11], ascending=[True, False])
        def assign_grading(group):
            group['cumcount'] = group.groupby(1).cumcount()
            # Copy the filtered rows so the 'grading' assignment below acts on an independent frame
            group = group[group['cumcount'] <= repeat_number].copy()
            group['grading'] = pd.cut(
                group['cumcount'],
                bins=[-1, 0, bluenum, repeat_number],
                labels=self.grading,
                right=True
            )
            return group
        newblast = blast.groupby(['chr1', 'chr2']).apply(assign_grading).reset_index(drop=True)
        newblast['grading'] = newblast['grading'].astype(int)
        return newblast[newblast['grading'] > 0]
    
    def deal_blast_for_genomes(self, blast, rednum, repeat_number):
        # Initialize the grading column
        blast['grading'] = 0
        
        # Define the blue number as the sum of rednum and the predefined constant
        bluenum = 4 + rednum
        
        # Get the indices for each group by sorting the 11th column in descending order
        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()
                for name, group in blast.groupby([0])]
        
        # Split the indices into red, blue, and gray groups
        reddata = np.array([k[:rednum] for k in index], dtype=object)
        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)
        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)
        
        # Concatenate the results into flat lists
        redindex = np.concatenate(reddata) if reddata.size else []
        blueindex = np.concatenate(bluedata) if bluedata.size else []
        grayindex = np.concatenate(graydata) if graydata.size else []

        # Update the grading column based on the group indices
        blast.loc[redindex, 'grading'] = self.grading[0]
        blast.loc[blueindex, 'grading'] = self.grading[1]
        blast.loc[grayindex, 'grading'] = self.grading[2]

        # Return only the rows with non-zero grading
        return blast[blast['grading'] > 0]

    def run(self):
        # Read and process lens files
        lens1 = base.newlens(self.lens1, 'order')
        lens2 = base.newlens(self.lens2, 'order')
        # Read and process gff files
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        # Filter gff data based on lens indices
        gff1 = gff1[gff1['chr'].isin(lens1.index)]
        gff2 = gff2[gff2['chr'].isin(lens2.index)]
        # Process blast data

        blast = base.newblast(self.blast, int(self.score), float(self.evalue),gff1, gff2, self.blast_reverse)

        # Map positions and chromosome information
        blast['loc1'] = blast[0].map(gff1[self.position])
        blast['loc2'] = blast[1].map(gff2[self.position])
        blast['chr1'] = blast[0].map(gff1['chr'])
        blast['chr2'] = blast[1].map(gff2['chr'])
        # Apply blast filtering and grading
        if self.comparison.lower() == 'genomes':
            blast = self.deal_blast_for_genomes(blast, int(self.multiple), int(self.repeat_number))
        if self.comparison.lower() == 'chromosomes':
            blast = self.deal_blast_for_chromosomes(blast, int(self.multiple), int(self.repeat_number))
        print(f"The filtered homologous gene pairs are {len(blast)}.\n")
        if len(blast) < 1:
            print("Stopped!\n\nIt may be that the id1 and id2 in the BLAST file do not match with (gff1, lens1) and (gff2, lens2).")
            sys.exit(1)
        # Group blast data by 'chr1' and 'chr2'
        total = []
        for (chr1, chr2), group in blast.groupby(['chr1', 'chr2']):
            total.append([chr1, chr2, group])
        del blast, group
        gc.collect()
        # Determine chunk size for multiprocessing
        n = int(np.ceil(len(total) / float(self.process)))
        result, data = '', []
        # Initialize multiprocessing Pool
        pool = Pool(self.process)
        try:
            for i in range(0, len(total), n):
                # Apply single_pool function asynchronously
                data.append(pool.apply_async(
                    self.single_pool, args=(total[i:i + n], gff1, gff2, lens1, lens2)
                ))
            pool.close()
            pool.join()
        except BaseException:
            # Stop the workers, then re-raise instead of silently continuing
            pool.terminate()
            raise
        for k in data:
            # Collect results from async tasks
            text = k.get()
            if text:
                result += text
        # Write final output to file
        result = re.split('\n', result)
        num = 1
        with open(self.savefile, 'w') as fout:
            for line in result:
                if re.match(r"# Alignment", line):
                    # Renumber alignments consecutively across all chromosome pairs
                    s = f'# Alignment {num}:'
                    fout.write(s + line.split(':', 1)[1] + '\n')
                    num += 1
                    continue
                if len(line) > 0:
                    fout.write(line + '\n')
        sys.exit(0)

    def single_pool(self, group, gff1, gff2, lens1, lens2):
        text = ''
        for bk in group:
            chr1, chr2 = str(bk[0]), str(bk[1])
            print(f'Running {chr1} vs {chr2}')
            # Extract and sort points
            points = bk[2][['loc1', 'loc2', 'grading']].sort_values(
                by=['loc1', 'loc2'], ascending=[True, True]
            )
            # Initialize collinearity analysis
            collinearity = improvedcollinearity.collinearity(
                self.options, points)
            data = collinearity.run()
            if not data:
                continue
            # Extract gene information
            gf1 = gff1[gff1['chr'] == chr1].reset_index().set_index('order')[[1, 'strand']]
            gf2 = gff2[gff2['chr'] == chr2].reset_index().set_index('order')[[1, 'strand']]
            n = 1
            for block, evalue, score in data:
                if len(block) < self.over_gap:
                    continue
                # Map gene names and strands
                block['name1'] = block['loc1'].map(gf1[1])
                block['name2'] = block['loc2'].map(gf2[1])
                block['strand1'] = block['loc1'].map(gf1['strand'])
                block['strand2'] = block['loc2'].map(gf2['strand'])
                block['strand'] = np.where(
                    block['strand1'] == block['strand2'], '1', '-1'
                )
                # Prepare text output
                block['text'] = block.apply(
                    lambda x: f"{x['name1']} {x['loc1']} {x['name2']} {x['loc2']} {x['strand']}\n",
                    axis=1
                )
                # Determine alignment mark
                a, b = block['loc2'].head(2).values
                mark = 'plus' if a < b else 'minus'
                # Append alignment information
                text += f'# Alignment {n}: score={score} pvalue={evalue} N={len(block)} {chr1}&{chr2} {mark}\n'
                text += ''.join(block['text'].values)
                n += 1
        return text
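

# Illustrative sketch (not part of the original module): the grading tiers that
# deal_blast_for_genomes assigns to the ranked hits of one query gene. The
# helper name `grade_hits` is hypothetical; its defaults are assumed to mirror
# those set in mycollinearity.__init__ (multiple=1, repeat_number=10,
# grading=50,40,25).
def grade_hits(n_hits, multiple=1, repeat_number=10, grading=(50, 40, 25)):
    rednum = multiple
    bluenum = 4 + rednum
    grades = []
    # Only the best `repeat_number` hits (ranked by score) are kept.
    for rank in range(min(n_hits, repeat_number)):
        if rank < rednum:
            grades.append(grading[0])   # best hit(s): "red"
        elif rank < bluenum:
            grades.append(grading[1])   # next four hits: "blue"
        else:
            grades.append(grading[2])   # remaining kept hits: "gray"
    return grades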

================================================
FILE: build/lib/wgdi/shared_fusion.py
================================================
import pandas as pd
import wgdi.base as base

class shared_fusion:
    def __init__(self, options):
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        
        # Handle classid and limit_length options
        self.classid = [str(k) for k in self.classid.split(',')] if hasattr(self, 'classid') else ['class1', 'class2']
        self.limit_length = int(self.limit_length) if hasattr(self, 'limit_length') else 20
        
        # Clean and split lens files
        self.lens1 = self.lens1.replace(' ', '').split(',')
        self.lens2 = self.lens2.replace(' ', '').split(',')

    def run(self):
        # Read classification files and block information
        ancestor_left = base.read_classification(self.ancestor_left)
        ancestor_top = base.read_classification(self.ancestor_top)
        bkinfo = pd.read_csv(self.blockinfo)

        # Preprocess blockinfo columns
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo['start1'] = bkinfo['start1'].astype(int)
        bkinfo['end1'] = bkinfo['end1'].astype(int)
        bkinfo['start2'] = bkinfo['start2'].astype(int)
        bkinfo['end2'] = bkinfo['end2'].astype(int)
        
        # Filter based on ancestor chromosomes
        # (copy so block_fusions can add columns without acting on a view)
        bkinfo = bkinfo[(bkinfo['chr1'].isin(ancestor_left[0].values)) &
                        (bkinfo['chr2'].isin(ancestor_top[0].values))].copy()

        # Read lens files
        lens1 = pd.read_csv(self.lens1[0], sep='\t', header=None)
        lens2 = pd.read_csv(self.lens2[0], sep='\t', header=None)
        lens1[0] = lens1[0].astype(str)
        lens2[0] = lens2[0].astype(str)

        # Perform block fusion analysis
        blockinfoout = self.block_fusions(bkinfo, ancestor_left, ancestor_top)

        # Apply filters based on breakpoints and length
        blockinfoout = blockinfoout[(blockinfoout['breakpoints1'] == 1) & 
                                     (blockinfoout['breakpoints2'] == 1)]
        blockinfoout = blockinfoout[(blockinfoout['break_length1'] >= self.limit_length) & 
                                     (blockinfoout['break_length2'] >= self.limit_length)]

        # Save the filtered block info
        blockinfoout.to_csv(self.filtered_blockinfo, index=False)

        # Filter lens data based on the blockinfoout
        lens1 = lens1[lens1[0].isin(blockinfoout['chr1'].values)]
        lens2 = lens2[lens2[0].isin(blockinfoout['chr2'].values)]

        # Save filtered lens data
        lens1.to_csv(self.lens1[1], sep='\t', index=False, header=False)
        lens2.to_csv(self.lens2[1], sep='\t', index=False, header=False)

    def block_fusions(self, bkinfo, ancestor_left, ancestor_top):
        # Initialize new columns in the bkinfo dataframe
        bkinfo['breakpoints1'] = 0
        bkinfo['breakpoints2'] = 0
        bkinfo['break_length1'] = 0
        bkinfo['break_length2'] = 0

        for index, row in bkinfo.iterrows():
            # Process species 1 (chr1)
            a, b = sorted([row['start1'], row['end1']])
            d1 = ancestor_left[(ancestor_left[0] == row['chr1']) & 
                               (ancestor_left[2] >= a) & (ancestor_left[1] <= b)]
            if len(d1) > 1:
                bkinfo.loc[index, 'breakpoints1'] = 1
                breaklength_max = 0
                for _, row2 in d1.iterrows():
                    # Overlap of [a, b) with the ancestral region [row2[1], row2[2])
                    length_in = max(0, min(b, row2[2]) - max(a, row2[1]))
                    length_out = (b - a) - length_in
                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)
                bkinfo.loc[index, 'break_length1'] = breaklength_max

            # Process species 2 (chr2)
            c, d = sorted([row['start2'], row['end2']])
            d2 = ancestor_top[(ancestor_top[0] == row['chr2']) & 
                              (ancestor_top[2] >= c) & (ancestor_top[1] <= d)]
            if len(d2) > 1:
                bkinfo.loc[index, 'breakpoints2'] = 1
                breaklength_max = 0
                for _, row2 in d2.iterrows():
                    # Overlap of [c, d) with the ancestral region [row2[1], row2[2])
                    length_in = max(0, min(d, row2[2]) - max(c, row2[1]))
                    length_out = (d - c) - length_in
                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)
                bkinfo.loc[index, 'break_length2'] = breaklength_max

        return bkinfo
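

# Illustrative sketch, not part of the original module: the per-region score
# computed inside block_fusions, restated as two small standalone helpers.
# The names interval_overlap and break_length are hypothetical; they only
# assume half-open integer ranges, as range(a, b) is used above.
def interval_overlap(a, b, start, end):
    """Return the length of the overlap between [a, b) and [start, end)."""
    return max(0, min(b, end) - max(a, start))


def break_length(a, b, start, end):
    """Return min(inside, outside) + 1 for a block [a, b) against one region."""
    length_in = interval_overlap(a, b, start, end)
    length_out = (b - a) - length_in
    return min(length_in, length_out) + 1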


================================================
FILE: build/lib/wgdi/trees.py
================================================
import os
import shutil
from io import StringIO

import numpy as np
import pandas as pd
from Bio import AlignIO, Seq, SeqIO, SeqRecord
import subprocess

import wgdi.base as base


class trees():
    def __init__(self, options):
        base_conf = base.config()
        self.position = 'order'
        self.alignfile = ''
        self.align_trimming = ''
        self.trimming = 'trimal'
        self.threads = '1'
        self.minimum = 4
        self.tree_software = 'iqtree'
        self.delete_detail = True
        for k, v in base_conf:
            setattr(self, str(k), v)
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        if hasattr(self, 'codon_position'):
            self.codon_position = [
                int(k)-1 for k in self.codon_position.split(',')]
        else:
            self.codon_position = [0, 1, 2]
        self.delete_detail = base.str_to_bool(self.delete_detail)

    def grouping(self, alignment):
        data = []
        indexs = []
        if not os.path.exists(self.dir):
            os.makedirs(self.dir)
        sequence = SeqIO.to_dict(SeqIO.parse(self.sequence_file, "fasta"))
        if hasattr(self, 'cds_file'):
            seq_cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta"))
        for index, row in alignment.iterrows():
            file = base.gen_md5_id(str(row.values))
            self.sequencefile = os.path.join(self.dir, file+'.fasta')
            self.alignfile = os.path.join(self.dir, file+'.aln')
            self.align_trimming = self.alignfile+'.trimming'
            self.treefile = os.path.join(self.dir, file+'.aln.treefile')
            if os.path.isfile(self.treefile) and os.path.isfile(self.alignfile):
                data.append(self.treefile)
                indexs.append(index)
                continue
            ids = []
            ids_cds = []
            for i in range(len(row)):
                if type(row[i]) == float and np.isnan(row[i]):
                    continue
                gene_sequence = sequence[row[i]]
                gene_sequence.id = str(int(i)+1)
                gene_sequence.description = ''
                ids.append(gene_sequence)
            SeqIO.write(ids, self.sequencefile, "fasta")
            self.align()
            if hasattr(self, 'cds_file'):
                self.seqcdsfile = os.path.join(self.dir, file+'.cds.fasta')
                for i in range(len(row)):
                    if type(row[i]) == float and np.isnan(row[i]):
                        continue
                    gene_cds = seq_cds[row[i]]
                    gene_cds.id = str(int(i)+1)
                    ids_cds.append(gene_cds)
                SeqIO.write(ids_cds, self.seqcdsfile, "fasta")
                self.pal2nal()
                self.codon()
            if self.trimming.upper() == 'TRIMAL':
                self.trimal()
            if self.trimming.upper() == 'DIVVIER':
                self.divvier()
            self.buildtrees()
            if os.path.isfile(self.treefile):
                data.append(self.treefile)
        return data

    def codon(self):
        if self.codon_position == [0, 1, 2]:
            shutil.move(self.alignfile+'.mrtrans', self.alignfile)
            return True
        records = list(SeqIO.parse(self.alignfile+'.mrtrans', 'fasta'))
        if len(records) == 0:
            return False
        newrecords = []
        def final_list(test_list, x, y): return [
            test_list[i+j] for i in range(0, len(test_list), x) for j in y]
        for k in records:
            if len(k.seq) % 3 > 0:
                return False
            seq = final_list(k.seq, 3, self.codon_position)
            k.seq = ''.join(seq)
            newrecords.append(SeqRecord.SeqRecord(
                Seq.Seq(k.seq), id=k.id, description=''))
        SeqIO.write(newrecords, self.alignfile, 'fasta')
        return True

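    def pal2nal(self):
        # Run pal2nal and redirect its FASTA output to <alignfile>.mrtrans
        args = ['perl', self.pal2nal_path, self.alignfile,
                self.seqcdsfile, '-output fasta', '>'+self.alignfile+'.mrtrans']
        command = ' '.join(args)
        try:
            subprocess.run(command, shell=True, check=True)
        except subprocess.CalledProcessError as e:
            print(f"Error while running pal2nal: {e}")
            return False
        return True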

    def align(self):
        if self.align_software == 'mafft':
            try:
                command = [self.mafft_path,'--quiet', self.sequencefile, '>', self.alignfile]
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running MAFFT: {e}")

        if self.align_software == 'muscle':
            try:
                command = [self.muscle_path,'-align', self.sequencefile, '-output', self.alignfile, '-quiet']
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running Muscle: {e}")

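    def trimal(self):
        args = [self.trimal_path, '-in', self.alignfile,
                '-out', self.align_trimming, '-automated1']
        command = ' '.join(args)
        try:
            subprocess.run(command, shell=True, check=True)
        except subprocess.CalledProcessError as e:
            print(f"Error while running trimAl: {e}")
            return False
        return True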

    def divvier(self):
        args = [self.divvier_path, '-mincol', '4', '-divvygap', self.alignfile]
        command = ' '.join(args)
        os.system(command)
        try:
            # Move the <alignfile>.divvy.fas output to the trimmed-alignment path
            os.rename(self.alignfile+'.divvy.fas', self.align_trimming)
        except OSError:
            return False
        return True

    def buildtrees(self):
        try:
            if self.tree_software.upper() == 'IQTREE':
                args = [self.iqtree_path, '-s', self.align_trimming,
                        '-m', self.model, '-T', self.threads, '--quiet']
                command = ' '.join(args)
                os.system(command)
                os.rename(self.align_trimming+'.treefile', self.treefile)
            elif self.tree_software.upper() == 'FASTTREE':
                args = [self.fasttree_path,
                        self.align_trimming, '>', self.treefile]
                command = ' '.join(args)
                os.system(command)
        except:
            return False
        if self.delete_detail:
            for file in (self.sequencefile, self.align_trimming+'.bionj', self.align_trimming+'.iqtree', self.align_trimming+'.ckp.gz',
                         self.align_trimming+'.log', self.align_trimming+'.mldist', self.align_trimming+'.model.gz'):
                try:
                    os.remove(file)
                except OSError:
                    pass
        return True

    def run(self):
        alignment = pd.read_csv(self.alignment, header=None)
        alignment.replace('.', np.nan, inplace=True)
        alignment.dropna(thresh=int(self.minimum), inplace=True)
        if hasattr(self, 'gff') and hasattr(self, 'lens'):
            gff = base.newgff(self.gff)
            lens = base.newlens(self.lens, self.position)
            alignment = pd.merge(
                alignment, gff[['chr', self.position]], left_on=0, right_on=gff.index, how='left')
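            # Use the configured position column ('order' or 'end'), matching the merge above
            alignment.dropna(subset=['chr', self.position], inplace=True)
            alignment[self.position] = alignment[self.position].astype(int)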
            alignment = alignment[alignment['chr'].isin(lens.index)]
            alignment.drop(alignment.columns[-2:], axis=1, inplace=True)
        data = self.grouping(alignment)
        # Concatenate all tree files in Python so the step does not depend on `cat`
        with open(self.trees_file, 'w') as fout:
            for treefile in data:
                with open(treefile) as fin:
                    shutil.copyfileobj(fin, fout)
        df = pd.read_csv(self.trees_file, header=None, sep='\t')
        df[0].to_csv(self.trees_file, index=None, sep='\t', header=False)
        print("done")
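

# Illustrative sketch, not part of the original module: the codon-position
# selection performed in trees.codon(), shown on a plain string. The function
# name select_codon_positions is hypothetical; positions are 0-based, as they
# are after __init__ subtracts 1 from the configured codon_position values.
def select_codon_positions(seq, positions):
    """Keep only the given 0-based codon positions of an in-frame sequence."""
    if len(seq) % 3:
        raise ValueError("sequence length is not a multiple of 3")
    return ''.join(seq[i + j] for i in range(0, len(seq), 3) for j in positions)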

================================================
FILE: command.txt
================================================
python setup.py sdist bdist_wheel
twine upload dist/*

================================================
FILE: setup.py
================================================
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from setuptools import find_packages, setup

with open("README.md", "r", encoding='utf-8') as fh:
    long_description = fh.read()

required = ['pandas>=1.1.0', 'numpy', 'biopython', 'matplotlib', 'scipy', 'tabulate']

setup(
    name="wgdi",
    version="0.75",
    author="Pengchuan Sun",
    author_email="sunpengchuan@gmail.com",
    description="A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes",
    license="BSD License",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/SunPengChuan/wgdi",
    packages=find_packages(),
    package_data={'': ['*.conf','*.ini', '*.csv']},
    classifiers=[
        "Intended Audience :: Science/Research",
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: BSD License",
        "Operating System :: OS Independent",
    ],
    entry_points={
        'console_scripts': [
            'wgdi = wgdi.run:main',
        ]
    },
    zip_safe=True,
    install_requires=required
)


================================================
FILE: wgdi/__init__.py
================================================


================================================
FILE: wgdi/align_dotplot.py
================================================
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base

class align_dotplot:
    def __init__(self, options):
        # Default values
        self.position = 'order'
        self.figsize = 'default'
        self.classid = 'class1'

        # Initialize from options
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{k} = {v}')
        
        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
        self.colors = [str(k) for k in getattr(self, 'colors', 'red,blue,green,black,orange').split(',')]
        self.ancestor_top = None if getattr(self, 'ancestor_top', 'none') == 'none' else self.ancestor_top
        self.ancestor_left = None if getattr(self, 'ancestor_left', 'none') == 'none' else self.ancestor_left

        # Default to False when blockinfo_reverse is not given in the options
        self.blockinfo_reverse = base.str_to_bool(getattr(self, 'blockinfo_reverse', False))

    def pair_position(self, alignment, loc1, loc2, colors):
        alignment.index = alignment.index.map(loc1)
        data = []
        for i, k in enumerate(alignment.columns):
            df = alignment[k].map(loc2).dropna()
            for idx, row in df.items():
                data.append([idx, row, colors[i]])
        return pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])

    def run(self):
        axis = [0, 1, 1, 0]

        # Lens generation and figure size
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10
            
        plt.rcParams['ytick.major.pad'] = 0

        # Create plot
        fig, ax = plt.subplots(figsize=self.figsize)
        ax.xaxis.set_ticks_position('top')
        step1, step2 = 1 / float(lens1.sum()), 1 / float(lens2.sum())

        # Process Ancestor Data
        if self.ancestor_left:
            axis[0] = -0.02
            lens_ancestor_left = self.process_ancestor(self.ancestor_left, lens1.index)

        if self.ancestor_top:
            axis[3] = -0.02
            lens_ancestor_top = self.process_ancestor(self.ancestor_top, lens2.index)

        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, 
                           self.genome1_name, self.genome2_name, [0, 1])

        # Process GFF files
        gff1, gff2 = base.newgff(self.gff1), base.newgff(self.gff2)
        gff1 = base.gene_location(gff1, lens1, step1, self.position)
        gff2 = base.gene_location(gff2, lens2, step2, self.position)

        if self.ancestor_top:
            self.ancestor_position(ax, gff2, lens_ancestor_top, 'top')

        if self.ancestor_left:
            self.ancestor_position(ax, gff1, lens_ancestor_left, 'left')

        # Process block info and alignment
        bkinfo = self.process_blockinfo(lens1,lens2)
        align = self.alignment(gff1, gff2, bkinfo)
        alignment = align[gff1.columns[-len(bkinfo[self.classid].drop_duplicates()):]]
        alignment.to_csv(self.savefile, header=False)

        # Create scatter plot
        df = self.pair_position(alignment, gff1['loc'], gff2['loc'], self.colors)
        plt.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'], 
                    alpha=0.5, edgecolors=None, linewidths=0, marker='o')

        ax.axis(axis)
        plt.subplots_adjust(left=0.07, right=0.97, top=0.93, bottom=0.03)
        plt.savefig(self.savefig, dpi=500)
        plt.show()

    def process_ancestor(self, ancestor_file, lens_index):
        df = pd.read_csv(ancestor_file, sep="\t", header=None)
        df[0] = df[0].astype(str)
        df[3] = df[3].astype(str)
        df[4] = df[4].astype(int)
        df[4] = df[4] / df[4].max()
        return df[df[0].isin(lens_index)]

    def process_blockinfo(self, lens1, lens2):
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        if self.blockinfo_reverse:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo[self.classid] = bkinfo[self.classid].astype(str)
        return bkinfo[bkinfo['chr1'].isin(lens1.index) & (bkinfo['chr2'].isin(lens2.index))]

    def alignment(self, gff1, gff2, bkinfo):
        gff1['uid'] = gff1['chr'] + 'g' + gff1['order'].astype(str)
        gff2['uid'] = gff2['chr'] + 'g' + gff2['order'].astype(str)
        gff1['id'] = gff1.index
        gff2['id'] = gff2.index
        
        for cl, group in bkinfo.groupby(self.classid):
            name = f'l{cl}'
            gff1[name] = ''
            group = group.sort_values(by=['length'], ascending=True)

            for _, row in group.iterrows():
                block = self.create_block_dataframe(row)
                if block.empty:
                    continue
                block1_min, block1_max = block['block1'].agg(['min', 'max'])
                area = gff1[(gff1['chr'] == row['chr1']) & 
                            (gff1['order'] >= block1_min) & 
                            (gff1['order'] <= block1_max)].index
                
                block['id1'] = (row['chr1'] + 'g' + block['block1'].astype(str)).map(
                    dict(zip(gff1['uid'], gff1.index)))
                block['id2'] = (row['chr2'] + 'g' + block['block2'].astype(str)).map(
                    dict(zip(gff2['uid'], gff2.index)))

                gff1.loc[block['id1'].values, name] = block['id2'].values
                gff1.loc[gff1.index.isin(area) & gff1[name].eq(''), name] = '.'
        return gff1

    def create_block_dataframe(self, row):
        b1, b2, ks = row['block1'].split('_'), row['block2'].split('_'), row['ks'].split('_')
        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))
        block = pd.DataFrame(np.array([b1, b2, ks]).T, columns=['block1', 'block2', 'ks'])
        block['block1'] = block['block1'].astype(int)
        block['block2'] = block['block2'].astype(int)
        block['ks'] = block['ks'].astype(float)
        return block[(block['ks'] <= self.ks_area[1]) & 
                     (block['ks'] >= self.ks_area[0])].drop_duplicates(subset=['block1'], keep='first')

    def ancestor_position(self, ax, gff, lens, mark):
        for _, row in lens.iterrows():
            loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index
            loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2]))].index
            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']
            if mark == 'top':
                width = abs(loc1-loc2)
                loc = [min(loc1, loc2), 0]
                height = -0.02
            elif mark == 'left':
                height = abs(loc1-loc2)
                loc = [-0.02, min(loc1, loc2)]
                width = 0.02
            base.Rectangle(ax, loc, height, width, row[3], row[4])

================================================
FILE: wgdi/ancestral_karyotype.py
================================================
import pandas as pd
from Bio import SeqIO
import wgdi.base as base


class ancestral_karyotype:
    def __init__(self, options):
        self.mark = 'aak'
        
        # Set attributes from options
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")

    def run(self):
        # Load and filter data
        gff = base.newgff(self.gff)
        ancestor = base.read_classification(self.ancestor)
        gff = gff[gff['chr'].isin(ancestor[0].values.tolist())]

        # Create new gff copy and initialize required variables
        newgff = gff.copy()
        data, num = [], 1

        # Create dictionary mapping chromosome to order
        chr_arr = ancestor[3].drop_duplicates().to_list()
        chr_dict = {name: idx + 1 for idx, name in enumerate(chr_arr)}
        ancestor['order'] = ancestor[3].map(chr_dict)

        dict1, dict2 = {}, {}

        # Process ancestor and gff information
        for (cla, order), group in ancestor.groupby([4, 'order'], sort=[False, False]):
            for index, row in group.iterrows():
                index1 = gff[(gff['chr'] == row[0]) & (gff['order'] >= row[1]) & (gff['order'] <= row[2])].index
                newgff.loc[index1, 'chr'] = str(num)
                
                # Store results in data
                for k in index1:
                    data.append(newgff.loc[k, :].values.tolist() + [k])

            dict1[str(num)] = cla
            dict2[str(num)] = group[3].values[0]
            num += 1

        # Create dataframe from the data collected
        df = pd.DataFrame(data)

        # Filter based on peptide file
        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta"))
        df = df[df[6].isin(pep.keys())]

        # Assign new names and order
        for name, group in df.groupby(0):
            df.loc[group.index, 'order'] = range(1, len(group) + 1)
            df.loc[group.index, 'newname'] = [f"{self.mark}{name}g{i:05d}" for i in range(1, len(group) + 1)]

        # Set data types and sort
        df['order'] = df['order'].astype(int)
        df = df[[0, 'newname', 1, 2, 3, 'order', 6]].sort_values(by=[0, 'order'])

        # Save output files
        df.to_csv(self.ancestor_gff, sep="\t", index=False, header=None)
        lens = df.groupby(0).max()[[2, 'order']]
        lens.to_csv(self.ancestor_lens, sep="\t", header=None)

        # Add extra columns and save final results
        lens[1] = 1
        lens['color'] = lens.index.map(dict2)
        lens['class'] = lens.index.map(dict1)
        lens[[1, 'order', 'color', 'class']].to_csv(self.ancestor_file, sep="\t", header=None)

        # Update peptide sequences with new IDs and save
        id_dict = df.set_index(6).to_dict()['newname']
        seqs = []

        for seq_record in SeqIO.parse(self.pep_file, "fasta"):
            if seq_record.id in id_dict:
                seq_record.id = id_dict[seq_record.id]
                seqs.append(seq_record)

        SeqIO.write(seqs, self.ancestor_pep, "fasta")
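

# Illustrative sketch, not part of the original module: the gene naming scheme
# used in run(), i.e. <mark><chromosome>g<5-digit index>. The function name
# new_gene_names is hypothetical.
def new_gene_names(mark, chrom, count):
    """Return count sequential gene names for one reconstructed chromosome."""
    return [f"{mark}{chrom}g{i:05d}" for i in range(1, count + 1)]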


================================================
FILE: wgdi/ancestral_karyotype_repertoire.py
================================================

import numpy as np
import pandas as pd
from Bio import SeqIO

import wgdi.base as base

class ancestral_karyotype_repertoire():
    def __init__(self, options):
        self.gap = 5
        self.direction = 0.01
        self.mark = 'aak1s'
        self.blockinfo_reverse = False
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)

    def run(self):
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        if self.blockinfo_reverse:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
        for index, row in bkinfo.iterrows():
            block1, block2 = row['block1'].split('_'), row['block2'].split('_')
            block1, block2 = [int(k) for k in block1], [int(k) for k in block2]
            # Choose the insertion direction per block, so that one reversed
            # block does not leave the direction negative for later blocks.
            step = abs(float(self.direction))
            direction = -step if block1[1] < block1[0] else step
            for i in range(1, len(block2)):
                if abs(block1[i]-block1[i-1]) == 1 and abs(block2[i]-block2[i-1]) < int(self.gap):
                    gff1_id = gff1[(gff1['chr'] == str(row['chr1'])) & (
                        gff1['order'] == block1[i])].index[0]
                    order = gff1.loc[gff1_id, 'order']
                    gff1_row = gff1.loc[gff1_id, :].copy()
                    for num in range(block2[i-1], block2[i]):
                        order = order + direction
                        gene_id = gff2[(gff2['chr'] == str(row['chr2']))
                                       & (gff2['order'] == num)].index[0]
                        gff1_row['order'] = order
                        gff1.loc[gene_id, :] = gff1_row
        df = gff1.copy()
        df = df.sort_values(by=['chr', 'order'])
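        # Group by the column name (not a one-item list) so `name` is a plain
        # string rather than a 1-tuple under pandas >= 2.0
        for name, group in df.groupby('chr'):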
            df.loc[group.index, 'order'] = list(range(1, len(group)+1))
            df.loc[group.index, 'newname'] = list(
                [str(self.mark)+str(name)+'g'+str(i).zfill(5) for i in range(1, len(group)+1)])
        df['order'] = df['order'].astype(int)
        df['oldname'] = df.index
        columns = ['chr', 'newname', 'start',
                   'end', 'strand', 'order', 'oldname']
        df[columns].to_csv(self.ancestor_gff, sep="\t",
                           index=False, header=None)
        lens = df.groupby('chr').max()[['end', 'order']]
        lens['end'] = lens['end'].astype(np.int64)
        lens.to_csv(self.ancestor_lens, sep="\t", header=None)
        ancestor = base.read_classification(self.ancestor)
        for index, row in ancestor.iterrows():
            ancestor.at[index, 1] = 1
            ancestor.at[index, 2] = lens.at[str(row[0]),'order']
        ancestor.to_csv(self.ancestor_new, sep="\t", index=False, header=None)
        id_dict = df['newname'].to_dict()
        seqs = []
        for seq_record in SeqIO.parse(self.ancestor_pep, "fasta"):
            if seq_record.id in id_dict:
                seq_record.id = id_dict[seq_record.id]
            else:
                continue
            seq_record.description = ''
            seqs.append(seq_record)
        SeqIO.write(seqs, self.ancestor_pep_new, "fasta")


================================================
FILE: wgdi/base.py
================================================
import configparser
import hashlib
import os
import re

import matplotlib
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
from Bio import SeqIO

import wgdi


def gen_md5_id(item):
    """Generate MD5 hash for the given item."""
    return hashlib.md5(item.encode('utf-8')).hexdigest()


def config():
    """Read configuration from the example conf.ini file."""
    conf = configparser.ConfigParser()
    conf.read(os.path.join(wgdi.__path__[0], 'example/conf.ini'))
    return conf.items('ini')


def load_conf(file, section):
    """Load configuration items from the specified section."""
    conf = configparser.ConfigParser()
    conf.read(file)
    return conf.items(section)


def rewrite(file, section):
    """Rewrite the configuration file to keep only the specified section."""
    conf = configparser.ConfigParser()
    conf.read(file)
    if conf.has_section(section):
        for k in conf.sections():
            if k != section:
                conf.remove_section(k)
        with open(os.path.join(wgdi.__path__[0], 'example/conf.ini'), 'w') as f:
            conf.write(f)
        print('Option ini has been modified')
    else:
        print('Option ini no change')


def read_colinearscan(file):
    """Read colinearscan output and parse into data structure."""
    data, b, flag, num = [], [], 0, 1
    with open(file) as f:
        for line in f:
            line = line.strip()
            if re.match(r"the", line):
                num = re.search(r'\d+', line).group()
                b = []
                flag = 1
                continue
            if re.match(r"\>LOCALE", line):
                flag = 0
                p = re.split(':', line)
                if b:
                    data.append([num, b, p[1]])
                b = []
                continue
            if flag == 1:
                a = re.split(r"\s", line)
                b.append(a)
    if b:
        data.append([num, b, p[1]])
    return data


def read_mcscanx(fn):
    """Read mcscanx output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        flag, num = 0, 0
        for line in f1:
            line = line.strip()
            if re.match(r"## Alignment", line):
                flag = 1
                if not b:
                    arr = re.findall(r"[\d+\.]+", line)[0]
                    continue
                data.append([num, b, 0])
                b = []
                num = re.findall(r"\d+", line)[0]
                continue
            if flag == 0:
                continue
            a = re.split(r"\:", line)
            c = re.split(r"\s+", a[1])
            b.append([c[1], c[1], c[2], c[2]])
        if b:
            data.append([num, b, 0])
    return data


def read_jcvi(fn):
    """Read jcvi output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        num = 1
        for line in f1:
            line = line.strip()
            if re.match(r"###", line):
                if b:
                    data.append([num, b, 0])
                    b = []
                num += 1
                continue
            a = re.split(r"\t", line)
            b.append([a[0], a[0], a[1], a[1]])
        if b:
            data.append([num, b, 0])
    return data


def read_collinearity(fn):
    """Read collinearity output and parse into data structure."""
    with open(fn) as f1:
        data, b = [], []
        flag, arr = 0, []
        for line in f1:
            line = line.strip()
            if re.match(r"# Alignment", line):
                flag = 1
                if not b:
                    arr = re.findall(r'[\.\d+]+', line)
                    continue
                data.append([arr[0], b, arr[2]])
                b = []
                arr = re.findall(r'[\.\d+]+', line)
                continue
            if flag == 0:
                continue
            b.append(re.split(r"\s", line))
        if b:
            data.append([arr[0], b, arr[2]])
    return data


def read_ks(file, col):
    """Read KS values from file and select specified column."""
    ks = pd.read_csv(file, sep='\t')
    ks.drop_duplicates(subset=['id1', 'id2'], keep='first', inplace=True)
    ks[col] = ks[col].astype(float)
    ks = ks[ks[col] >= 0]
    ks.index = ks['id1'] + ',' + ks['id2']
    return ks[col]


def get_median(data):
    """Calculate the median of the data list."""
    if not data:
        return 0
    data_sorted = sorted(data)
    half = len(data_sorted) // 2
    return (data_sorted[half] + data_sorted[-(half + 1)]) / 2


def cds_to_pep(cds_file, pep_file, fmt='fasta'):
    """Translate CDS sequences to peptide sequences and write to file."""
    records = list(SeqIO.parse(cds_file, fmt))
    for rec in records:
        rec.seq = rec.seq.translate()
    SeqIO.write(records, pep_file, 'fasta')
    return True


def newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):
    """Filter BLAST results based on score, evalue, and gene locations."""
    blast = pd.read_csv(file, sep="\t", header=None)
    
    if reverse == 'true':
        blast[[0, 1]] = blast[[1, 0]]
    blast = blast[(blast[11] >= score) & (blast[10] < evalue) & (blast[1] != blast[0])]
    blast = blast[(blast[0].isin(gene_loc1.index)) & (blast[1].isin(gene_loc2.index))]
    blast.drop_duplicates(subset=[0, 1], keep='first', inplace=True)
    blast[0] = blast[0].astype(str)
    blast[1] = blast[1].astype(str)
    return blast


def newgff(file):
    """Read GFF file and rename columns with appropriate data types."""
    gff = pd.read_csv(file, sep="\t", header=None, index_col=1)
    gff.rename(columns={0: 'chr', 2: 'start', 3: 'end', 4: 'strand', 5: 'order'}, inplace=True)
    gff['chr'] = gff['chr'].astype(str)
    gff['start'] = gff['start'].astype(np.int64)
    gff['end'] = gff['end'].astype(np.int64)
    gff['strand'] = gff['strand'].astype(str)
    gff['order'] = gff['order'].astype(int)
    return gff


def newlens(file, position):
    """Read lens file and select position based on 'order' or 'end'."""
    lens = pd.read_csv(file, sep="\t", header=None, index_col=0)
    lens.index = lens.index.astype(str)
    if position == 'order':
        lens = lens[2]
    elif position == 'end':
        lens = lens[1]
    return lens


def read_classification(file):
    """Read classification data and convert columns to appropriate types."""
    classification = pd.read_csv(file, sep="\t", header=None)
    classification[0] = classification[0].astype(str)
    classification[1] = classification[1].astype(int)
    classification[2] = classification[2].astype(int)
    classification[3] = classification[3].astype(str)
    classification[4] = classification[4].astype(int)
    return classification


def gene_location(gff, lens, step, position):
    """Calculate gene locations based on lens and step."""
    gff = gff[gff['chr'].isin(lens.index)].copy()
    if gff.empty:
        raise SystemExit('Stopped!\n\nChromosomes in gff file and lens file do not correspond.')
    # Cumulative offset of each chromosome along the concatenated axis.
    dict_chr = dict(zip(lens.index, np.append(np.array([0]), lens.cumsum()[:-1].values)))
    gff['loc'] = (gff['chr'].map(dict_chr) + gff[position]) * step
    return gff


def dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, genome2_name, arr, pad = 0):
    """Set up the dotplot frame with grid lines and labels."""
    for k in lens1.cumsum()[:-1] * step1:
        ax.axhline(y=k, alpha=0.8, color='black', lw=0.5)
    for k in lens2.cumsum()[:-1] * step2:
        ax.axvline(x=k, alpha=0.8, color='black', lw=0.5)
    align = dict(family='DejaVu Sans', style='italic', horizontalalignment="center", verticalalignment="center")
    yticks = lens1.cumsum() * step1 - 0.5 * lens1 * step1
    ax.set_yticks(yticks)
    ax.set_yticklabels(lens1.index, fontsize = 13, family='DejaVu Sans', style='normal')
    ax.tick_params(axis='y', which='major', pad = pad)
    ax.tick_params(axis='x', which='major', pad = pad)
    xticks = lens2.cumsum() * step2 - 0.5 * lens2 * step2
    ax.set_xticks(xticks)
    ax.set_xticklabels(lens2.index, fontsize = 13, family='DejaVu Sans', style='normal')
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    # arr is accepted for interface compatibility; label placement does not depend on it.
    ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)
    ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)

def Bezier3(plist, t):
    """Evaluate a quadratic Bezier curve (three control points) at parameter t."""
    p0, p1, p2 = plist
    return p0 * (1 - t) ** 2 + 2 * p1 * t * (1 - t) + p2 * t ** 2


def Bezier4(plist, t):
    """Evaluate a quartic Bezier curve (five control points) at parameter t."""
    p0, p1, p2, p3, p4 = plist
    return p0 * (1 - t) ** 4 + 4 * p1 * t * (1 - t) ** 3 + 6 * p2 * t ** 2 * (1 - t) ** 2 + 4 * p3 * (1 - t) * t ** 3 + p4 * t ** 4


def Rectangle(ax, loc, height, width, color, alpha):
    """Draw a rectangle on the axes with specified properties."""
    p = mpatches.Rectangle(loc, width, height, edgecolor=None, facecolor=color, alpha=alpha)
    ax.add_patch(p)

def str_to_bool(s):
    """Convert a config value to bool; only 'true' (any case, surrounding spaces ignored) is True."""
    if isinstance(s, bool):
        return s
    return str(s).strip().lower() == 'true'

================================================
FILE: wgdi/block_correspondence.py
================================================
import re
import numpy as np
import pandas as pd
import wgdi.base as base

class block_correspondence():
    def __init__(self, options):
        # Default values
        self.tandem = True
        self.pvalue = 0.2
        self.position = 'order'
        self.block_length = 5
        self.tandem_length = 200
        self.tandem_ratio = 1
        self.ks_hit = 0.5

        # Set user-defined options
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)

        # Parse ks_area and homo if present
        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]
        self.homo = [float(k) for k in self.homo.split(',')]
        self.tandem_ratio = float(self.tandem_ratio)
        # ks_hit is a string when it comes from the config file.
        self.ks_hit = float(self.ks_hit)
        self.tandem = base.str_to_bool(self.tandem)

    def run(self):
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        
        # Load block information from CSV
        bkinfo = pd.read_csv(self.blockinfo)
        bkinfo = self.preprocess_blockinfo(bkinfo, lens1, lens2)
        
        # Initialize correspondence DataFrame
        cor = self.initialize_correspondence(lens1, lens2)
        
        # If no tandem allowed, remove tandem regions
        if not self.tandem:
            bkinfo = self.remove_tandem(bkinfo)
        
        # Remove low KS hits
        bkinfo = self.remove_ks_hit(bkinfo)

        # Find collinearity regions and save results
        collinear_indices = self.collinearity_region(cor, bkinfo, lens1)
        bkinfo.loc[bkinfo.index.isin(collinear_indices), :].to_csv(self.savefile, index=False)

    def preprocess_blockinfo(self, bkinfo, lens1, lens2):
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        
        # Filter by length, chromosome indices, and p-value
        bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & 
                        (bkinfo['chr1'].isin(lens1.index)) & 
                        (bkinfo['chr2'].isin(lens2.index)) & 
                        (bkinfo['pvalue'] <= float(self.pvalue))]
        
        # Filter by tandem ratio if the column exists
        if 'tandem_ratio' in bkinfo.columns:
            bkinfo = bkinfo[bkinfo['tandem_ratio'] <= self.tandem_ratio]
        
        return bkinfo

    def initialize_correspondence(self, lens1, lens2):
        # Create correspondence DataFrame with initial values
        cor = [[k, i, 0, lens1[i], j, 0, lens2[j], float(self.homo[0]), float(self.homo[1])] 
               for k in range(1, int(self.multiple) + 1) 
               for i in lens1.index 
               for j in lens2.index]
        
        cor = pd.DataFrame(cor, columns=['sub', 'chr1', 'start1', 'end1', 'chr2', 'start2', 'end2', 'homo1', 'homo2'])
        cor['chr1'] = cor['chr1'].astype(str)
        cor['chr2'] = cor['chr2'].astype(str)
        
        return cor

    def remove_tandem(self, bkinfo):
        # Remove tandem regions from the DataFrame
        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
        group['start'] = group['start1'] - group['start2']
        group['end'] = group['end1'] - group['end2']
        tandem_condition = (group['start'].abs() <= int(self.tandem_length)) | (group['end'].abs() <= int(self.tandem_length))
        index_to_remove = group[tandem_condition].index
        return bkinfo.drop(index_to_remove)

    def remove_ks_hit(self, bkinfo):
        # Drop blocks where too few gene pairs have Ks inside ks_area.
        index_to_remove = []
        for index, row in bkinfo.iterrows():
            ks = self.get_ks_value(row['ks'])
            in_area = [k for k in ks if self.ks_area[0] <= k <= self.ks_area[1]]
            # A block with no Ks values cannot meet the threshold.
            if not ks or len(in_area) / len(ks) < float(self.ks_hit):
                index_to_remove.append(index)
        return bkinfo.drop(index_to_remove)

    def get_ks_value(self, ks_str):
        # Extract and return KS values as floats
        ks = ks_str.split('_')
        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))
        return ks

    def collinearity_region(self, cor, bkinfo, lens):
        collinear_indices = []
        for (chr1, chr2), group in bkinfo.groupby(['chr1', 'chr2']):
            group = group.sort_values(by=['length'], ascending=False)
            df = pd.Series(0, index=range(1, int(lens[str(chr1)]) + 1))
            for index, row in group.iterrows():
                # Check homology conditions
                if not self.is_valid_homo(row):
                    continue
                # Update the block series and compute ratio
                b1 = [int(k) for k in row['block1'].split('_')]
                df1 = df.copy()
                df1[b1] += 1
                ratio = (len(df1[df1 > 0]) - len(df[df > 0])) / len(b1)
                if ratio < 0.5:
                    continue
                df[b1] += 1
                collinear_indices.append(index)
        
        return collinear_indices

    def is_valid_homo(self, row):
        # Check if the homology values are within the specified range
        return self.homo[0] <= row['homo' + str(self.multiple)] <= self.homo[1]


================================================
FILE: wgdi/block_info.py
================================================
import numpy as np
import pandas as pd
import wgdi.base as base


class block_info:
    def __init__(self, options):
        self.repeat_number = 20
        self.ks_col = 'ks_NG86'
        self.blast_reverse = False
        self.position = 'order'
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        
        self.repeat_number = int(self.repeat_number)
        self.blast_reverse = base.str_to_bool(self.blast_reverse)

    def block_position(self, collinearity, blast, gff1, gff2, ks):
        data = []
        for block in collinearity:
            blk_homo, blk_ks = [], []

            # Skip blocks with missing gene coordinates in GFF files
            if block[1][0][0] not in gff1.index or block[1][0][2] not in gff2.index:
                continue
            
            # Extract chromosome info
            chr1, chr2 = gff1.at[block[1][0][0], 'chr'], gff2.at[block[1][0][2], 'chr']
            
            # Extract start and end positions
            array1, array2 = [float(i[1]) for i in block[1]], [float(i[3]) for i in block[1]]
            start1, end1 = array1[0], array1[-1]
            start2, end2 = array2[0], array2[-1]
            
            block1, block2 = [], []
            for k in block[1]:
                block1.append(int(float(k[1])))
                block2.append(int(float(k[3])))
                
                # Check for KS values
                pair_ks = self.get_ks_value(ks, k)
                blk_ks.append(pair_ks)

                # Retrieve blast homo data
                if k[0]+","+k[2] in blast.index:
                    blk_homo.append(blast.loc[k[0]+","+k[2], [f'homo{i}' for i in range(1, 6)]].values.tolist())
            
            ks_median, ks_average = self.calculate_ks_statistics(blk_ks)
            homo = self.calculate_homo_statistics(blk_homo)

            blkks = '_'.join([str(k) for k in blk_ks])
            block1 = '_'.join([str(k) for k in block1])
            block2 = '_'.join([str(k) for k in block2])
            
            # Calculate tandem ratio
            tandem_ratio = self.tandem_ratio(blast, gff2, block[1])
            
            # Store the results
            data.append([
                block[0], chr1, chr2, start1, end1, start2, end2, block[2], len(block[1]), 
                ks_median, ks_average, *homo, block1, block2, blkks, tandem_ratio
            ])
        
        # Create a DataFrame with the results
        data_df = pd.DataFrame(data, columns=[
            'id', 'chr1', 'chr2', 'start1', 'end1', 'start2', 'end2', 'pvalue', 'length', 
            'ks_median', 'ks_average', 'homo1', 'homo2', 'homo3', 'homo4', 'homo5', 
            'block1', 'block2', 'ks', 'tandem_ratio'
        ])

        # Calculate density
        data_df['density1'] = data_df['length'] / ((data_df['end1'] - data_df['start1']).abs() + 1)
        data_df['density2'] = data_df['length'] / ((data_df['end2'] - data_df['start2']).abs() + 1)

        return data_df

    def get_ks_value(self, ks, k):
        """Return KS value for the given pair of genes."""
        pair = f"{k[0]},{k[2]}"
        if pair in ks.index:
            return ks[pair]
        pair_rev = f"{k[2]},{k[0]}"
        if pair_rev in ks.index:
            return ks[pair_rev]
        return -1

    def calculate_ks_statistics(self, blk_ks):
        """Calculate KS statistics: median and average."""
        ks_arr = [k for k in blk_ks if k >= 0]
        if len(ks_arr) == 0:
            return -1, -1
        ks_median = base.get_median(ks_arr)
        ks_average = sum(ks_arr) / len(ks_arr)
        return ks_median, ks_average

    def calculate_homo_statistics(self, blk_homo):
        """Calculate homo statistics by averaging across all blocks."""
        df = pd.DataFrame(blk_homo)
        homo = df.mean().values if len(df) > 0 else [-1, -1, -1, -1, -1]
        return homo

    def blast_homo(self, blast, gff1, gff2, repeat_number):
        """Assign homo values based on blast data."""
        index = [group.sort_values(by=11, ascending=False)[:repeat_number].index.tolist() for name, group in blast.groupby([0])]
        blast = blast.loc[np.concatenate([k[:repeat_number] for k in index], dtype=object), [0, 1]]
        blast = blast.assign(homo1=np.nan, homo2=np.nan, homo3=np.nan, homo4=np.nan, homo5=np.nan)

        # Assign homo values
        for i in range(1, 6):
            bluenum = i + 5
            redindex = np.concatenate([k[:i] for k in index], dtype=object)
            blueindex = np.concatenate([k[i:bluenum] for k in index], dtype=object)
            grayindex = np.concatenate([k[bluenum:repeat_number] for k in index], dtype=object)
            blast.loc[redindex, f'homo{i}'] = 1
            blast.loc[blueindex, f'homo{i}'] = 0
            blast.loc[grayindex, f'homo{i}'] = -1
        
        blast['chr1_order'] = blast[0].map(gff1['order'])
        blast['chr2_order'] = blast[1].map(gff2['order'])
        return blast

    def tandem_ratio(self, blast, gff2, block):
        """Calculate tandem ratio for a block."""
        block = pd.DataFrame(block)[[0, 2]].rename(columns={0: 'id1', 2: 'id2'})
        block['order2'] = block['id2'].map(gff2['order'])

        # Filter block_blast data
        block_blast = blast[(blast[0].isin(block['id1'].values)) & (blast[1].isin(block['id2'].values))].copy()
        block_blast = pd.merge(block_blast, block, left_on=0, right_on='id1', how='left')
        block_blast['difference'] = (block_blast['chr2_order'] - block_blast['order2']).abs()

        # Filter based on difference and calculate ratio
        block_blast = block_blast[(block_blast['difference'] <= self.repeat_number) & (block_blast['difference'] > 0)]
        return len(block_blast[0].unique()) / len(block) * len(block_blast) / (len(block) + len(block_blast))

    def run(self):
        """Main function to run the analysis."""
        # Initialize required datasets
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)

        # Filter GFF files based on chromosome indices
        gff1 = gff1[gff1['chr'].isin(lens1.index)]
        gff2 = gff2[gff2['chr'].isin(lens2.index)]

        # Load blast data
        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)
        blast = self.blast_homo(blast, gff1, gff2, self.repeat_number)
        blast.index = blast[0] + ',' + blast[1]

        # Get collinearity data
        collinearity = self.auto_file(gff1, gff2)

        # Load ks data if a file is configured
        ks_file = getattr(self, 'ks', '')
        ks = pd.Series([], dtype=float) if ks_file in ('none', '') else base.read_ks(ks_file, self.ks_col)

        # Get the block position data
        data = self.block_position(collinearity, blast, gff1, gff2, ks)
        data['class1'] = 0
        data['class2'] = 0

        # Save results
        data.to_csv(self.savefile, index=False)

    def auto_file(self, gff1, gff2):
        """Auto-detect and read collinearity file."""
        with open(self.collinearity) as f:
            p = ' '.join(f.readlines()[0:30])
        
        # Handle different file formats
        if 'path length' in p or 'MAXIMUM GAP' in p:
            return base.read_colinearscan(self.collinearity)
        elif 'MATCH_SIZE' in p or '## Alignment' in p:
            return self.process_mcscanx(gff1, gff2)
        elif '# Alignment' in p:
            return base.read_collinearity(self.collinearity)
        elif '###' in p:
            return self.process_jcvi(gff1, gff2)
        raise ValueError(f"Unrecognized collinearity file format: {self.collinearity}")

    def process_mcscanx(self, gff1, gff2):
        """Process MCScanX format collinearity data."""
        col = base.read_mcscanx(self.collinearity)
        collinearity = []
        for block in col:
            newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]
            if newblock:
                for k in newblock:
                    k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']
                collinearity.append([block[0], newblock, block[2]])
        return collinearity

    def process_jcvi(self, gff1, gff2):
        """Process JCVI format collinearity data."""
        col = base.read_jcvi(self.collinearity)
        collinearity = []
        for block in col:
            newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]
            if newblock:
                for k in newblock:
                    k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']
                collinearity.append([block[0], newblock, block[2]])
        return collinearity


================================================
FILE: wgdi/block_ks.py
================================================
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base


class block_ks:
    def __init__(self, options):
        # Default parameters
        self.markersize = 0.8
        self.figsize = 'default'
        self.tandem_length = 200
        self.blockinfo_reverse = False
        self.tandem = False
        self.area = '0,3'
        self.position = 'order'
        self.ks_col = 'ks_NG86'
        self.pvalue = 0.01
        
        # Overriding default parameters with options
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        
        # Parsing area as a float list
        self.area = [float(k) for k in str(self.area).split(',')]
        self.markersize =  float(self.markersize)
        self.tandem_length =  int(self.tandem_length)
        
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
        # Convert the tandem flag; assigning to self.remove_tandem would shadow the method of that name.
        self.tandem = base.str_to_bool(self.tandem)

    def block_position(self, bkinfo, lens1, lens2, step1, step2):
        pos, pairs = [], []
        
        # Create mappings for chromosome positions
        dict_y_chr = dict(zip(lens1.index, np.append([0], lens1.cumsum()[:-1].values)))
        dict_x_chr = dict(zip(lens2.index, np.append([0], lens2.cumsum()[:-1].values)))
        
        # Iterate through block information
        for _, row in bkinfo.iterrows():
            block1 = row['block1'].split('_')
            block2 = row['block2'].split('_')
            ks = row['ks'].split('_')
            
            locy_median = (dict_y_chr[row['chr1']] + 0.5 * (row['end1'] + row['start1'])) * step1
            locx_median = (dict_x_chr[row['chr2']] + 0.5 * (row['end2'] + row['start2'])) * step2
            pos.append([locx_median, locy_median, row['ks_median']])
            
            # Ensure ks length matches block length
            if len(block1) != len(ks):
                ks = ks[1:]
                
            for i in range(len(block1)):
                locy = (dict_y_chr[row['chr1']] + float(block1[i])) * step1
                locx = (dict_x_chr[row['chr2']] + float(block2[i])) * step2
                pairs.append([locx, locy, float(ks[i])])
        
        return pos, pairs

    def remove_tandem(self, bkinfo):
        # Filter for same-chromosome blocks
        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
        
        # Calculate block start and end differences
        group['start'] = group['start1'] - group['start2']
        group['end'] = group['end1'] - group['end2']
        
        # Remove tandems based on threshold
        index = group[(group['start'].abs() <= self.tandem_length) |
                      (group['end'].abs() <= self.tandem_length)].index
        return bkinfo.drop(index)

    def run(self):
        # Initialize axis and chromosome lens
        axis = [0, 1, 1, 0]
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        
        # Parse figsize
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10
        
        # Calculate step sizes
        step1 = 1 / float(lens1.sum())
        step2 = 1 / float(lens2.sum())
        
        # Create figure and axes
        fig, ax = plt.subplots(figsize=self.figsize)
        plt.rcParams['ytick.major.pad'] = 0
        ax.xaxis.set_ticks_position('top')
        
        # Plot dotplot frame
        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,
                           self.genome1_name, self.genome2_name, [0, 1])
        
        # Load block information
        bkinfo = pd.read_csv(self.blockinfo)
        
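        # Handle reverse block information
        if self.blockinfo_reverse:
            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]
            bkinfo[['start1', 'start2']] = bkinfo[['start2', 'start1']]
            bkinfo[['end1', 'end2']] = bkinfo[['end2', 'end1']]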
        
        # Filter block information
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & 
                        (bkinfo['chr1'].isin(lens1.index)) & 
                        (bkinfo['chr2'].isin(lens2.index)) & 
                        (bkinfo['pvalue'] < float(self.pvalue))]
        
        # Remove tandem duplicates if required
        if not base.str_to_bool(self.tandem):
            bkinfo = self.remove_tandem(bkinfo)
        
        # Calculate positions and pairs
        pos, pairs = self.block_position(bkinfo, lens1, lens2, step1, step2)
        
        # Filter pairs by ks value
        df = pd.DataFrame(pairs, columns=['loc1', 'loc2', 'ks'])
        df = df[(df['ks'] >= self.area[0]) & (df['ks'] <= self.area[1])]
        df.drop_duplicates(inplace=True)
        
        # Plot scatter
        cm = plt.get_cmap('gist_rainbow')
        sc = plt.scatter(df['loc1'], df['loc2'], s=self.markersize, c=df['ks'],
                         alpha=0.9, edgecolors=None, linewidths=0, marker='o', 
                         vmin=self.area[0], vmax=self.area[1], cmap=cm)
        
        # Add colorbar
        cbar = fig.colorbar(sc, shrink=0.5, pad=0.03, fraction=0.1)
        align = dict(family='DejaVu Sans', style='normal',
                     horizontalalignment="center", verticalalignment="center")
        cbar.set_label('Ks', labelpad=12.5, fontsize=16, **align)
        
        # Set axis and save figure
        ax.axis(axis)
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.03)
        plt.savefig(self.savefig, dpi=500)
        plt.show()


================================================
FILE: wgdi/circos.py
================================================
import re
import sys

import matplotlib as mpl
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import wgdi.base as base


class circos():
    def __init__(self, options):
        self.figsize = '10,10'
        self.position = 'order'
        self.label_size = 9
        self.label_radius = 0.015
        self.column_names = [None]*100
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
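        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.ring_width = float(self.ring_width)
        # Values set from the config file are strings; convert the numeric ones.
        self.label_size = float(self.label_size)
        self.label_radius = float(self.label_radius)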
        if hasattr(self, 'legend_square'):
            self.legend_square = [float(k)
                                  for k in self.legend_square.split(',')]
        else:
            self.legend_square = 0.04, 0.04

    def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, linestyle='-'):
        for k in loc_chr:
            start, end = loc_chr[k]
            t = np.arange(start, end, 0.005)
            x, y = (radius) * np.cos(t), (radius) * np.sin(t)
            plt.plot(x, y, linestyle=linestyle,
                     color=color, lw=lw, alpha=alpha)

    def plot_labels(self, root, labels, loc_chr, radius, horizontalalignment="center", verticalalignment="center", fontsize=6,
                    color='black'):
        for k in loc_chr:
            loc = sum(loc_chr[k]) * 0.5
            x, y = radius * np.cos(loc), radius * np.sin(loc)
            self.Wedge(root, (x, y), self.label_radius, 0,
                       360, self.label_radius, 'white', 1)
            if 1 * np.pi < loc < 2 * np.pi:
                loc += np.pi
            plt.text(x, y, labels[k], horizontalalignment=horizontalalignment, verticalalignment=verticalalignment,
                     fontsize=fontsize, color=color, rotation=0)

    def Wedge(self, ax, loc, radius, start, end, width, color, alpha):
        p = mpatches.Wedge(loc, radius, start, end, width=width,
                           edgecolor=None, facecolor=color, alpha=alpha)
        ax.add_patch(p)

    def plot_bar(self, df, radius, length, lw, color, alpha):
        for k in df[df.columns[0]].drop_duplicates().values:
            if str(k) not in color.keys():
                color[str(k)] = 'black'
            if pd.isna(k) or k == '':
                continue
            df_chr = df.groupby(df.columns[0]).get_group(k)
            x1, y1 = radius * \
                np.cos(df_chr['rad']), radius * np.sin(df_chr['rad'])
            x2, y2 = (radius + length) * \
                np.cos(df_chr['rad']), (radius + length) * \
                np.sin(df_chr['rad'])
            x = np.array(
                [x1.values, x2.values, [np.nan] * x1.size]).flatten('F')
            y = np.array(
                [y1.values, y2.values, [np.nan] * x1.size]).flatten('F')
            plt.plot(x, y, linestyle='-',
                     color=color[str(k)], lw=lw, alpha=alpha)

    def chr_location(self, lens, angle_gap, angle):
        start, end, loc_chr = 0, 0.2*angle_gap, {}
        for k in lens.index:
            end += angle_gap + angle * (float(lens[k]))
            start = end - angle * (float(lens[k]))
            loc_chr[k] = [float(start), float(end)]
        return loc_chr

    def deal_alignment(self, alignment, gff, lens, loc_chr, angle):
        # Strip whitespace inside cells and blank out the '.' placeholder for missing genes.
        alignment.replace(r'\s+', '', regex=True, inplace=True)
        alignment.replace('.', '', inplace=True)
        newalignment = alignment.copy()
        for i in range(len(alignment.columns)):
            alignment[i] = alignment[i].astype(str)
            newalignment[i] = alignment[i].map(gff['chr'].to_dict())
        newalignment['loc'] = alignment[0].map(gff[self.position].to_dict())
        newalignment[0] = newalignment[0].astype('str')
        newalignment['loc'] = newalignment['loc'].astype('float')
        newalignment = newalignment[newalignment[0].isin(lens.index) == True]
        newalignment['rad'] = np.nan
        for name, group in newalignment.groupby(0):
            if str(name) not in loc_chr:
                continue
            newalignment.loc[group.index, 'rad'] = loc_chr[str(
                name)][0]+angle * group['loc']
        return newalignment

    def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al):
        # Strip whitespace inside cells and mark the '.' placeholder as missing.
        alignment.replace(r'\s+', '', regex=True, inplace=True)
        alignment.replace('.', np.nan, inplace=True)
        newalignment = pd.merge(alignment, gff, left_on=0, right_on=gff.index)
        newalignment['rad'] = np.nan
        for name, group in newalignment.groupby('chr'):
            if str(name) not in loc_chr:
                continue
            newalignment.loc[group.index, 'rad'] = loc_chr[str(
                name)][0]+angle * group[self.position]
        newalignment.index = newalignment[0]
        newalignment[0] = newalignment[0].map(newalignment['rad'].to_dict())
        data = []
        for index_al, row_al in al.iterrows():
            for k in alignment.columns[1:]:
                alignment[k] = alignment[k].astype(str)
                group = newalignment[(newalignment['chr'] == row_al['chr']) & (
                    newalignment['order'] >= row_al['start']) & (newalignment['order'] <= row_al['end'])].copy()
                group.loc[:, k] = group.loc[:, k].map(
                    newalignment['rad']).values
                group.dropna(subset=[k], inplace=True)
                group.index = group.index.map(newalignment['rad'].to_dict())
                group['color'] = row_al['color']
                group = group[group[k].notnull()]
                data += group[[0, k, 'color']].values.tolist()
        df = pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])
        return df

    def plot_collinearity(self, data, radius, lw=0.02, alpha=1):
        for name, group in data.groupby('color'):
            x, y = np.array([]), np.array([])
            for index, row in group.iterrows():
                ex1x, ex1y = radius * \
                    np.cos(row['loc1']), radius*np.sin(row['loc1'])
                ex2x, ex2y = radius * \
                    np.cos(row['loc2']), radius*np.sin(row['loc2'])
                ex3x, ex3y = radius * (1-abs(row['loc1']-row['loc2'])/np.pi) * np.cos((row['loc1']+row['loc2'])*0.5), radius * (
                    1-abs(row['loc1']-row['loc2'])/np.pi) * np.sin((row['loc1']+row['loc2'])*0.5)
                x1 = [ex1x, 0.5*ex3x, ex2x]
                y1 = [ex1y, 0.5*ex3y, ex2y]
                step = .002
                t = np.arange(0, 1+step, step)
                xt = base.Bezier3(x1, t)
                yt = base.Bezier3(y1, t)
                x = np.hstack((x, xt, np.nan))
                y = np.hstack((y, yt, np.nan))
            plt.plot(x, y, color=name, lw=lw, alpha=alpha)

    def plot_legend(self, ax, chr_color, width, height):
        (x1, x2) = ax.get_xlim()
        (y1, y2) = ax.get_ylim()
        a = 1000
        for k, v in enumerate(chr_color.keys(), 0):
            h = y1-k//a*height*2
            k = k % a
            if x1 + width * k > x2-width:
                a = k
                h = y1-k//a*height*2
                k = k % a
            loc = [x1 + width * k, h]
            base.Rectangle(ax, loc, height, width, chr_color[v], 1)
            plt.text(loc[0] + width*0.382, h-0.618*height, v, fontsize=12)
        ax.set_ylim(h-2*height, y2)

    def run(self):
        fig, ax = plt.subplots(figsize=self.figsize)
        mpl.rcParams['agg.path.chunksize'] = 100000000
        lens = base.newlens(self.lens, self.position)
        radius, angle_gap = float(self.radius), float(self.angle_gap)
        angle = (2 * np.pi - (int(len(lens))+1.5)
                 * angle_gap) / (int(lens.sum()))
        loc_chr = self.chr_location(lens, angle_gap, angle)
        list_colors = [str(k).strip() for k in re.split(',|:', self.colors)]
        chr_color = dict(zip(list_colors[::2], list_colors[1::2]))
        gff = base.newgff(self.gff)
        if hasattr(self, 'ancestor'):
            ancestor = pd.read_csv(self.ancestor, header=None)
            al = pd.read_csv(self.ancestor_location, sep='\t', header=None)
            al.rename(columns={0: 'chr', 1: 'start',
                               2: 'end', 3: 'color'}, inplace=True)
            al['chr'] = al['chr'].astype(str)
            data = self.deal_ancestor(ancestor, gff, lens, loc_chr, angle, al)
            self.plot_collinearity(data, radius, lw=0.1, alpha=0.8)

        if hasattr(self, 'alignment'):
            alignment = pd.read_csv(self.alignment, header=None)
            newalignment = self.deal_alignment(
                alignment, gff, lens, loc_chr, angle)
            if isinstance(self.column_names, str):
                names = [k.strip() for k in self.column_names.split(',')]
            else:
                names = []
            # Pad so every ring has an entry even when fewer names are given.
            names += [None] * len(newalignment.columns)
            n = 0
            align = dict(family='Arial', verticalalignment="center",
                         horizontalalignment="center")
            for k, v in enumerate(newalignment.columns[1:-2]):
                r = radius + self.ring_width*(k+1)
                self.plot_circle(loc_chr, r, lw=0.5, alpha=1, color='grey')
                self.plot_bar(newalignment[[v, 'rad']], r + self.ring_width *
                              0.15, self.ring_width*0.7, 0.15, chr_color, 1)
                if n % 2 == 0:
                    loc = 0.05
                    x, y = (r+self.ring_width*0.5) * \
                        np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)
                    plt.text(x, y, names[n], rotation=loc *
                             180 / np.pi, fontsize=self.label_size, **align)
                else:
                    loc = -0.08
                    x, y = (r+self.ring_width*0.5) * \
                        np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)
                    plt.text(x, y, names[n], fontsize=self.label_size,
                             rotation=loc * 180 / np.pi, **align)
                n += 1
        if hasattr(self, 'ancestor'):
            colors = al['color'].drop_duplicates().values.tolist()
            ancestor_chr_color = dict(zip(range(1, len(colors)+1), colors))
            self.plot_legend(ax, ancestor_chr_color,
                             self.legend_square[0], self.legend_square[1])
        if hasattr(self, 'alignment'):
            # Drop the placeholder colour if present; pop() avoids a KeyError when it is absent
            chr_color.pop('nan', None)
            self.plot_legend(
                ax, chr_color, self.legend_square[0], self.legend_square[1])
        labels = self.chr_label + lens.index
        labels = dict(zip(lens.index, labels))
        self.plot_labels(ax, labels, loc_chr, radius +
                         self.ring_width*0.3, fontsize=self.label_size)

        plt.axis('off')
        a = (ax.get_ylim()[1]-ax.get_ylim()[0]) / \
            (ax.get_xlim()[1]-ax.get_xlim()[0])
        fig.set_size_inches(self.figsize[0], self.figsize[0]*a, forward=True)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)


================================================
FILE: wgdi/collinearity.py
================================================
import numpy as np
import pandas as pd


class collinearity:
    def __init__(self, options, points):
        # Default values
        self.gap_penalty = -1
        self.over_length = 0
        self.mg1 = 40
        self.mg2 = 40
        self.pvalue = 1
        self.over_gap = 3
        self.points = points
        self.p_value = 0
        self.coverage_ratio = 0.8
        
        # Set user-defined options
        for k, v in options:
            setattr(self, str(k), v)

        # Initialize grading and mg values
        self.grading = [50, 40, 25] if not hasattr(self, 'grading') else [int(k) for k in self.grading.split(',')]
        self.mg1, self.mg2 = [40, 40] if not hasattr(self, 'mg') else [int(k) for k in self.mg.split(',')]

        # Convert string values to floats
        self.pvalue = float(self.pvalue)
        self.coverage_ratio = float(self.coverage_ratio)

    def get_matrix(self):
        """Initialize the matrix for the collinearity points."""
        self.points['usedtimes1'] = 0
        self.points['usedtimes2'] = 0
        self.points['times'] = 1
        self.points['score1'] = self.points['grading']
        self.points['score2'] = self.points['grading']
        self.points['path1'] = self.points.index.to_numpy().reshape(len(self.points), 1).tolist()
        self.points['path2'] = self.points['path1']
        self.points_init = self.points.copy()
        self.mat_points = self.points

    def run(self):
        """Run the main collinearity processing."""
        self.get_matrix()
        self.score_matrix()
        data = []

        # Process points for maxPath in the positive direction
        points1 = self.points[['loc1', 'loc2', 'score1', 'path1', 'usedtimes1']].sort_values(by=['score1'], ascending=False)
        points1.drop(index=points1[points1['usedtimes1'] < 1].index, inplace=True)
        points1.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']
        
        while (self.over_length >= self.over_gap or len(points1) >= self.over_gap):
            if self.max_path(points1):
                if self.p_value > self.pvalue:
                    continue
                data.append([self.path, self.p_value, self.score])

        # Process points for maxPath in the negative direction
        points2 = self.points[['loc1', 'loc2', 'score2', 'path2', 'usedtimes2']].sort_values(by=['score2'], ascending=False)
        points2.drop(index=points2[points2['usedtimes2'] < 1].index, inplace=True)
        points2.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']

        while (self.over_length >= self.over_gap) or (len(points2) >= self.over_gap):
            if self.max_path(points2):
                if self.p_value > self.pvalue:
                    continue
                data.append([self.path, self.p_value, self.score])

        return data

    def score_matrix(self):
        """Calculate the scoring matrix for the points."""
        for index, row, col in self.points[['loc1', 'loc2']].itertuples():
            # Get points within a certain range
            points = self.points[(self.points['loc1'] > row) & 
                                 (self.points['loc2'] > col) & 
                                 (self.points['loc1'] < row + self.mg1) & 
                                 (self.points['loc2'] < col + self.mg2)]
            
            row_i_old, gap = row, self.mg2
            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():
                if col_j - col > gap and row_i > row_i_old:
                    break
                score = grading + (row_i - row + col_j - col) * self.gap_penalty
                score1 = score + self.points.at[index, 'score1']
                if score > 0 and self.points.at[index_ij, 'score1'] < score1:
                    self.points.at[index_ij, 'score1'] = score1
                    self.points.at[index, 'usedtimes1'] += 1
                    self.points.at[index_ij, 'usedtimes1'] += 1
                    self.points.at[index_ij, 'path1'] = self.points.at[index, 'path1'] + [index_ij]
                    gap = min(col_j - col, gap)
                    row_i_old = row_i

        # Reverse processing to handle negative direction
        points_reverse = self.points.sort_values(by=['loc1', 'loc2'], ascending=[False, True])
        for index, row, col in points_reverse[['loc1', 'loc2']].itertuples():
            points = points_reverse[(points_reverse['loc1'] < row) & 
                                    (points_reverse['loc2'] > col) & 
                                    (points_reverse['loc1'] > row - self.mg1) & 
                                    (points_reverse['loc2'] < col + self.mg2)]
            
            row_i_old, gap = row, self.mg2
            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():
                if col_j - col > gap and row_i < row_i_old:
                    break
                score = grading + (row - row_i + col_j - col) * self.gap_penalty
                score2 = score + self.points.at[index, 'score2']
                if score > 0 and self.points.at[index_ij, 'score2'] < score2:
                    self.points.at[index_ij, 'score2'] = score2
                    self.points.at[index, 'usedtimes2'] += 1
                    self.points.at[index_ij, 'usedtimes2'] += 1
                    self.points.at[index_ij, 'path2'] = self.points.at[index, 'path2'] + [index_ij]
                    gap = min(col_j - col, gap)
                    row_i_old = row_i

    def max_path(self, points):
        """Find the maximum path for the given points."""
        if len(points) == 0:
            self.over_length = 0
            return False
        
        # Initialize path score and index
        self.score, self.path_index = points.loc[points.index[0], ['score', 'path']]
        self.path = points[points.index.isin(self.path_index)]
        self.over_length = len(self.path_index)
        
        # Check if the block overlaps with other blocks
        if self.over_length >= self.over_gap and len(self.path) / self.over_length > self.coverage_ratio:
            points.drop(index=self.path.index, inplace=True)
            [loc1_min, loc2_min], [loc1_max, loc2_max] = self.path[['loc1', 'loc2']].agg(['min', 'max']).to_numpy()

            # Calculate p-value
            gap_init = self.points_init[(loc1_min <= self.points_init['loc1']) & 
                                        (self.points_init['loc1'] <= loc1_max) & 
                                        (loc2_min <= self.points_init['loc2']) & 
                                        (self.points_init['loc2'] <= loc2_max)].copy()
            
            self.p_value = self.p_value_estimated(gap_init, loc1_max - loc1_min + 1, loc2_max - loc2_min + 1)
            self.path = self.path.sort_values(by=['loc1'], ascending=[True])[['loc1', 'loc2']]
            return True
        else:
            points.drop(index=points.index[0], inplace=True)
        return False

    def p_value_estimated(self, gap, L1, L2):
        """Estimate p-value based on the given gap and lengths."""
        N1 = gap['times'].sum()
        N = len(gap)
        self.points_init.loc[gap.index, 'times'] += 1
        m = len(self.path)
        a = (1 - self.score / m / self.grading[0]) * (N1 - m + 1) / N * (L1 - m + 1) * (L2 - m + 1) / L1 / L2
        return round(a, 4)
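

# Illustrative sketch, not part of the original module. It restates two
# formulas used by the class above so they can be checked by hand: the
# per-step chaining score from `score_matrix` (forward direction) and the
# expression evaluated in `p_value_estimated`. The names `chain_step_score`
# and `block_p_value` are hypothetical and exist only for this example; the
# arithmetic is copied from the methods above.
def chain_step_score(row, col, row_i, col_j, grading, gap_penalty=-1):
    """Score gained by linking the dot (row, col) to the dot (row_i, col_j)."""
    return grading + (row_i - row + col_j - col) * gap_penalty


def block_p_value(score, m, top_grading, n1, n, l1, l2):
    """Same expression as `p_value_estimated`, with every input passed explicitly.

    score: block score, m: number of dots on the path, top_grading: grading[0],
    n1: summed 'times' of the dots inside the bounding box, n: number of such
    dots, l1/l2: side lengths of the bounding box.
    """
    a = (1 - score / m / top_grading) * (n1 - m + 1) / n * (l1 - m + 1) * (l2 - m + 1) / l1 / l2
    return round(a, 4)


if __name__ == '__main__':
    # An adjacent neighbour with grading 50 gains 50 - 2 = 48.
    print(chain_step_score(10, 10, 11, 11, 50))
    # A neighbour 20 genes away on both axes with grading 25 scores -15,
    # so the `score > 0` test in `score_matrix` would reject the link.
    print(chain_step_score(10, 10, 30, 30, 25))
    # A dense 10-dot block in a 12 x 12 box gives a small estimate.
    print(block_p_value(480, 10, 50, 12, 12, 12, 12))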


================================================
FILE: wgdi/dotplot.py
================================================
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import wgdi.base as base


class dotplot():
    def __init__(self, options):
        self.multiple = 1
        self.score = 100
        self.evalue = 1e-5
        self.repeat_number = 20
        self.markersize = 0.5
        self.figsize = 'default'
        self.position = 'order'
        self.ancestor_top = None
        self.ancestor_left = None
        self.blast_reverse = False
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        if self.ancestor_top == 'none' or self.ancestor_top == '':
            self.ancestor_top = None
        if self.ancestor_left == 'none' or self.ancestor_left == '':
            self.ancestor_left = None
        # Keep the converted value; the bare call discarded the result
        self.blast_reverse = base.str_to_bool(self.blast_reverse)

    def pair_positon(self, blast, gff1, gff2, rednum, repeat_number):
        blast['color'] = ''
        blast['loc1'] = blast[0].map(gff1['loc'])
        blast['loc2'] = blast[1].map(gff2['loc'])
        bluenum = 5+rednum
        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()
                 for _, group in blast.groupby(0)]
        reddata = np.array([k[:rednum] for k in index], dtype=object)
        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)
        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)
        if len(reddata):
            redindex = np.concatenate(reddata)
        else:
            redindex = []
        if len(bluedata):
            blueindex = np.concatenate(bluedata)
        else:
            blueindex = []
        if len(graydata):
            grayindex = np.concatenate(graydata)
        else:
            grayindex = []
        blast.loc[redindex, 'color'] = 'red'
        blast.loc[blueindex, 'color'] = 'blue'
        blast.loc[grayindex, 'color'] = 'gray'
        return blast[blast['color'].str.contains(r'\w')]

    def run(self):
        axis = [0, 1, 1, 0]
        left, right, top, bottom = 0.07, 0.97, 0.93, 0.03
        lens1 = base.newlens(self.lens1, self.position)
        lens2 = base.newlens(self.lens2, self.position)
        step1 = 1 / float(lens1.sum())
        step2 = 1 / float(lens2.sum())
        if self.ancestor_left is not None:
            axis[0] = -0.02
            lens_ancestor_left = pd.read_csv(
                self.ancestor_left, sep="\t", header=None)
            lens_ancestor_left[0] = lens_ancestor_left[0].astype(str)
            lens_ancestor_left[3] = lens_ancestor_left[3].astype(str)
            lens_ancestor_left[4] = lens_ancestor_left[4].astype(int)
            lens_ancestor_left[4] = lens_ancestor_left[4] / lens_ancestor_left[4].max()
            lens_ancestor_left = lens_ancestor_left[lens_ancestor_left[0].isin(
                lens1.index)]
        if self.ancestor_top is not None:
            axis[3] = -0.02
            lens_ancestor_top = pd.read_csv(
                self.ancestor_top, sep="\t", header=None)
            lens_ancestor_top[0] = lens_ancestor_top[0].astype(str)
            lens_ancestor_top[3] = lens_ancestor_top[3].astype(str)
            lens_ancestor_top[4] = lens_ancestor_top[4].astype(int)
            lens_ancestor_top[4] = lens_ancestor_top[4] / lens_ancestor_top[4].max()
            lens_ancestor_top = lens_ancestor_top[lens_ancestor_top[0].isin(
                lens2.index)]
        if re.search(r'\d', self.figsize):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = np.array(
                [1, float(lens1.sum())/float(lens2.sum())])*10
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        ax.xaxis.set_ticks_position('top')
        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,
                           self.genome1_name, self.genome2_name, [axis[0], axis[3]])
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        gff1 = base.gene_location(gff1, lens1, step1, self.position)
        gff2 = base.gene_location(gff2, lens2, step2, self.position)
        if self.ancestor_top is not None:
            self.aree_left = self.ancestor_posion(ax, gff2, lens_ancestor_top, 'top')
        if self.ancestor_left is not None:
            self.aree_top = self.ancestor_posion(ax, gff1, lens_ancestor_left, 'left')
        print('read gffs')
        blast = base.newblast(self.blast, int(self.score),
                              float(self.evalue), gff1, gff2, self.blast_reverse)
        if len(blast) == 0:
            print('Stopped!\n\nThe gene IDs in the blast file do not match those in gff1 and gff2.')
            exit(0)
        print('read blast')
        df = self.pair_positon(blast, gff1, gff2,
                               int(self.multiple), int(self.repeat_number))
        print('deal blast')
        ax.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'],
                   alpha=0.5, edgecolors=None, linewidths=0, marker='o')
        ax.axis(axis)
        plt.subplots_adjust(left=left, right=right, top=top, bottom=bottom)
        plt.savefig(self.savefig, dpi=300)
        plt.show()

    def ancestor_posion(self, ax, gff, lens, mark):
        data = []
        for index, row in lens.iterrows():
            loc1 = gff[(gff['chr'] == row[0]) & (
                gff['order'] == int(row[1]))].index
            loc2 = gff[(gff['chr'] == row[0]) & (
                gff['order'] == int(row[2])-1)].index
            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']
            if mark == 'top':
                width = abs(loc1-loc2)
                loc = [min(loc1, loc2), 0]
                height = -0.02
                base.Rectangle(ax, loc, height, width, row[3], row[4])
            if mark == 'left':
                height = abs(loc1-loc2)
                loc = [-0.02, min(loc1, loc2), ]
                width = 0.02
                base.Rectangle(ax, loc, height, width, row[3], row[4])
            data.append([loc, height, width, row[3], row[4]])
        return data
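

# Illustrative sketch, not part of the original module. `pair_positon` colours
# the hits of each query gene by rank after sorting them by score: the best
# `rednum` hits are red, the next five are blue, and the remainder up to
# `repeat_number` are gray. The helper below reproduces that ranking rule for
# a single gene without pandas. `tier_colors` is a hypothetical name used only
# for this example.
def tier_colors(n_hits, rednum=1, repeat_number=20):
    """Return the colour assigned to each of a gene's hits, best hit first."""
    bluenum = 5 + rednum
    colors = []
    # Only the first `repeat_number` hits are kept, as in `pair_positon`.
    for rank in range(min(n_hits, repeat_number)):
        if rank < rednum:
            colors.append('red')
        elif rank < bluenum:
            colors.append('blue')
        else:
            colors.append('gray')
    return colors


if __name__ == '__main__':
    # Eight hits with the default settings: 1 red, 5 blue, 2 gray.
    print(tier_colors(8))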


================================================
FILE: wgdi/example/__init__.py
================================================


================================================
FILE: wgdi/example/align.conf
================================================
[alignment]
blockinfo = block information file (.csv)
blockinfo_reverse = false
classid =  class1
gff1 =  gff1 file
gff2 =  gff2 file
lens1 = lens1 file
lens2 = lens2 file
genome1_name =  Genome1 name
genome2_name =  Genome2 name
markersize = 0.5
ks_area = -1,3
position = order
colors = red,blue,green
figsize = 10,10
savefile = savefile(.csv)
savefig= save image(.png, .pdf, .svg)

================================================
FILE: wgdi/example/alignmenttrees.conf
================================================
[alignmenttrees]
alignment = alignment file (.csv)
gff = gff file (reference genome; delete this line if the alignment has no reference species)
lens = lens file (delete this line if the alignment has no reference species)
dir = output folder
sequence_file = sequence file (.fa)
cds_file = cds file (.fa)
codon_positon = 1,2,3  (1,2 means codon positions 1 and 2; 1,2,3 means no codon position is removed)
trees_file =  trees (.nwk)
align_software = (mafft,muscle)
tree_software =  (iqtree,fasttree)
threads = 1 (Number,AUTO)
model = MFP
trimming =  (trimal,divvier)
minimum = 4
delete_detail = true


================================================
FILE: wgdi/example/ancestral_karyotype.conf
================================================
[ancestral_karyotype]
gff = gff file (cat the relevant 'gff' files into a file)
pep_file = pep file (cat the relevant 'pep.fa' files into a file)
ancestor = ancestor file (you must provide this file)
mark = aak 
ancestor_gff =  result file
ancestor_lens =  result file
ancestor_pep =  result file
ancestor_file =  result file

================================================
FILE: wgdi/example/ancestral_karyotype_repertoire.conf
================================================
[ancestral_karyotype_repertoire]
blockinfo =  block information (*.csv)
# blockinfo: processed *.csv
blockinfo_reverse =  False
gff1 =  gff1 file (ancestor's gff)
gff2 =  gff2 file (the other species's gff)
gap = 5
mark = aak1s
ancestor = ancestor file 
#current ancestor file
ancestor_new =  result file
ancestor_pep =  ancestor pep file 
#cat all pep files together
ancestor_pep_new =  result file
ancestor_gff =  result file
ancestor_lens =  result file


================================================
FILE: wgdi/example/blockinfo.conf
================================================
[blockinfo]
blast = blast file
gff1 =  gff1 file
gff2 =  gff2 file
lens1 = lens1 file
lens2 = lens2 file
collinearity = collinearity file
score = 100
evalue = 1e-5
repeat_number = 20
position = order
ks = ks file
ks_col = ks_NG86
savefile = block information (*.csv)


================================================
FILE: wgdi/example/blockks.conf
================================================
[blockks]
lens1 = lens1 file
lens2 = lens2 file
genome1_name =  Genome1 name
genome2_name =  Genome2 name
blockinfo = block information (*.csv)
pvalue = 0.2
tandem = true
tandem_length = 200
markersize = 1
area = 0,2
block_length =  minimum length
figsize = 8,8
savefig = save image(.png, .pdf, .svg)


================================================
FILE: wgdi/example/circos.conf
================================================
[circos]
gff =  gff file
lens =  lens file
radius = 0.2
angle_gap = 0.05
ring_width = 0.015
colors  = 1:c,2:m,3:blue,4:gold,5:red,6:lawngreen,7:darkgreen,8:k,9:darkred,10:gray
alignment = alignment file 
chr_label = chr
ancestor = ancestor alignment file 
ancestor_location = ancestor file 
figsize = 10,10
label_size = 9
position = order
legend_square = 0.04, 0.04
column_names = 1,2,3,4,5
savefig = result(.png, .pdf, .svg)


================================================
FILE: wgdi/example/collinearity.conf
================================================
[collinearity]
gff1 = gff1 file
gff2 = gff2 file
lens1 = lens1 file
lens2 = lens2 file
blast = blast file
blast_reverse = false
comparison = genomes
multiple  = 1
process = 8
evalue = 1e-5
score = 100
grading = 50,30,25
mg = 25,25
pvalue = 1
repeat_number = 20
position = order
savefile = collinearity file


================================================
FILE: wgdi/example/conf.ini
================================================
[ini]
mafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft
pal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2nal.pl
yn00_path = /home/sunpc/micromamba/envs/wgdi/bin/yn00
muscle_path = /home/sunpc/micromamba/envs/wgdi/bin/muscle
iqtree_path =  /home/sunpc/micromamba/envs/wgdi/bin/iqtree
trimal_path = /home/sunpc/micromamba/envs/wgdi/bin/trimal
fasttree_path = /home/sunpc/micromamba/envs/wgdi/bin/fasttree
divvier_path = /home/sunpc/micromamba/envs/wgdi/bin/divvier


================================================
FILE: wgdi/example/corr.conf
================================================
[correspondence]
blockinfo =  blockinfo file(.csv) 
lens1 = lens1 file
lens2 = lens2 file
tandem = true
tandem_length = 200
pvalue = 0.2
block_length = 5
tandem_ratio = 0.5
multiple  = 1
homo = -1,1
savefile = savefile(.csv)


================================================
FILE: wgdi/example/dotplot.conf
================================================
[dotplot]
blast = blast file
gff1 =  gff1 file
gff2 =  gff2 file
lens1 = lens1 file
lens2 = lens2 file
genome1_name =  Genome1 name
genome2_name =  Genome2 name
multiple  = 1
score = 100
evalue = 1e-5
repeat_number = 10
position = order
blast_reverse = false
ancestor_left = ancestor file or none
ancestor_top = ancestor file or none
markersize = 0.5
figsize = 10,10
savefig = savefile(.png, .pdf, .svg)


================================================
FILE: wgdi/example/fusion_positions_database.conf
================================================
[fusion_positions_database]
pep = pep file
gff = gff file
fusion_positions = fusion_positions file
# Number of gene sets on each side of the breakpoint
ancestor_gff =  result file
ancestor_lens =  result file
ancestor_pep =  result file
ancestor_file =  result file


================================================
FILE: wgdi/example/fusions_detection.conf
================================================
[fusions_detection]
blockinfo = block information (*.csv)
ancestor = ancestor file
#The number of genes spanned by a synteny block on both sides of a breakpoint.
min_genes_per_side = 5
density = 0.3
filtered_blockinfo = result blockinfo (.csv)


================================================
FILE: wgdi/example/karyotype.conf
================================================
[karyotype]
ancestor = ancestor chromosome file
width = 0.5
figsize = 10,6.18
savefig = save image(.png, .pdf, .svg)

================================================
FILE: wgdi/example/karyotype_mapping.conf
================================================
[karyotype_mapping]
blast = blast file
blast_reverse = false
gff1 = gff1 file
gff2 = gff2 file 
score = 100
evalue = 1e-5
repeat_number = 5
ancestor_left = ancestor location file (Only one of ('left', 'top') can be reserved)
ancestor_top = ancestor location file
the_other_lens = the other lens file
blockinfo = block information (*.csv)
blockinfo_reverse = false
limit_length = 5
the_other_ancestor_file =  result file 

================================================
FILE: wgdi/example/ks.conf
================================================
[ks]
cds_file = 	cds file 
#cat all cds files together
pep_file = 	pep file
#cat all pep files together
align_software = muscle
pairs_file = gene pairs file
ks_file = ks result

================================================
FILE: wgdi/example/ks_fit_result.csv
================================================
,color,linewidth,linestyle,,,,,,
csa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862
vvi_vvi,blue,2,-,3.00367275,1.288717936,0.177816426,,,
vvi_oin_gamma,orange,2,-,1.910418336,1.328469514,0.262257112,,,
vvi_oin,orange,2,--,4.948194212,0.882608858,0.10426873,,,
vvi_csa,green,2,--,2.470770292464022,1.4131842495219498,0.21391959288821544,,,


================================================
FILE: wgdi/example/ksfigure.conf
================================================
[ksfigure]
ksfit = ksfit result(*.csv)
labelfontsize = 15
legendfontsize = 15
xlabel = none            
ylabel = none            
title = none
area = 0,2
figsize = 10,6.18
shadow = true (true/false)
savefig =  save image(.png, .pdf, .svg)


================================================
FILE: wgdi/example/kspeaks.conf
================================================
[kspeaks]
blockinfo = block information (*.csv)
pvalue = 0.2
tandem = true
block_length = int number
ks_area = 0,10
multiple  = 1
homo = 0,1
fontsize = 9
area = 0,3
figsize = 10,6.18
savefig = saving image(.png,.pdf)
savefile = ks median savefile


================================================
FILE: wgdi/example/peaksfit.conf
================================================
[peaksfit]
blockinfo = block information (*.csv)
mode = median
bins_number = 200
ks_area = 0,10
fontsize = 9
area = 0,3
figsize = 10,6.18
shadow = true 
savefig = saving image(.png,.pdf,.svg)

================================================
FILE: wgdi/example/pindex.conf
================================================
[pindex]
alignment = alignment file (.csv)
gff = gff file
lens = lens file
gap = 50
retention = 0.05
diff = 0.05
remove_delta = (true/false)
savefile = result file(.csv)


================================================
FILE: wgdi/example/polyploidy_classification.conf
================================================
[polyploidy classification]
blockinfo = block information (*.csv)
ancestor_left = ancestor file
ancestor_top = ancestor file
classid = class1,class2
same_protochromosome =  False
same_subgenome =  False
savefile = result file(.csv)

================================================
FILE: wgdi/example/retain.conf
================================================
[retain]
alignment = alignment file
gff = gff file
lens = lens file
colors = red,blue,green
refgenome = shorthand
figsize = 10,12
step = 50
ylabel = y label
savefile = retain file (result)
savefig = result(.png, .pdf, .svg)


================================================
FILE: wgdi/example/shared_fusion.conf
================================================
[shared_fusion]
blockinfo = block information (*.csv)
# The new lens file is the output filtered by lens file.
lens1 = lens file, new lens file
lens2 =  lens file,  new lens file
ancestor_left = ancestor file
ancestor_top = ancestor file
classid = class1,class2
limit_length = 5
filtered_blockinfo = result blockinfo (.csv)

================================================
FILE: wgdi/fusion_positions_database.py
================================================
import pandas as pd
import os
from Bio import SeqIO

class fusion_positions_database:
    def __init__(self, options):
        for k, v in options:
            setattr(self, k, v)
            print(f'{k} = {v}')

    def run(self):
        # Load and remove duplicates from data
        gff = pd.read_csv(self.gff, sep="\t", header=None, dtype={0: str, 5: int}).drop_duplicates()
        pep = SeqIO.to_dict(SeqIO.parse(self.pep, "fasta"))
        df = pd.read_csv(self.fusion_positions, sep="\t", header=None, dtype={0: str, 1: int, 2:int, 3:str}).drop_duplicates()
        
        # Load ancestral sequence file if it exists
        seqs = SeqIO.to_dict(SeqIO.parse(self.ancestor_pep, "fasta")) if os.path.exists(self.ancestor_pep) else {}

        sf_gff, sf_lens = [], []

        # Process fusion positions
        for _, row in df.iterrows():
            newchr = row[3]
            newgff = gff[(gff[0] == row[0]) &
                         (gff[5] >= row[1] - row[2]) &
                         (gff[5] < row[1] + row[2])].copy()
            # Skip positions with no genes in the window; iloc[0] below would fail
            if newgff.empty:
                continue
            newgff['id'] = [f"{newchr}s{str(row[0]).zfill(2)}g{str(i).zfill(3)}" for i in range(1, len(newgff) + 1)]

            sf_position = row[1] - newgff.iloc[0, 5]
            sf_lens.append([newchr, sf_position, len(newgff)])
            
            # For each gene in the filtered GFF region
            for _, gff_row in newgff.iterrows():
                if gff_row[1] in pep and gff_row['id'] not in seqs:
                    gene = pep[gff_row[1]][:]
                    gene.id, gene.description = gff_row['id'], ''
                    seqs[gff_row['id']] = gene
                    # Collect data for the final GFF output
                    sf_gff.append([gff_row['id'], newchr, sf_position, gff_row[2], gff_row[3], gff_row[4], gff_row[1]])

        # Write sequences to FASTA file
        SeqIO.write(seqs.values(), self.ancestor_pep, 'fasta')

        # Save filtered GFF data
        if sf_gff:
            sf_gff = pd.DataFrame(sf_gff)
            sf_gff.rename(columns={3: 'start', 4: 'end', 5: 'strand'}, inplace=True)
            sf_gff['order'] = sf_gff[0].str[-3:].astype(int)
            sf_gff[[1, 0, 'start', 'end', 'strand', 'order', 6]].to_csv(self.ancestor_gff, sep="\t", mode='a', index=False, header=None)
            sf_lens = pd.DataFrame(sf_lens).drop_duplicates()
            sf_lens.to_csv(self.ancestor_lens, sep="\t", mode='a', index=False, header=None)

            # Generate ancestral sequence data
            ancestor = []
            for _, row in sf_lens.iterrows():
                ancestor.append([row[0], 1, row[1], 'red', 1])
                ancestor.append([row[0], row[1] + 1, row[2], 'blue', 1])
            pd.DataFrame(ancestor).to_csv(self.ancestor_file, sep="\t", mode='a', index=False, header=None)

        # Remove duplicate rows from the tab-separated output files
        for file in [self.ancestor_gff, self.ancestor_lens, self.ancestor_file]:
            if os.path.exists(file):
                pd.read_csv(file, sep="\t", header=None, dtype=str).drop_duplicates().to_csv(
                    file, sep="\t", index=False, header=None)


================================================
FILE: wgdi/fusions_detection.py
================================================
import pandas as pd
from tabulate import tabulate

class fusions_detection:
    def __init__(self, options):
        self.min_genes_per_side = 5
        self.density = 0.3
        for k, v in options:
            setattr(self, k, v)
            print(f"{k} = {v}")
        self.min_genes_per_side = int(self.min_genes_per_side)
        self.density = float(self.density)

    def run(self):
        # Load the ancestor file and process the positions
        ancestor = pd.read_csv(self.ancestor, sep='\t', header=None)
        position = ancestor.groupby(0)[2].unique().apply(pd.Series)
        bkinfo = pd.read_csv(self.blockinfo)
        newbkinfo = bkinfo.head(0)
        
        # Iterate over each row in the position dataframe
        for index, row in position.iterrows():
            # Filter the bkinfo dataframe based on chr2 and density
            filtered_group = bkinfo[(bkinfo['chr2'] == index) & (bkinfo['density2'] >= self.density)].copy()
            # Split the block2 column and stack the resulting series
            df = filtered_group['block2'].str.split('_', expand=True).stack().astype(int)
            # Count the number of genes greater and less than the current position
            filtered_group['greater'] = (df > row[0]).groupby(level=0).sum()
            filtered_group['less'] = (df < row[0]).groupby(level=0).sum()
            # Filter the group based on the minimum number of genes per side
            filtered_group = filtered_group[(filtered_group['greater'] >= self.min_genes_per_side) & (filtered_group['less'] >= self.min_genes_per_side)]
            # Concatenate the filtered group with the newbkinfo dataframe
            newbkinfo = pd.concat([newbkinfo, filtered_group])
        if len(newbkinfo) ==0:
            print("\nNo shared fusion breakpoints detected")
            exit(0)

        # Get and print the shared fusion positions
        newbkinfo.to_csv(self.filtered_blockinfo, header=True, index=False)
        non_overlap_counts = newbkinfo.groupby('chr2').apply(self.count_non_overlapping)
        data = [(chr2, count) for chr2, count in non_overlap_counts.items()]
        print("\nThe following are the shared fusion breakpoints and counts:")
        print(tabulate(data, headers=["Fusion Breakpoint", "Count"], tablefmt="github"))

    def count_non_overlapping(self, group):
        if len(group) == 1:
            return 1
        grouped = group.groupby('chr1')
        total_count = 0
        for chr1, chr_group in grouped:
            chr_group = chr_group.sort_values(by='start1').reset_index(drop=True)
            count = 0
            current_end = -1 
            for _, row in chr_group.iterrows():
                start1, end1 = row['start1'], row['end1']
                if start1 > current_end:
                    count += 1
                    current_end = end1 
            total_count += count
        return total_count

================================================
FILE: wgdi/karyotype.py
================================================
import matplotlib.pyplot as plt
import pandas as pd

import wgdi.base as base


class karyotype():
    def __init__(self, options):
        self.width = 0.5
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        if hasattr(self, 'figsize'):
            self.figsize = [float(k) for k in self.figsize.split(',')]
        else:
            self.figsize = 10, 6.18
        # self.width always exists (default set above); only coerce config strings to float
        self.width = float(self.width)

    def run(self):
        fig, ax = plt.subplots(figsize=self.figsize)
        ancestor_lens = pd.read_csv(
            self.ancestor, sep="\t", header=None)
        ancestor_lens[0] = ancestor_lens[0].astype(str)
        ancestor_lens[3] = ancestor_lens[3].astype(str)
        ancestor_lens[4] = ancestor_lens[4].astype(int)
        ancestor_lens[4] = ancestor_lens[4] / ancestor_lens[4].max()
        chrs = ancestor_lens[0].drop_duplicates().to_list()
        ax.bar(chrs, 10, color='white', alpha=0)
        for index, row in ancestor_lens.iterrows():
            base.Rectangle(ax, [chrs.index(row[0])-self.width*0.5,
                                row[1]], row[2]-row[1], self.width, row[3], row[4])
        ax.tick_params(labelsize=15)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_visible(False)
        ax.spines['bottom'].set_visible(False)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.savefig(self.savefig, dpi=500)
        plt.show()


================================================
FILE: wgdi/karyotype_mapping.py
================================================
import numpy as np
import pandas as pd

import wgdi.base as base


class karyotype_mapping:
    def __init__(self, options):
        # Initialize default attributes
        self.blast_reverse = False
        self.blockinfo_reverse = False
        self.position = 'order'
        self.block_length = 5
        self.limit_length = 5
        self.repeat_number = 20
        self.score = 100
        self.evalue = 1e-5

        # Update attributes with provided keyword arguments and print them
        for k, v in options:
            setattr(self, k, v)
            print(f"{k} = {v}")
        
        self.blast_reverse = base.str_to_bool(self.blast_reverse)
        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)
        self.limit_length = int(self.limit_length)

    def karyotype_left(self, pairs, ancestor, gff1, gff2):
        # Loop through each row in ancestor to set color and classification in gff1
        for _, row in ancestor.iterrows():
            loc_min, loc_max = sorted([row[1], row[2]])
            index1 = gff1[(gff1['chr'] == row[0]) &
                          (gff1['order'] >= loc_min) &
                          (gff1['order'] <= loc_max)].index
            gff1.loc[index1, ['color', 'classification']] = row[3], row[4]

        # Merge pairs with gff1 and update gff2 with color and classification
        data = pd.merge(pairs, gff1, left_on=0, right_index=True, how='left')
        data.drop_duplicates(subset=[1], inplace=True)
        data.set_index(1, inplace=True)
        gff2.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]
        return gff2

    def karyotype_top(self, pairs, ancestor, gff1, gff2):
        # Loop through each row in ancestor to set color and classification in gff2
        for _, row in ancestor.iterrows():
            loc_min, loc_max = sorted([row[1], row[2]])
            index1 = gff2[(gff2['chr'] == row[0]) &
                          (gff2['order'] >= loc_min) &
                          (gff2['order'] <= loc_max)].index
            gff2.loc[index1, ['color', 'classification']] = row[3], row[4]

        # Merge pairs with gff2 and update gff1 with color and classification
        data = pd.merge(pairs, gff2, left_on=1, right_index=True, how='left')
        data.drop_duplicates(subset=[0], inplace=True)
        data.set_index(0, inplace=True)
        gff1.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]
        return gff1

    def karyotype_map(self, gff, lens):
        # Filter gff based on lens index and non-null color
        gff = gff[gff['chr'].isin(lens.index) & gff['color'].notnull()]
        ancestor = []
        # Group by chromosome and process each group to create ancestor records
        for chr, group in gff.groupby('chr'):
            color, class_id, arr = '', 1, []
            for _, row in group.iterrows():
                if color ==  row['color'] and class_id == row['classification']:
                    arr.append(row['order'])
                else:
                    if len(arr) >= self.limit_length:
                        ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])
                    color, class_id = row['color'], row['classification']
                    arr = []
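                    if len(ancestor) >= 1 and color == ancestor[-1][3] and class_id == ancestor[-1][4] and chr == ancestor[-1][0]:
                        # Same color/class as the last saved segment on this chromosome:
                        # reopen it, padding arr with placeholder orders drawn from its
                        # range so len(arr) still reflects that segment's gene count.
                        arr.append(ancestor[-1][1])
                        arr += np.random.randint(ancestor[-1][1], ancestor[-1][2], size=ancestor[-1][5]-1).tolist()
                        ancestor.pop()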
                    arr.append(row['order'])
            if len(arr) >= self.limit_length:
                ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])

        ancestor = pd.DataFrame(ancestor)
        # Adjust min and max positions for each chromosome group
        for chr, group in ancestor.groupby(0):
            ancestor.loc[group.index[0], 1] = 1
            ancestor.loc[group.index[-1], 2] = lens[chr]
        ancestor[4] = ancestor[4].astype(int)
        return ancestor[[0, 1, 2, 3, 4, 5]]

    def colinear_gene_pairs(self, bkinfo, gff1, gff2):
        gff1 = gff1.reset_index()
        gff2 = gff2.reset_index()
        
        gff1_indexed = gff1.set_index(['chr', 'order'])
        gff2_indexed = gff2.set_index(['chr', 'order'])
        
        data = []
        for _, row in bkinfo.iterrows():
            b1 = list(map(int, row['block1'].split('_')))
            b2 = list(map(int, row['block2'].split('_')))

            for order1, order2 in zip(b1, b2):
                a = gff1_indexed.loc[(row['chr1'], order1), 1]
                b = gff2_indexed.loc[(row['chr2'], order2), 1]
                data.append([a, b])
        return pd.DataFrame(data)
    
    def new_ancestor(self, ancestor, gff1, gff2, blast):
        # Iterate through ancestor rows to adjust positions based on neighboring rows
        for i in range(1, len(ancestor)):
            if ancestor.iloc[i, 0] == ancestor.iloc[i-1, 0]:
                area = ancestor.iloc[i, 1] - ancestor.iloc[i-1, 2]
                if area <= 5:
                    ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1
                else:
                    index1 = gff1[(gff1['chr'] == ancestor.iloc[i, 0]) &
                                (gff1['order'] >= ancestor.iloc[i-1, 2]+1) &
                                (gff1['order'] <= ancestor.iloc[i, 1]-1)].index
                    index2 = gff2[gff2['color'] == ancestor.iloc[i-1, 3]].index
                    index3 = gff2[gff2['color'] == ancestor.iloc[i, 3]].index

                    newblast1 = blast[(blast[0].isin(index1)) & (blast[1].isin(index2))]
                    newblast2 = blast[(blast[0].isin(index1)) & (blast[1].isin(index3))]

                    if len(newblast1) >= len(newblast2):
                        ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1
                    else:
                        ancestor.iloc[i, 1] = ancestor.iloc[i-1, 2] + 1
        for chr, group in ancestor.groupby(0):
            if len(group) == 1:
                continue
            newgff1 = gff1[gff1['chr'] == chr]
            for i in range(1, len(group)):
                if group.iloc[i, 5] > 200:
                    continue

                index_left = newgff1[(newgff1['order'] >= group.iloc[i, 1]) &
                                (newgff1['order'] <= group.iloc[i, 2])].index
                blast_left = blast[blast[0].isin(index_left)]

                index_prev = gff2[gff2['color'] == group.iloc[i-1, 3]].index
                blast_prev = blast_left[blast_left[1].isin(index_prev)]

                index_curr = gff2[gff2['color'] == group.iloc[i, 3]].index
                blast_curr = blast_left[blast_left[1].isin(index_curr)]

                if len(blast_curr) <= len(blast_prev):
                    ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]-1,3]

                if i < len(group)-1:
                    index_next = gff2[gff2['color'] == group.iloc[i+1, 3]].index
                    blast_next = blast_left[blast_left[1].isin(index_next)]
                    if len(blast_next) > max(len(blast_prev),len(blast_curr)):
                        ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]+1,3]
        
        ancestor['group'] = (ancestor[0].shift(1) != ancestor[0]) | (ancestor[3].shift(1) != ancestor[3]) | (ancestor[4].shift(1) != ancestor[4])
        ancestor['group'] = ancestor['group'].cumsum()
        result = ancestor.groupby('group').agg({
            0: 'first',
            1: 'min',
            2: 'max',
            3: 'first',
            4: 'first',
        }).reset_index(drop=True)

        return result

    def run(self):
        # Read and process block information
        bkinfo = pd.read_csv(self.blockinfo, index_col='id')
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        if self.blockinfo_reverse == True:
            bkinfo[['chr1', 'chr2']] =  bkinfo[['chr2', 'chr1']]
            bkinfo[['block1', 'block2']] =  bkinfo[['block2', 'block1']]
        bkinfo = bkinfo[bkinfo['length'] > int(self.block_length)]

        # Read GFF and lens data
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        lens = base.newlens(self.the_other_lens, self.position)
        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)
        # blast.drop_duplicates(subset=[0], keep='first', inplace=True)

        # Find colinear gene pairs
        pairs = self.colinear_gene_pairs(bkinfo, gff1, gff2)

        # Depending on available attributes, call either karyotype_top or karyotype_left
        if hasattr(self, 'ancestor_top'):
            ancestor = base.read_classification(self.ancestor_top)
            data = self.karyotype_top(pairs, ancestor, gff1, gff2)
        elif hasattr(self, 'ancestor_left'):
            ancestor = base.read_classification(self.ancestor_left)
            data = self.karyotype_left(pairs, ancestor, gff1, gff2)
            gff1, gff2 = gff2, gff1
            blast.iloc[:, :2] = blast.iloc[:, [1, 0]].to_numpy()
        else:
            print('Missing ancestor file.')
            exit(0)

        # Map the data and create the final ancestor file
        the_other_ancestor_file = self.karyotype_map(data, lens)
        the_other_ancestor_file = self.new_ancestor(the_other_ancestor_file, gff1, gff2, blast)
        the_other_ancestor_file.to_csv(self.the_other_ancestor_file, sep='\t', header=False, index=False)

================================================
FILE: wgdi/ks.py
================================================
import os
import sys
import numpy as np
import pandas as pd
from Bio import SeqIO
import subprocess
from Bio.Phylo.PAML import yn00
import wgdi.base as base


class ks:
    def __init__(self, options):
        base_conf = base.config()
        self.pair_pep_file = 'pair.pep'
        self.pair_cds_file = 'pair.cds'
        self.prot_align_file = 'prot.aln'
        self.mrtrans = 'pair.mrtrans'
        self.pair_yn = 'pair.yn'

        for k, v in base_conf:
            setattr(self, str(k), v)
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{str(k)} = {v}')

    def auto_file(self):
        pairs = []
        with open(self.pairs_file) as f:
            p = ' '.join(f.readlines()[:30])

        # Detect file format and process accordingly
        if 'path length' in p or 'MAXIMUM GAP' in p:
            collinearity = base.read_colinearscan(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif 'MATCH_SIZE' in p or '## Alignment' in p:
            collinearity = base.read_mcscanx(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif '# Alignment' in p:
            collinearity = base.read_collinearity(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif '###' in p:
            collinearity = base.read_jcvi(self.pairs_file)
            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]
        elif ',' in p:
            collinearity = pd.read_csv(self.pairs_file, header=None)
            pairs = collinearity.values.tolist()
        else:
            collinearity = pd.read_csv(self.pairs_file, header=None, sep='\t')
            pairs = collinearity.values.tolist()

        df = pd.DataFrame(pairs).drop_duplicates()
        df[0] = df[0].astype(str)
        df[1] = df[1].astype(str)
        df.index = df[0] + ',' + df[1]
        return df

    def run(self):
        # Load sequence data
        cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta"))
        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, "fasta"))
        df_pairs = self.auto_file()

        # Check if ks file exists and load it, otherwise create a new one
        if os.path.exists(self.ks_file):
            ks = pd.read_csv(self.ks_file, sep='\t').drop_duplicates()
            kscopy = ks.copy()
            names = ks.columns.tolist()
            names[0], names[1] = names[1], names[0]
            kscopy.columns = names
            ks = pd.concat([ks, kscopy])
            ks['id'] = ks['id1'] + ',' + ks['id2']
            df_pairs.drop(np.intersect1d(df_pairs.index, ks['id'].to_numpy()), inplace=True)
            ks_file = open(self.ks_file, 'a+')
        else:
            ks_file = open(self.ks_file, 'w')
            ks_file.write('\t'.join(['id1', 'id2', 'ka_NG86', 'ks_NG86', 'ka_YN00', 'ks_YN00']) + '\n')

        # Filter valid pairs based on sequence data
        df_pairs = df_pairs[
            (df_pairs[0].isin(cds.keys())) & (df_pairs[1].isin(cds.keys())) &
            (df_pairs[0].isin(pep.keys())) & (df_pairs[1].isin(pep.keys()))
        ]

        pairs = df_pairs[[0, 1]].to_numpy()

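        # Heuristic: if both gene IDs of the first pair share a 3-character prefix,
        # treat the input as a self-comparison and drop reciprocal (b,a) duplicates.
        if len(pairs) > 0 and pairs[0][0][:3] == pairs[0][1][:3]: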
            allpairs = []
            pair_hash = {}
            for k in pairs:
                if k[0] + ',' + k[1] in pair_hash or k[1] + ',' + k[0] in pair_hash:
                    continue
                else:
                    pair_hash[k[0] + ',' + k[1]] = 1
                    pair_hash[k[1] + ',' + k[0]] = 1
                    allpairs.append(k)
            pairs = allpairs

        for k in pairs:
            cds_gene1, cds_gene2 = cds[k[0]], cds[k[1]]
            cds_gene1.id, cds_gene2.id = 'gene1', 'gene2'
            pep_gene1, pep_gene2 = pep[k[0]], pep[k[1]]
            pep_gene1.id, pep_gene2.id = 'gene1', 'gene2'

            # Write sequences to files
            SeqIO.write([cds[k[0]], cds[k[1]]], self.pair_cds_file, "fasta")
            SeqIO.write([pep[k[0]], pep[k[1]]], self.pair_pep_file, "fasta")

            # Compute Ka/Ks values
            kaks = self.pair_kaks(['gene1', 'gene2'])
            # pair_kaks returns an empty list on failure, so test for emptiness
            if not kaks:
                continue

            ks_file.write('\t'.join([str(i) for i in list(k) + list(kaks)]) + '\n')

        ks_file.close()

        # Clean up temporary files
        for file in [
            self.pair_pep_file, self.pair_cds_file, self.mrtrans, self.pair_yn,
            self.prot_align_file, '2YN.dN', '2YN.dS', '2YN.t', 'rst', 'rst1', 'yn00.ctl', 'rub'
        ]:
            try:
                os.remove(file)
            except OSError:
                pass

    def pair_kaks(self, k):
        self.align()
        pal = self.pal2nal()
        if not pal:
            return []

        kaks = self.run_yn00()
        if kaks is None:
            return []

        kaks_new = [
            kaks[k[0]][k[1]]['NG86']['dN'], kaks[k[0]][k[1]]['NG86']['dS'],
            kaks[k[0]][k[1]]['YN00']['dN'], kaks[k[0]][k[1]]['YN00']['dS']
        ]
        return kaks_new

    def align(self):
        if self.align_software == 'mafft':
            try:
                command = [self.mafft_path, '--quiet', self.pair_pep_file, '>', self.prot_align_file]
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running MAFFT: {e}")

        elif self.align_software == 'muscle':
            try:
                command = [self.muscle_path, '-align', self.pair_pep_file, '-output', self.prot_align_file, '-quiet']
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running Muscle: {e}")

    def pal2nal(self):
        args = ['perl', self.pal2nal_path, self.prot_align_file, self.pair_cds_file, '-output paml -nogap', '>' + self.mrtrans]
        command = ' '.join(args)
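        # os.system signals failure through its return status, not an exception,
        # so compare the status with 0 rather than wrapping the call in try/except.
        return os.system(command) == 0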

    def run_yn00(self):
        yn = yn00.Yn00()
        yn.alignment = self.mrtrans
        yn.out_file = self.pair_yn
        yn.set_options(icode=0, commonf3x4=0, weighting=0, verbose=1)

        try:
            run_result = yn.run(command=self.yn00_path)
        except Exception:
            run_result = None
        return run_result


================================================
FILE: wgdi/ks_peaks.py
================================================
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

import wgdi.base as base

class kspeaks:
    def __init__(self, options):
        # Default values
        self.tandem_length = 200
        # figsize and area are kept as strings because they are parsed with split(',') below
        self.figsize = '10,6.18'
        self.fontsize = 9
        self.block_length = 3
        self.area = '0,3'
        self.tandem =  True

        # Set options passed in
        for k, v in options:
            setattr(self, str(k), v)
            print(f'{str(k)} = {v}')

        # Convert string values to lists of floats
        self.homo = [float(k) for k in self.homo.split(',')]
        self.ks_area = [float(k) for k in self.ks_area.split(',')]
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.pvalue = float(self.pvalue)
        self.block_length = int(self.block_length)
        self.tandem = base.str_to_bool(self.tandem)

    def remove_tandem(self, bkinfo):
        """
        Remove tandem duplications based on start and end position differences.
        """
        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()
        group.loc[:, 'start'] = group.loc[:, 'start1'] - group.loc[:, 'start2']
        group.loc[:, 'end'] = group.loc[:, 'end1'] - group.loc[:, 'end2']
        
        # Drop rows where start or end difference is within tandem length
        index = group[(group['start'].abs() <= self.tandem_length) | 
                      (group['end'].abs() <= self.tandem_length)].index
        bkinfo = bkinfo.drop(index)
        return bkinfo

    def ks_kde(self, df):
        """
        Perform kernel density estimation (KDE) on Ks data.
        """
        # Clean up 'ks' column by removing leading underscores
        df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:]
        
        ks = df['ks'].str.split('_')
        arr = []
        ks_ave = []
        
        # Collect individual Ks values and calculate average Ks per row
        for v in ks.values:
            v = [float(k) for k in v if float(k) >= 0]
            if len(v) == 0:
                continue
            arr.extend(v)
            ks_ave.append(sum(v) / len(v))  # Mean of each row's Ks values
        
        # KDE for three distributions: median, average, total
        kdemedian = gaussian_kde(df['ks_median'].values)
        kdemedian.set_bandwidth(bw_method=kdemedian.factor / 3.)
        
        kdeaverage = gaussian_kde(ks_ave)
        kdeaverage.set_bandwidth(bw_method=kdeaverage.factor / 3.)
        
        kdetotal = gaussian_kde(arr)
        kdetotal.set_bandwidth(bw_method=kdetotal.factor / 3.)

        return [kdemedian, kdeaverage, kdetotal]

    def run(self):
        """
        Main method to process the data, perform KDE, and generate the plot.
        """
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)

        # Read the block info file
        bkinfo = pd.read_csv(self.blockinfo)
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo['length'] = bkinfo['length'].astype(int)

        # Filter based on block length and p-value
        bkinfo = bkinfo[(bkinfo['length'] > self.block_length) &
                        (bkinfo['pvalue'] < self.pvalue)]

        # Remove tandem duplications if needed
        if self.tandem == False:
            bkinfo = self.remove_tandem(bkinfo)

        # Further filtering by the homo{multiple} column range and the Ks median range
        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] >= self.homo[0]]
        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] <= self.homo[1]]
        bkinfo = bkinfo[bkinfo['ks_median'] >= self.ks_area[0]]
        bkinfo = bkinfo[bkinfo['ks_median'] <= self.ks_area[1]]

        # Perform KDE on the Ks data
        kdemedian, kdeaverage, kdetotal = self.ks_kde(bkinfo)

        # Define the range for the x-axis (Ks values)
        dist_space = np.linspace(self.area[0], self.area[1], 500)

        # Plot the KDE results
        ax.plot(dist_space, kdemedian(dist_space), color='red', label='block median')
        ax.plot(dist_space, kdeaverage(dist_space), color='black', label='block average')
        ax.plot(dist_space, kdetotal(dist_space), color='blue', label='all pairs')

        # Set plot labels, grid, and limits
        ax.grid()
        ax.set_xlabel(r'${K_{s}}$', fontsize=20)
        ax.set_ylabel('Frequency', fontsize=20)
        ax.tick_params(labelsize=18)
        ax.set_xlim(self.area)
        ax.legend(fontsize=20)

        # Adjust layout for better display
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)

        # Save the figure
        plt.savefig(self.savefig, dpi=500)
        plt.show()

        # Save the filtered data to CSV
        bkinfo.to_csv(self.savefile, index=False)

================================================
FILE: wgdi/ksfigure.py
================================================
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wgdi.base as base
from scipy import stats


class ksfigure():
    def __init__(self, options):
        # figsize and area are kept as strings because they are parsed with split(',') below
        self.figsize = '10,6.18'
        self.legendfontsize = 30
        self.labelfontsize = 9
        self.area = '0,3'
        self.shadow = True
        self.mode = 'median'
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        if self.xlabel == 'none' or self.xlabel == '':
            self.xlabel = r'Synonymous nucleotide substitution (${K_{s}}$)'
        if self.ylabel == 'none' or self.ylabel == '':
            self.ylabel = 'kernel density of syntenic blocks'
        if self.title == 'none' or self.title == '':
            self.title = ''
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.shadow = base.str_to_bool(self.shadow)

    def Gaussian_distribution(self, t, k):
        # k holds (amplitude, center, width) triples; each term evaluates to
        # amplitude * exp(-((t - center) / width)**2), computed via stats.norm.pdf.
        y = np.zeros(len(t))
        for i in range(0, int((len(k) - 1) / 3)+1):
            if np.isnan(k[3 * i + 2]):
                continue
            # Use local variables so the caller's array is not modified in place
            sigma = float(k[3 * i + 2]) / np.sqrt(2)
            amp = float(k[3 * i + 0]) * np.sqrt(2 * np.pi) * sigma
            y = y + stats.norm.pdf(t, float(k[3 * i + 1]), sigma) * amp
        return y

    def run(self):
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        ksfit = pd.read_csv(self.ksfit, index_col=0)
        t = np.arange(self.area[0], self.area[1], 0.0005)
        col = [k for k in ksfit.columns if re.match('Unnamed:', k)]
        for index, row in ksfit.iterrows():
            ax.plot(t, self.Gaussian_distribution(
                t, row[col].values), linestyle=row['linestyle'], color=row['color'],alpha=0.8, label=index, linewidth=row['linewidth'])
            if self.shadow == True:
                ax.fill_between(t, 0, self.Gaussian_distribution(t, row[col].values),  color=row['color'], alpha=0.15, interpolate=True, edgecolor=None, label=index,)
        align = dict(family='Arial', verticalalignment="center",
                     horizontalalignment="center")
        ax.set_xlabel(self.xlabel, fontsize=self.labelfontsize,
                      labelpad=20, **align)
        ax.set_ylabel(self.ylabel, fontsize=self.labelfontsize,
                      labelpad=20, **align)
        ax.set_title(self.title, weight='bold',
                     fontsize=self.labelfontsize, **align)
        plt.tick_params(labelsize=10)
        handles,labels = ax.get_legend_handles_labels()
        df = pd.DataFrame({  'handles': handles, 'labels': labels})
        df.drop_duplicates(subset='labels', keep='first', inplace=True)
        handles, labels = df['handles'].tolist(), df['labels'].tolist()
        # The legend is identical whether or not the shaded fill is drawn
        plt.legend(handles=handles,labels=labels,loc='upper right', prop={
               'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)


================================================
FILE: wgdi/peaksfit.py
================================================
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.stats import gaussian_kde, linregress

import wgdi.base as base


class peaksfit():
    def __init__(self, options):
        # figsize and area are kept as strings because they are parsed with split(',') below
        self.figsize = '10,6.18'
        self.fontsize = 9
        self.area = '0,3'
        self.mode = 'median'
        self.histogram_only = False
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        self.figsize = [float(k) for k in self.figsize.split(',')]
        self.area = [float(k) for k in self.area.split(',')]
        self.bins_number = int(self.bins_number)
        self.peaks = 1
        self.histogram_only = base.str_to_bool(self.histogram_only)

    def ks_values(self, df):
        df.loc[df['ks'].str.startswith('_'),'ks']= df.loc[df['ks'].str.startswith('_'),'ks'].str[1:]
        ks = df['ks'].str.split('_')
        ks_total = []
        ks_average = []
        for v in ks.values:
            ks_total.extend([float(k) for k in v])
        ks_average = df['ks_average'].values
        ks_median = df['ks_median'].values
        return [ks_median, ks_average, ks_total]

    def gaussian_fuc(self, x, *params):
        y = np.zeros_like(x)
        for i in range(0, len(params), 3):
            amp = float(params[i])
            ctr = float(params[i+1])
            wid = float(params[i+2])
            y = y + amp * np.exp(-((x - ctr)/wid)**2)
        return y

    def kde_fit(self, data, x):
        kde = gaussian_kde(data)
        kde.set_bandwidth(bw_method=kde.factor/3.)
        p = kde(x)
        guess = [1,1, 1]*self.peaks
        popt, pcov = curve_fit(self.gaussian_fuc, x, p, guess, maxfev = 80000)
        popt = [abs(k) for k in popt]
        data = []
        y = self.gaussian_fuc(x, *popt)
        for i in range(0, len(popt), 3):
            array = [popt[i], popt[i+1], popt[i+2]]
            data.append(self.gaussian_fuc(x, *array))
        slope, intercept, r_value, p_value, std_err = linregress(p, y)
        print("\nR-square: "+str(r_value**2))
        print("The gaussian fitting curve parameters are :")
        print('  |  '.join([str(k) for k in popt]))
        return y, data

    def run(self):
        plt.rcParams['ytick.major.pad'] = 0
        fig, ax = plt.subplots(figsize=self.figsize)
        bkinfo = pd.read_csv(self.blockinfo)
        ks_median, ks_average, ks_total = self.ks_values(bkinfo)
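        # Select the Ks series by name instead of calling eval() on a config value
        data = {'median': ks_median, 'average': ks_average, 'total': ks_total}[self.mode]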
        data = [k for k in data if self.area[0] <= k <= self.area[1]]
        x = np.linspace(self.area[0], self.area[1], self.bins_number)
        n, bins, patches = ax.hist(data, int(
            self.bins_number), density=1, facecolor='blue', alpha=0.3, label='Histogram')
        if not self.histogram_only:
            y, fit = self.kde_fit(data, x)
            ax.plot(x, y, color='black', linestyle='-', label='Gaussian fitting')
        ax.grid()
        align = dict(family='Arial', verticalalignment="center",
                     horizontalalignment="center")
        ax.set_xlabel(r'${K_{s}}$', fontsize=20)
        ax.set_ylabel('Frequency', fontsize=20)
        ax.tick_params(labelsize=18)
        ax.legend(fontsize=20)
        ax.set_xlim(self.area)
        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)
        plt.savefig(self.savefig, dpi=500)
        plt.show()
        sys.exit(0)


================================================
FILE: wgdi/pindex.py
================================================
import os
import sys

import numpy as np
import pandas as pd
import wgdi.base as base


class pindex():
    def __init__(self, options):
        self.remove_delta = True
        self.position = 'order'
        self.retention = 0.05
        self.diff = 0.05
        self.gap = 50
        for k, v in options:
            setattr(self, str(k), v)
            print(k, ' = ', v)
        self.gap = int(self.gap)
        self.retention = float(self.retention)
        self.diff = float(self.diff)

    def Pindex(self, sub1, sub2):
        r1 = self.retain(sub1)
        r2 = self.retain(sub2)
        r = []
        for i in range(len(r2)):
            if(r1[i] < self.retention or r2[i] < self.retention):
                r.append(0)
                continue
            d = (r1[i]-r2[i])/(r1[i]+r2[i])*0.5
            if d > self.diff:
                r.append(1)
            elif -d > self.diff:
                r.append(-1)
            else:
                r.append(0)
        a, b, c = len([i for i in r if i == 1]), len(
            [i for i in r if i == -1]), len([i for i in r if i == 0])
        return [a, -b, c, len(r)]

    def retain(self, arr):
        # Mean retention in windows of up to 2*gap genes, centered every 2*gap genes
        a = []
        for i in range(0, len(arr), 2*self.gap):
            start, end = i-self.gap, i+self.gap
            genenum, retainnum = 0, 0
            for j in range(start, end):
                if((j >= int(len(arr))) or (j < 0)):
                    continue
                else:
                    retainnum += arr[j]
                    genenum += 1
            a.append(float(retainnum/genenum))
        return a

    def run(self):
        alignment = pd.read_csv(self.alignment, header=None, index_col=0)
        alignment.replace(r'\w+', 1, regex=True, inplace=True)
        alignment.replace('.', 0, inplace=True)
        alignment.fillna(0, inplace=True)
        gff = base.newgff(self.gff)
        lens = base.newlens(self.lens, self.position)
        gff = gff[gff['chr'].isin(lens.index)]
        alignment = alignment.join(gff[['chr', self.position]], how='left')
        alignment.dropna(axis=0, how='any', inplace=True)
        p = self.cal_pindex(alignment)
        print('Polyploidy-index: ', p)
        sys.exit(0)

    def cal_pindex(self, alignment):
        data, df = [], []
        columns = alignment.columns[:-2].tolist()
        for i in range(len(columns)-1):
            for j in range(i+1, len(columns)):
                b = []
                for chr_name, group in alignment.groupby('chr'):
                    sub1 = group.loc[:, columns[i]].tolist()
                    sub2 = group.loc[:, columns[j]].tolist()
                    p = self.Pindex(sub1, sub2)
                    b.append(p)
                    df.append([i, j, chr_name]+p)
                # k[0] >= 0 counts windows where sub1 is higher; k[1] <= 0 counts sub2
                sub_diver = sum([abs(k[0]+k[1]) for k in b])
                if self.remove_delta == True:
                    sub_total = sum([abs(k[1])+abs(k[0]) for k in b])
                else:
                    sub_total = sum([abs(k[1])+abs(k[0])+abs(k[2]) for k in b])
                # Avoid dividing by zero when no window qualifies
                c = sub_diver/sub_total if sub_total else 0
                data.append(c)
        df = pd.DataFrame(df, columns=[
                          'sub1', 'sub2', 'chr', 'sub1_high', 'sub2_high', 'No_diff', 'Total'])
        df['sub2_high'] = df['sub2_high'].abs()
        self.infomation(df)
        print('\nPolyploidy-index between subgenomes are ', data)
        return sum(data)/len(data)

    def turn_percentage(self, x):
        return '(%.2f%%)' % (x * 100)

    def infomation(self, df):
        data = []
        for names, group in df.groupby(['sub1', 'sub2']):
            newgroup = pd.concat([group.head(1), group],
                                 axis=0, ignore_index=True)
            cols = ['sub1_high', 'sub2_high', 'No_diff', 'Total']
            newgroup.loc[0, cols] = group.loc[:, cols].sum()
            group1 = newgroup.copy()
            group1[cols] = group1[cols].astype(str)
            newgroup['sub1_high'] = (
                newgroup['sub1_high'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['sub2_high'] = (
                newgroup['sub2_high'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['No_diff'] = (
                newgroup['No_diff'] / newgroup['Total']).apply(self.turn_percentage)
            newgroup['Total'] = (
                newgroup['Total'] / group['Total'].sum()).apply(self.turn_percentage)
            newgroup[cols] = group1[cols]+newgroup[cols]
            group_list = []
            a = newgroup[['chr']+cols].columns.to_numpy()
            a[0] = 'Chromosome'
            a[1], a[2] = 'Sub_'+str(names[0]+1), 'Sub_'+str(names[1]+1)
            group_list.append(a)
            b = newgroup[['chr']+cols].to_numpy()
            b[0][0] = 'Total'
            for k in b:
                group_list.append(k)
            group_list = np.array(group_list).T
            for k in group_list:
                data.append(k)
        data = pd.DataFrame(data)
        data.to_csv(self.savefile, header=None, index=None)
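

# Illustrative sketch only, not part of the WGDI API. It restates the window
# logic of pindex.retain and pindex.Pindex above as two standalone functions so
# the arithmetic can be checked on toy data. The function names and the default
# thresholds are assumptions made for this example; they mirror the defaults
# set in pindex.__init__ (retention = 0.05, diff = 0.05).
def _window_retention_sketch(arr, gap):
    """Mean of 0/1 retention flags in windows centered every 2*gap genes."""
    means = []
    for i in range(0, len(arr), 2 * gap):
        # Clip the window [i - gap, i + gap) to the bounds of the list
        window = arr[max(0, i - gap):min(len(arr), i + gap)]
        means.append(sum(window) / len(window))
    return means


def _classify_windows_sketch(r1, r2, retention=0.05, diff=0.05):
    """Label each window 1 (first higher), -1 (second higher) or 0."""
    labels = []
    for x, y in zip(r1, r2):
        # Windows with very low retention on either side are not compared
        if x < retention or y < retention:
            labels.append(0)
            continue
        d = (x - y) / (x + y) * 0.5
        if d > diff:
            labels.append(1)
        elif -d > diff:
            labels.append(-1)
        else:
            labels.append(0)
    return labels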


================================================
FILE: wgdi/polyploidy_classification.py
================================================
import pandas as pd
import wgdi.base as base


class polyploidy_classification:
    def __init__(self, options):
        self.same_protochromosome = False
        self.same_subgenome = False
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        
        self.same_protochromosome = base.str_to_bool(self.same_protochromosome)
        self.same_subgenome = base.str_to_bool(self.same_subgenome)
        
        # Initialize classid with a default value if not provided
        self.classid = [str(k) for k in getattr(self, 'classid', 'class1,class2').split(',')]

    def run(self):
        # Read input files
        ancestor_left = base.read_classification(self.ancestor_left)
        ancestor_top = base.read_classification(self.ancestor_top)
        bkinfo = pd.read_csv(self.blockinfo)

        # Ensure chr1 and chr2 are treated as strings
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)

        # Filter rows where chr1 and chr2 match ancestor values; copy so the
        # column assignments below do not write to a view of the original frame
        bkinfo = bkinfo[bkinfo['chr1'].isin(ancestor_left[0].values) & bkinfo['chr2'].isin(ancestor_top[0].values)].copy()

        # Initialize additional columns
        bkinfo[self.classid[0]] = 0
        bkinfo[self.classid[1]] = 0
        bkinfo[self.classid[0] + '_color'] = ''
        bkinfo[self.classid[1] + '_color'] = ''
        bkinfo['diff'] = 0.0

        # Processing the first classification (ancestor_left vs chr1)
        for name, group in bkinfo.groupby('chr1'):
            d1 = ancestor_left[ancestor_left[0] == name]
            for index1, row1 in group.iterrows():
                a, b = sorted([row1['start1'], row1['end1']])
                a, b = int(a), int(b)
                for index2, row2 in d1.iterrows():
                    c, d = sorted([row2[1], row2[2]])
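                    # Fraction of the block [a, b) covered by the ancestral region [c, d);
                    # zero-length blocks get 0 instead of raising ZeroDivisionError
                    h = max(0, min(b, d) - max(a, c)) / (b - a) if b > a else 0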
                    if h > bkinfo.loc[index1, 'diff']:
                        bkinfo.loc[index1, 'diff'] = float(h)
                        bkinfo.loc[index1, self.classid[0]] = row2[4]
                        bkinfo.loc[index1, self.classid[0] + '_color'] = row2[3]

        # Reset 'diff' and process the second classification (ancestor_top vs chr2)
        bkinfo['diff'] = 0.0
        for name, group in bkinfo.groupby('chr2'):
            d2 = ancestor_top[ancestor_top[0] == name]
            for index1, row1 in group.iterrows():
                a, b = sorted([row1['start2'], row1['end2']])
                a, b = int(a), int(b)
                for index2, row2 in d2.iterrows():
                    c, d = sorted([row2[1], row2[2]])
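                    # Fraction of the block [a, b) covered by the ancestral region [c, d);
                    # zero-length blocks get 0 instead of raising ZeroDivisionError
                    h = max(0, min(b, d) - max(a, c)) / (b - a) if b > a else 0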
                    if h > bkinfo.loc[index1, 'diff']:
                        bkinfo.loc[index1, 'diff'] = float(h)
                        bkinfo.loc[index1, self.classid[1]] = row2[4]
                        bkinfo.loc[index1, self.classid[1] + '_color'] = row2[3]

        # Optionally keep only blocks whose two sides share the same
        # protochromosome color and/or the same subgenome class
        if self.same_protochromosome == True:
            bkinfo = bkinfo[bkinfo[self.classid[1] + '_color'] == bkinfo[self.classid[0] + '_color']]
        if self.same_subgenome == True:
            bkinfo = bkinfo[bkinfo[self.classid[1]] == bkinfo[self.classid[0]]]  

        # Save the result to a CSV file
        bkinfo.to_csv(self.savefile, index=False)
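

# Illustrative sketch only, not part of the WGDI API. The classification above
# assigns each block to the ancestral region that covers the largest fraction
# of it. This hypothetical helper shows that fraction for integer, half-open
# intervals, which is how the range()-based expression in run() treats them.
def _overlap_fraction_sketch(a, b, c, d):
    """Fraction of the interval [a, b) that is covered by [c, d)."""
    if b <= a:
        # An empty block cannot be covered by anything
        return 0.0
    return max(0, min(b, d) - max(a, c)) / (b - a)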


================================================
FILE: wgdi/retain.py
================================================
import matplotlib.pyplot as plt
import pandas as pd
import wgdi.base as base

class retain:
    def __init__(self, options):
        self.position = 'order'
        
        # Initialize the options by setting attributes dynamically
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{str(k)} = {v}")

        # Handle the ylim parameter, which defines the y-axis limits
        self.ylim = [float(k) for k in self.ylim.split(',')] if hasattr(self, 'ylim') else [0, 1]
        
        # Handle the colors and figsize parameters
        self.colors = [str(k) for k in self.colors.split(',')]
        self.figsize = [float(k) for k in self.figsize.split(',')]

    def run(self):
        # Load GFF and lens data
        gff = base.newgff(self.gff)
        lens = base.newlens(self.lens, self.position)
        
        # Filter GFF data based on lens chromosome index
        gff = gff[gff['chr'].isin(lens.index)]
        
        # Load alignment data and join with GFF
        alignment = pd.read_csv(self.alignment, header=None, index_col=0)
        alignment = alignment.join(gff[['chr', self.position]], how='left')
        
        # Perform alignment processing
        self.retain = self.align_chr(alignment)
        
        # Save the processed data to a file
        self.retain[self.retain.columns[:-2]].to_csv(self.savefile, sep='\t', header=None)
        
        # Create a figure for plotting
        fig, axs = plt.subplots(len(lens), 1, sharex=True, sharey=True, figsize=tuple(self.figsize))
        fig.add_subplot(111, frameon=False)
        
        align = dict(family='DejaVu Sans', verticalalignment="center", horizontalalignment="center")

        
        # Hide all the spines and ticks on the plot
        for spine in plt.gca().spines.values():
            spine.set_visible(False)
        plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
        
        # Group the retain data by chromosome and plot each chromosome's data
        groups = self.retain.groupby('chr')
        for i, chr_name in enumerate(lens.index):
            group = groups.get_group(chr_name)

            if len(lens) == 1:
                for j, col in enumerate(self.retain.columns[:-2]):
                    # Use the configured position column; only that column is joined from the GFF
                    axs.plot(group[self.position].values, group[col].values,
                                linestyle='-', color=self.colors[j], linewidth=1)
                axs.spines['right'].set_visible(False)
                axs.spines['top'].set_visible(False)
                axs.set_ylim(self.ylim)
                axs.tick_params(labelsize=12)                
            else:
                # Plot each column's data for the current chromosome
                for j, col in enumerate(self.retain.columns[:-2]):
                    # Use the configured position column; only that column is joined from the GFF
                    axs[i].plot(group[self.position].values, group[col].values,
                                linestyle='-', color=self.colors[j], linewidth=1)
            
                # Hide the right and top spines for each subplot
                axs[i].spines['right'].set_visible(False)
                axs[i].spines['top'].set_visible(False)
                axs[i].set_ylim(self.ylim)
                axs[i].tick_params(labelsize=12)

        for i, chr_name in enumerate(lens.index):
            if len(lens) == 1:
                x, y = axs.get_xlim()[1] * 0.90, axs.get_ylim()[1] * 0.8
                axs.text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align)
            else:
                # Add a label for the reference genome and chromosome
                x, y = axs[i].get_xlim()[1] * 0.90, axs[i].get_ylim()[1] * 0.8
                axs[i].text(x, y, f"{self.refgenome} {chr_name}", fontsize=14, **align)
        
        # Adjust layout and save the figure as an image
        plt.ylabel(f"{self.ylabel}\n\n\n\n", fontsize=18, **align)
        plt.subplots_adjust(left=0.1, right=0.95, top=0.95, bottom=0.05)
        plt.savefig(self.savefig, dpi=500)
        plt.show()

    def align_chr(self, alignment):
        """
        Perform the alignment processing for each chromosome by updating the values.
        """
        for i in alignment.columns[:-2]:
            # Update values: set '1' for valid values, '0' for invalid, and fill NaN with 0
            alignment.loc[alignment[i].str.contains(r'\w', na=False), i] = 1
            alignment.loc[alignment[i] == '.', i] = 0
            alignment.loc[alignment[i] == ' ', i] = 0
            alignment[i] = alignment[i].astype('float64').fillna(0)
            
            # Apply the moving average function to each group by chromosome
            for chr_name, group in alignment.groupby(['chr']):
                a = self.moving_average(group[i].values.tolist())
                alignment.loc[group.index, i] = a
        return alignment

    def moving_average(self, arr):
        """
        Calculate a moving average over a specified window size.
        This function smooths the input array using a sliding window.
        """
        a = []
        for i in range(len(arr)):
            # Define the window range
            start, end = max(0, i - int(self.step)), min(len(arr), i + int(self.step))
            ave = sum(arr[start:end]) / (end - start)
            a.append(ave)
        return a
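

# Illustrative sketch only, not part of the WGDI API. It repeats the window
# arithmetic of retain.moving_average above as a standalone function. Note that
# the window is [i - step, i + step), so it holds `step` values before position
# i but only `step - 1` after it. The function name is an assumption made for
# this example.
def _moving_average_sketch(arr, step):
    """Smooth a list with the same clipped window as retain.moving_average."""
    smoothed = []
    for i in range(len(arr)):
        # Clip the window to the bounds of the list
        start, end = max(0, i - step), min(len(arr), i + step)
        smoothed.append(sum(arr[start:end]) / (end - start))
    return smoothed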


================================================
FILE: wgdi/run.py
================================================
import argparse
import os
import shutil
import sys

import wgdi
import wgdi.base as base
from wgdi.align_dotplot import align_dotplot
from wgdi.block_correspondence import block_correspondence
from wgdi.block_info import block_info
from wgdi.block_ks import block_ks
from wgdi.circos import circos
from wgdi.dotplot import dotplot
from wgdi.karyotype import karyotype
from wgdi.karyotype_mapping import karyotype_mapping
from wgdi.ks import ks
from wgdi.ks_peaks import kspeaks
from wgdi.ksfigure import ksfigure
from wgdi.peaksfit import peaksfit
from wgdi.pindex import pindex
from wgdi.polyploidy_classification import polyploidy_classification
from wgdi.retain import retain
from wgdi.run_colliearity import mycollinearity
from wgdi.trees import trees
from wgdi.ancestral_karyotype import ancestral_karyotype
from wgdi.ancestral_karyotype_repertoire import ancestral_karyotype_repertoire
from wgdi.shared_fusion import shared_fusion
from wgdi.fusion_positions_database import fusion_positions_database
from wgdi.fusions_detection import fusions_detection


# Argument parser setup
parser = argparse.ArgumentParser(
    prog='wgdi', usage='%(prog)s [options]', epilog="",
    formatter_class=argparse.RawDescriptionHelpFormatter
)

parser.description = '''\
WGDI(Whole-Genome Duplication Integrated): A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes.

    https://wgdi.readthedocs.io/en/latest/
    -------------------------------------- 
'''

parser.add_argument("-v", "--version", action='version', version='0.75')
parser.add_argument("-d", dest="dotplot", help="Show homologous gene dotplot")
parser.add_argument("-icl", dest="improvedcollinearity", help="Improved version of ColinearScan ")
parser.add_argument("-ks", dest="calks", help="Calculate Ka/Ks for homologous gene pairs by YN00")
parser.add_argument("-bk", dest="blockks", help="Show Ks of blocks in a dotplot")
parser.add_argument("-bi", dest="blockinfo", help="Collinearity and Ks speculate whole genome duplication")
parser.add_argument("-c", dest="correspondence", help="Extract event-related genomic alignment")
parser.add_argument("-kp", dest="kspeaks", help="A simple way to get ks peaks")
parser.add_argument("-kf", dest="ksfigure", help="A simple way to draw ks distribution map")
parser.add_argument("-pf", dest="peaksfit", help="Gaussian fitting of ks distribution")
parser.add_argument("-pc", dest="polyploidy_classification", help="Polyploid distinguish among subgenomes")
parser.add_argument("-a", dest="alignment", help="Show event-related genomic alignment in a dotplot")
parser.add_argument("-k", dest="karyotype", help="Show genome evolution from reconstructed ancestors")
parser.add_argument("-ak", dest="ancestral_karyotype", help="Generation of ancestral karyotypes from chromosomes that retain same structures in genomes")
parser.add_argument("-akr", dest="ancestral_karyotype_repertoire", help="Incorporate genes from collinearity blocks into the ancestral karyotype repertoire")
parser.add_argument("-km", dest="karyotype_mapping", help="Mapping from the known karyotype result to this species")
parser.add_argument("-fpd", dest="fusion_positions_database", help="Extract the fusion positions dataset")
parser.add_argument("-fd", dest="fusions_detection", help="Determine whether these fusion events occur in other genomes")
parser.add_argument("-sf", dest="shared_fusion", help="Quickly find shared fusions between species")
parser.add_argument("-at", dest="alignmenttrees", help="Collinear genes construct phylogenetic trees")
parser.add_argument("-p", dest="pindex", help="Polyploidy-index characterize the degree of divergence between subgenomes of a polyploidy")
parser.add_argument("-r", dest="retain", help="Show subgenomes in gene retention or genome fractionation")
parser.add_argument("-ci", dest="circos", help="A simple way to run circos")
parser.add_argument("-conf", dest="configure", help="Display and modify the environment variable")

args = parser.parse_args()

# Function to run subprograms based on options
def run_subprogram(program, conf, name):
    options = base.load_conf(conf, name)
    r = program(options)
    r.run()

# Function to configure environment
def run_configure():
    base.rewrite(args.configure, 'ini')

# Main function to decide which module to run based on input arguments
def module_to_run(argument, conf):
    switcher = {
        'dotplot': (dotplot, conf, 'dotplot'),
        'correspondence': (block_correspondence, conf, 'correspondence'),
        'alignment': (align_dotplot, conf, 'alignment'),
        'retain': (retain, conf, 'retain'),
        'blockks': (block_ks, conf, 'blockks'),
        'blockinfo': (block_info, conf, 'blockinfo'),
        'calks': (ks, conf, 'ks'),
        'circos': (circos, conf, 'circos'),
        'kspeaks': (kspeaks, conf, 'kspeaks'),
        'peaksfit': (peaksfit, conf, 'peaksfit'),
        'ksfigure': (ksfigure, conf, 'ksfigure'),
        'pindex': (pindex, conf, 'pindex'),
        'alignmenttrees': (trees, conf, 'alignmenttrees'),
        'improvedcollinearity': (mycollinearity, conf, 'collinearity'),
        'configure': run_configure,
        'polyploidy_classification': (polyploidy_classification, conf, 'polyploidy classification'),
        'karyotype': (karyotype, conf, 'karyotype'),
        'ancestral_karyotype': (ancestral_karyotype, conf, 'ancestral_karyotype'),
        'karyotype_mapping': (karyotype_mapping, conf, 'karyotype_mapping'),
        'ancestral_karyotype_repertoire': (ancestral_karyotype_repertoire, conf, 'ancestral_karyotype_repertoire'),
        'shared_fusion': (shared_fusion, conf, 'shared_fusion'),
        'fusion_positions_database': (fusion_positions_database, conf, 'fusion_positions_database'),
        'fusions_detection': (fusions_detection, conf, 'fusions_detection'),
    }
    
    if argument == 'configure':
        run_configure()
    else:
        # Unknown arguments return None; check before unpacking
        entry = switcher.get(argument)
        if entry:
            program, conf, name = entry
            run_subprogram(program, conf, name)


# Main entry point
def main():
    path = wgdi.__path__[0]
    options = {
        'dotplot': 'dotplot.conf',
        'correspondence': 'corr.conf',
        'alignment': 'align.conf',
        'retain': 'retain.conf',
        'blockks': 'blockks.conf',
        'blockinfo': 'blockinfo.conf',
        'calks': 'ks.conf',
        'circos': 'circos.conf',
        'kspeaks': 'kspeaks.conf',
        'ksfigure': 'ksfigure.conf',
        'pindex': 'pindex.conf',
        'alignmenttrees': 'alignmenttrees.conf',
        'peaksfit': 'peaksfit.conf',
        'configure': 'conf.ini',
        'improvedcollinearity': 'collinearity.conf',
        'polyploidy_classification': 'polyploidy_classification.conf',
        'karyotype': 'karyotype.conf',
        'ancestral_karyotype': 'ancestral_karyotype.conf',
        'ancestral_karyotype_repertoire': 'ancestral_karyotype_repertoire.conf',
        'karyotype_mapping': 'karyotype_mapping.conf',
        'shared_fusion': 'shared_fusion.conf',
        'fusion_positions_database': 'fusion_positions_database.conf',
        'fusions_detection': 'fusions_detection.conf',
    }

    for arg in vars(args):
        value = getattr(args, arg)
        if value is not None:
            if value in ['?', 'help', 'example']:
                with open(os.path.join(path, 'example', options[arg])) as f:
                    print(f.read())
                
                if arg == 'ksfigure' and not os.path.exists('ks_fit_result.csv'):
                    shutil.copy2(os.path.join(wgdi.__path__[0], 'example/ks_fit_result.csv'), os.getcwd())
            elif not os.path.exists(value):
                print(f'{value} does not exist')
                sys.exit(1)
            else:
                module_to_run(arg, value)


if __name__ == "__main__":
    main()


================================================
FILE: wgdi/run_colliearity.py
================================================
import gc
import re
import sys
from multiprocessing import Pool

import numpy as np
import pandas as pd

import wgdi.base as base
import wgdi.collinearity as improvedcollinearity


class mycollinearity():
    def __init__(self, options):
        # Initialize parameters with default values
        self.repeat_number = 10
        self.multiple = 1
        self.score = 100
        self.evalue = 1e-5
        self.blast_reverse = False
        self.over_gap  = 5
        self.comparison = 'genomes'
        self.options = options

        for k, v in options:
            setattr(self, str(k), v)
            print(f"{str(k)} = {v}")
        self.position = 'order'
        # Parse grading values
        if hasattr(self, 'grading'):
            self.grading = [int(k) for k in self.grading.split(',')]
        else:
            self.grading = [50, 40, 25]
        # Ensure process is an integer
        if hasattr(self, 'process'):
            self.process = int(self.process)
        else:
            self.process = 4
        self.over_gap  = int(self.over_gap )
        # Keep the converted value; the bare call discarded its result
        self.blast_reverse = base.str_to_bool(self.blast_reverse)

    def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):
        bluenum = rednum
        blast = blast.sort_values(by=[0, 11], ascending=[True, False])
        def assign_grading(group):
            group['cumcount'] = group.groupby(1).cumcount()
            # Copy the filtered rows so the new column is not written to a view
            group = group[group['cumcount'] <= repeat_number].copy()
            group['grading'] = pd.cut(
                group['cumcount'],
                bins=[-1, 0, bluenum, repeat_number],
                labels=self.grading,
                right=True
            )
            return group
        newblast = blast.groupby(['chr1', 'chr2']).apply(assign_grading).reset_index(drop=True)
        newblast['grading'] = newblast['grading'].astype(int)
        return newblast[newblast['grading'] > 0]
    
    def deal_blast_for_genomes(self, blast, rednum, repeat_number):
        # Initialize the grading column
        blast['grading'] = 0
        
        # Define the blue number as the sum of rednum and the predefined constant
        bluenum = 4 + rednum
        
        # Get the indices for each group by sorting the 11th column in descending order
        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()
                for name, group in blast.groupby([0])]
        
        # Split the indices into red, blue, and gray groups
        reddata = np.array([k[:rednum] for k in index], dtype=object)
        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)
        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)
        
        # Concatenate the results into flat lists
        redindex = np.concatenate(reddata) if reddata.size else []
        blueindex = np.concatenate(bluedata) if bluedata.size else []
        grayindex = np.concatenate(graydata) if graydata.size else []

        # Update the grading column based on the group indices
        blast.loc[redindex, 'grading'] = self.grading[0]
        blast.loc[blueindex, 'grading'] = self.grading[1]
        blast.loc[grayindex, 'grading'] = self.grading[2]

        # Return only the rows with non-zero grading
        return blast[blast['grading'] > 0]

    def run(self):
        # Read and process lens files
        lens1 = base.newlens(self.lens1, 'order')
        lens2 = base.newlens(self.lens2, 'order')
        # Read and process gff files
        gff1 = base.newgff(self.gff1)
        gff2 = base.newgff(self.gff2)
        # Filter gff data based on lens indices
        gff1 = gff1[gff1['chr'].isin(lens1.index)]
        gff2 = gff2[gff2['chr'].isin(lens2.index)]
        # Process blast data

        blast = base.newblast(self.blast, int(self.score), float(self.evalue),gff1, gff2, self.blast_reverse)

        # Map positions and chromosome information
        blast['loc1'] = blast[0].map(gff1[self.position])
        blast['loc2'] = blast[1].map(gff2[self.position])
        blast['chr1'] = blast[0].map(gff1['chr'])
        blast['chr2'] = blast[1].map(gff2['chr'])
        # Apply blast filtering and grading
        if self.comparison.lower() == 'genomes':
            blast = self.deal_blast_for_genomes(blast, int(self.multiple), int(self.repeat_number))
        if self.comparison.lower() == 'chromosomes':
            blast = self.deal_blast_for_chromosomes(blast, int(self.multiple), int(self.repeat_number))
        print(f"The filtered homologous gene pairs are {len(blast)}.\n")
        if len(blast) < 1:
            print("Stopped!\n\nIt may be that the id1 and id2 in the BLAST file do not match with (gff1, lens1) and (gff2, lens2).")
            sys.exit(1)
        # Group blast data by 'chr1' and 'chr2'
        total = []
        for (chr1, chr2), group in blast.groupby(['chr1', 'chr2']):
            total.append([chr1, chr2, group])
        del blast, group
        gc.collect()
        # Determine chunk size for multiprocessing
        n = int(np.ceil(len(total) / float(self.process)))
        result, data = '', []
        try:
            # Initialize multiprocessing Pool
            pool = Pool(self.process)
            for i in range(0, len(total), n):
                # Apply single_pool function asynchronously
                data.append(pool.apply_async(
                    self.single_pool, args=(total[i:i + n], gff1, gff2, lens1, lens2)
                ))
            pool.close()
            pool.join()
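        except BaseException:
            # Stop the workers and propagate the error instead of silently
            # continuing with incomplete results
            pool.terminate()
            raise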
        for k in data:
            # Collect results from async tasks
            text = k.get()
            if text:
                result += text
        # Write final output to file
        result = re.split('\n', result)
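        num = 1
        # The context manager closes the file even if a write fails
        with open(self.savefile, 'w') as fout:
            for line in result:
                if re.match(r"# Alignment", line):
                    # Replace the per-chromosome-pair number with a global one;
                    # split once so later colons in the header are preserved
                    s = f'# Alignment {num}:'
                    fout.write(s + line.split(':', 1)[1] + '\n')
                    num += 1
                    continue
                if len(line) > 0:
                    fout.write(line + '\n')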
        sys.exit(0)

    def single_pool(self, group, gff1, gff2, lens1, lens2):
        text = ''
        for bk in group:
            chr1, chr2 = str(bk[0]), str(bk[1])
            print(f'Running {chr1} vs {chr2}')
            # Extract and sort points
            points = bk[2][['loc1', 'loc2', 'grading']].sort_values(
                by=['loc1', 'loc2'], ascending=[True, True]
            )
            # Initialize collinearity analysis
            collinearity = improvedcollinearity.collinearity(
                self.options, points)
            data = collinearity.run()
            if not data:
                continue
            # Extract gene information
            gf1 = gff1[gff1['chr'] == chr1].reset_index().set_index('order')[[1, 'strand']]
            gf2 = gff2[gff2['chr'] == chr2].reset_index().set_index('order')[[1, 'strand']]
            n = 1
            for block, evalue, score in data:
                if len(block) < self.over_gap:
                    continue
                # Map gene names and strands
                block['name1'] = block['loc1'].map(gf1[1])
                block['name2'] = block['loc2'].map(gf2[1])
                block['strand1'] = block['loc1'].map(gf1['strand'])
                block['strand2'] = block['loc2'].map(gf2['strand'])
                block['strand'] = np.where(
                    block['strand1'] == block['strand2'], '1', '-1'
                )
                # Prepare text output
                block['text'] = block.apply(
                    lambda x: f"{x['name1']} {x['loc1']} {x['name2']} {x['loc2']} {x['strand']}\n",
                    axis=1
                )
                # Determine alignment mark
                a, b = block['loc2'].head(2).values
                mark = 'plus' if a < b else 'minus'
                # Append alignment information
                text += f'# Alignment {n}: score={score} pvalue={evalue} N={len(block)} {chr1}&{chr2} {mark}\n'
                text += ''.join(block['text'].values)
                n += 1
        return text
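

# Illustrative sketch only, not part of the WGDI API. Each worker in
# single_pool numbers its alignments from 1, so run() renumbers the headers
# globally when it writes the merged output. This hypothetical helper shows the
# same renumbering on an in-memory string and returns the output lines.
def _renumber_alignments_sketch(text):
    """Renumber '# Alignment N:' headers consecutively and drop empty lines."""
    lines, num = [], 1
    for line in text.split('\n'):
        if line.startswith('# Alignment'):
            # Keep everything after the first colon and swap in the new number
            lines.append(f'# Alignment {num}:' + line.split(':', 1)[1])
            num += 1
        elif line:
            lines.append(line)
    return lines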

================================================
FILE: wgdi/shared_fusion.py
================================================
import pandas as pd
import wgdi.base as base

class shared_fusion:
    def __init__(self, options):
        for k, v in options:
            setattr(self, str(k), v)
            print(f"{k} = {v}")
        
        # Handle classid and limit_length options
        self.classid = [str(k) for k in self.classid.split(',')] if hasattr(self, 'classid') else ['class1', 'class2']
        self.limit_length = int(self.limit_length) if hasattr(self, 'limit_length') else 20
        
        # Clean and split lens files
        self.lens1 = self.lens1.replace(' ', '').split(',')
        self.lens2 = self.lens2.replace(' ', '').split(',')

    def run(self):
        # Read classification files and block information
        ancestor_left = base.read_classification(self.ancestor_left)
        ancestor_top = base.read_classification(self.ancestor_top)
        bkinfo = pd.read_csv(self.blockinfo)

        # Preprocess blockinfo columns
        bkinfo['chr1'] = bkinfo['chr1'].astype(str)
        bkinfo['chr2'] = bkinfo['chr2'].astype(str)
        bkinfo['start1'] = bkinfo['start1'].astype(int)
        bkinfo['end1'] = bkinfo['end1'].astype(int)
        bkinfo['start2'] = bkinfo['start2'].astype(int)
        bkinfo['end2'] = bkinfo['end2'].astype(int)
        
        # Filter based on ancestor chromosomes; copy so block_fusions can add
        # columns without writing to a view of the original frame
        bkinfo = bkinfo[(bkinfo['chr1'].isin(ancestor_left[0].values)) & 
                        (bkinfo['chr2'].isin(ancestor_top[0].values))].copy()

        # Read lens files
        lens1 = pd.read_csv(self.lens1[0], sep='\t', header=None)
        lens2 = pd.read_csv(self.lens2[0], sep='\t', header=None)
        lens1[0] = lens1[0].astype(str)
        lens2[0] = lens2[0].astype(str)

        # Perform block fusion analysis
        blockinfoout = self.block_fusions(bkinfo, ancestor_left, ancestor_top)

        # Apply filters based on breakpoints and length
        blockinfoout = blockinfoout[(blockinfoout['breakpoints1'] == 1) & 
                                     (blockinfoout['breakpoints2'] == 1)]
        blockinfoout = blockinfoout[(blockinfoout['break_length1'] >= self.limit_length) & 
                                     (blockinfoout['break_length2'] >= self.limit_length)]

        # Save the filtered block info
        blockinfoout.to_csv(self.filtered_blockinfo, index=False)

        # Filter lens data based on the blockinfoout
        lens1 = lens1[lens1[0].isin(blockinfoout['chr1'].values)]
        lens2 = lens2[lens2[0].isin(blockinfoout['chr2'].values)]

        # Save filtered lens data
        lens1.to_csv(self.lens1[1], sep='\t', index=False, header=False)
        lens2.to_csv(self.lens2[1], sep='\t', index=False, header=False)

    def block_fusions(self, bkinfo, ancestor_left, ancestor_top):
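        # Work on a copy so the new columns are not written into a filtered view of the caller's frame
        bkinfo = bkinfo.copy()
        # Initialize new columns in the bkinfo dataframe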
        bkinfo['breakpoints1'] = 0
        bkinfo['breakpoints2'] = 0
        bkinfo['break_length1'] = 0
        bkinfo['break_length2'] = 0

        for index, row in bkinfo.iterrows():
            # Process species 1 (chr1)
            a, b = sorted([row['start1'], row['end1']])
            d1 = ancestor_left[(ancestor_left[0] == row['chr1']) & 
                               (ancestor_left[2] >= a) & (ancestor_left[1] <= b)]
            if len(d1) > 1:
                bkinfo.loc[index, 'breakpoints1'] = 1
                breaklength_max = 0
                for _, row2 in d1.iterrows():
                    # Size of the overlap between [a, b) and [row2[1], row2[2]), computed arithmetically
                    length_in = max(0, min(b, row2[2]) - max(a, row2[1]))
                    length_out = (b - a) - length_in
                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)
                bkinfo.loc[index, 'break_length1'] = breaklength_max

            # Process species 2 (chr2)
            c, d = sorted([row['start2'], row['end2']])
            d2 = ancestor_top[(ancestor_top[0] == row['chr2']) & 
                              (ancestor_top[2] >= c) & (ancestor_top[1] <= d)]
            if len(d2) > 1:
                bkinfo.loc[index, 'breakpoints2'] = 1
                breaklength_max = 0
                for _, row2 in d2.iterrows():
                    # Size of the overlap between [c, d) and [row2[1], row2[2]), computed arithmetically
                    length_in = max(0, min(d, row2[2]) - max(c, row2[1]))
                    length_out = (d - c) - length_in
                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)
                bkinfo.loc[index, 'break_length2'] = breaklength_max

        return bkinfo


================================================
FILE: wgdi/trees.py
================================================
import os
import shutil
from io import StringIO

import numpy as np
import pandas as pd
from Bio import AlignIO, Seq, SeqIO, SeqRecord
import subprocess

import wgdi.base as base


class trees():
    def __init__(self, options):
        base_conf = base.config()
        self.position = 'order'
        self.alignfile = ''
        self.align_trimming = ''
        self.trimming = 'trimal'
        self.threads = '1'
        self.minimum = 4
        self.tree_software = 'iqtree'
        self.delete_detail = True
        for k, v in base_conf:
            setattr(self, str(k), v)
        for k, v in options:
            setattr(self, str(k), v)
            print(str(k), ' = ', v)
        if hasattr(self, 'codon_position'):
            self.codon_position = [
                int(k)-1 for k in self.codon_position.split(',')]
        else:
            self.codon_position = [0, 1, 2]
        self.delete_detail = base.str_to_bool(self.delete_detail)

    def grouping(self, alignment):
        data = []
        indexs = []
        if not os.path.exists(self.dir):
            os.makedirs(self.dir)
        sequence = SeqIO.to_dict(SeqIO.parse(self.sequence_file, "fasta"))
        if hasattr(self, 'cds_file'):
            seq_cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, "fasta"))
        for index, row in alignment.iterrows():
            file = base.gen_md5_id(str(row.values))
            self.sequencefile = os.path.join(self.dir, file+'.fasta')
            self.alignfile = os.path.join(self.dir, file+'.aln')
            self.align_trimming = self.alignfile+'.trimming'
            self.treefile = os.path.join(self.dir, file+'.aln.treefile')
            if os.path.isfile(self.treefile) and os.path.isfile(self.alignfile):
                data.append(self.treefile)
                indexs.append(index)
                continue
            ids = []
            ids_cds = []
            for i in range(len(row)):
                if type(row[i]) == float and np.isnan(row[i]):
                    continue
                gene_sequence = sequence[row[i]]
                gene_sequence.id = str(int(i)+1)
                gene_sequence.description = ''
                ids.append(gene_sequence)
            SeqIO.write(ids, self.sequencefile, "fasta")
            self.align()
            if hasattr(self, 'cds_file'):
                self.seqcdsfile = os.path.join(self.dir, file+'.cds.fasta')
                for i in range(len(row)):
                    if type(row[i]) == float and np.isnan(row[i]):
                        continue
                    gene_cds = seq_cds[row[i]]
                    gene_cds.id = str(int(i)+1)
                    ids_cds.append(gene_cds)
                SeqIO.write(ids_cds, self.seqcdsfile, "fasta")
                self.pal2nal()
                self.codon()
            if self.trimming.upper() == 'TRIMAL':
                self.trimal()
            if self.trimming.upper() == 'DIVVIER':
                self.divvier()
            self.buildtrees()
            if os.path.isfile(self.treefile):
                data.append(self.treefile)
        return data

    def codon(self):
        if self.codon_position == [0, 1, 2]:
            shutil.move(self.alignfile+'.mrtrans', self.alignfile)
            return True
        records = list(SeqIO.parse(self.alignfile+'.mrtrans', 'fasta'))
        if len(records) == 0:
            return False
        newrecords = []
        def final_list(test_list, x, y):
            # From every consecutive chunk of length x, keep the offsets listed in y
            return [test_list[i+j] for i in range(0, len(test_list), x) for j in y]
        for k in records:
            if len(k.seq) % 3 > 0:
                return False
            seq = final_list(k.seq, 3, self.codon_position)
            k.seq = ''.join(seq)
            newrecords.append(SeqRecord.SeqRecord(
                Seq.Seq(k.seq), id=k.id, description=''))
        SeqIO.write(newrecords, self.alignfile, 'fasta')
        return True

    def pal2nal(self):
        args = ['perl', self.pal2nal_path, self.alignfile,
                self.seqcdsfile, '-output fasta', '>'+self.alignfile+'.mrtrans']
        command = ' '.join(args)
        # os.system reports failure through its exit status, not by raising an exception
        return os.system(command) == 0

    def align(self):
        if self.align_software == 'mafft':
            try:
                command = [self.mafft_path,'--quiet', self.sequencefile, '>', self.alignfile]
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running MAFFT: {e}")

        if self.align_software == 'muscle':
            try:
                command = [self.muscle_path,'-align', self.sequencefile, '-output', self.alignfile, '-quiet']
                subprocess.run(" ".join(command), shell=True, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error while running Muscle: {e}")

    def trimal(self):
        args = [self.trimal_path, '-in', self.alignfile,
                '-out', self.align_trimming, '-automated1']
        command = ' '.join(args)
        # os.system reports failure through its exit status, not by raising an exception
        return os.system(command) == 0

    def divvier(self):
        args = [self.divvier_path, '-mincol', '4', '-divvygap', self.alignfile]
        command = ' '.join(args)
        try:
            os.system(command)
            os.rename(self.alignfile+'.divvy.fas', self.align_trimming)
        except:
            return False
        return True

    def buildtrees(self):
        try:
            if self.tree_software.upper() == 'IQTREE':
                args = [self.iqtree_path, '-s', self.align_trimming,
                        '-m', self.model, '-T', self.threads, '--quiet']
                command = ' '.join(args)
                os.system(command)
                os.rename(self.align_trimming+'.treefile', self.treefile)
            elif self.tree_software.upper() == 'FASTTREE':
                args = [self.fasttree_path,
                        self.align_trimming, '>', self.treefile]
                command = ' '.join(args)
                os.system(command)
        except:
            return False
        if self.delete_detail == True:
            for file in (self.sequencefile, self.align_trimming+'.bionj', self.align_trimming+'.iqtree', self.align_trimming+'.ckp.gz',
                         self.align_trimming+'.log', self.align_trimming+'.mldist', self.align_trimming+'.model.gz'):
                try:
                    os.remove(file)
                except OSError:
                    pass
        return True

    def run(self):
        alignment = pd.read_csv(self.alignment, header=None)
        alignment.replace('.', np.nan, inplace=True)
        alignment.dropna(thresh=int(self.minimum), inplace=True)
        if hasattr(self, 'gff') and hasattr(self, 'lens'):
            gff = base.newgff(self.gff)
            lens = base.newlens(self.lens, self.position)
            alignment = pd.merge(
                alignment, gff[['chr', self.position]], left_on=0, right_on=gff.index, how='left')
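            # Use the configured position column ('order' by default) instead of a hard-coded name
            alignment.dropna(subset=['chr', self.position], inplace=True)
            alignment[self.position] = alignment[self.position].astype(int)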
            alignment = alignment[alignment['chr'].isin(lens.index)]
            alignment.drop(alignment.columns[-2:], axis=1, inplace=True)
        data = self.grouping(alignment)
        fout = open(self.trees_file, 'w')
        fout.close()
        for i in range(0, len(data), 100):
            trees = ' '.join([str(k) for k in data[i:i+100]])
            args = ['cat', trees, '>>', self.trees_file]
            command = ' '.join([str(k) for k in args])
            os.system(command)
        df = pd.read_csv(self.trees_file, header=None, sep='\t')
        df[0].to_csv(self.trees_file, index=None, sep='\t', header=False)
        print("done")

================================================
FILE: wgdi.egg-info/PKG-INFO
================================================
Metadata-Version: 2.1
Name: wgdi
Version: 0.75
Summary: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes
Home-page: https://github.com/SunPengChuan/wgdi
Author: Pengchuan Sun
Author-email: sunpengchuan@gmail.com
License: BSD License
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.1.0
Requires-Dist: numpy
Requires-Dist: biopython
Requires-Dist: matplotlib
Requires-Dist: scipy
Requires-Dist: tabulate

# WGDI

![Latest PyPI version](https://img.shields.io/pypi/v/wgdi.svg) [![Downloads](https://pepy.tech/badge/wgdi/month)](https://pepy.tech/project/wgdi) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/wgdi/README.html)

| | |
| --- | --- |
| Author  | Pengchuan Sun ([sunpengchuan](https://github.com/sunpengchuan)) |
| Email   | <sunpengchuan@gmail.com> |
| License | [BSD](http://creativecommons.org/licenses/BSD/) |

## Description

**WGDI (Whole-Genome Duplication Integrated analysis)** is a Python-based command-line tool designed to simplify the analysis of whole-genome duplications (WGD) and cross-species genome alignments. It offers three main workflows that enhance the detection and study of WGD events:

## Key Features

### 1. Polyploid Inference
- Identifies and confirms polyploid events with high accuracy.

### 2. Genomic Homology Inference
- Traces the evolutionary history of duplicated regions across species, with a focus on distinguishing subgenomes. 

### 3. Ancestral Karyotyping
- Reconstructs protochromosomes and traces common chromosomal rearrangements to understand chromosome evolution. 


## Installation

WGDI is a Python package with a command-line interface (CLI) for the analysis of whole-genome duplications. It can be deployed on Windows, Linux, and macOS and can be installed via pip or conda.

#### Bioconda

```
conda install -c bioconda  wgdi
```

#### PyPI

```
pip3 install wgdi
```
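
After either installation route, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. The sketch below uses only the Python standard library. The helper name `installed_version` is an assumption made for illustration and is not part of WGDI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="wgdi"):
    """Return the installed version string of a distribution, or None if absent.

    Hypothetical helper for illustration; not part of the WGDI API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found:
        print(f"wgdi version: {found}")
    else:
        print("wgdi is not installed in this environment")
```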

Documentation for installation, along with a user tutorial, a default parameter file, and test data, is provided. Please consult the docs at <http://wgdi.readthedocs.io/en/latest/>.

## Tips

Here are some videos with simple examples of WGDI.

###### [A simple guide to using WGDI (Part 1)](https://www.bilibili.com/video/BV1qK4y1U7eK) or https://youtu.be/k-S6FVcBIQw

###### [A simple guide to using WGDI (Part 2)](https://www.bilibili.com/video/BV195411P7L1) or https://youtu.be/QiZYFYGclyE

QQ chat group: 966612552

## Citing WGDI

If you use wgdi in your work, please cite:

> Sun P., Jiao B., Yang Y., Shan L., Li T., Li X., Xi Z., Wang X., and Liu J. (2022). WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant. doi: https://doi.org/10.1016/j.molp.2022.10.018.

## News

## 0.75
* Fixed some issues (-fpd).
* Introduced a threads parameter for the iqtree command within alignmenttrees (-at).
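
As a rough illustration of the threads change, the sketch below assembles an IQ-TREE argument list with the same flags that `wgdi/trees.py` passes (`-s`, `-m`, `-T`, `--quiet`). The helper `build_iqtree_command`, the file name, and the model string are assumptions chosen for the example; they are not part of WGDI.

```python
import shlex


def build_iqtree_command(iqtree_path, alignment, model, threads="1"):
    """Return the IQ-TREE argument list (hypothetical helper, not WGDI API)."""
    # -T sets the number of threads; str() keeps the list join-safe for int input
    return [iqtree_path, "-s", alignment, "-m", model, "-T", str(threads), "--quiet"]


# Example values are placeholders, not defaults taken from WGDI
cmd = build_iqtree_command("iqtree", "example.aln.trimming", "MFP", threads=4)
print(shlex.join(cmd))
```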

## 0.74
* Improved the fusion positions dataset (-fpd).
* Fixed some issues (-pc).

## 0.7.1
* Added extraction of the fusion positions dataset (-fpd).
* Added detection of whether these fusion events occur in other genomes (-fd).
* Improved the karyotype_mapping (-km) effect.
* Fixed a problem caused by the Python version; WGDI is now compatible with Python 3.12.


## 0.6.5
* Fixed some issues (-sf).
* Added new tips to avoid some errors.

## 0.6.4
* Fixed a problem caused by the Python version; WGDI is now compatible with Python 3.11.3.

## 0.6.3
* Fixed some issues (-ks, -sf).

## 0.6.2
* Added finding of shared fusions between species (-sf).

## 0.6.1

* Fixed issue with alignment (-a). Only version 0.6.0 has this bug.

## 0.6.0

* Fixed issue with improved collinearity (-icl).
* Added a parameter 'tandem_ratio' to blockinfo (-bi).

## 0.5.9

* Updated the improved collinearity (-icl). It is faster than before, but still slower than MCScanX and JCVI.
* Fixed issue with ancestral karyotype repertoire (-akr).

## 0.5.8

* Fixed issue with gene names (-ks).

## 0.5.7
* Fixed issue with chromosome order (-ak).
* Fixed issue with gene names (-ks). The issue is not fully fixed in this version; please install the latest version.

## 0.5.5 and 0.5.6
* Added ancestral karyotype (-ak).
* Added ancestral karyotype repertoire (-akr).

## 0.5.4
* Improved the karyotype_mapping (-km) effect.
* Minor changes (-at).

## 0.5.3
* Fixed a legend issue (-kf).
* Fixed a Ks calculation issue (-ks).
* Improved the karyotype_mapping (-km) effect.
* Improved the alignmenttrees (-at) effect.

## 0.5.2
* Fixed some bugs.

## 0.5.1
* Fixed the error of the command (-conf).
* Improved the karyotype_mapping (-km) effect.
* Added support for additional input data sets in alignmenttrees (-at), such as low-copy data sets (for example, single-copy_groups.tsv from the SonicParanoid2 software).

## 0.4.9
* The latest version adds karyotype_mapping (-km) and karyotype (-k) display.
* The latest version changes how the p-value is calculated in collinearity (-icl), making this parameter more sensitive. Therefore, it is recommended to set it to 0.2 instead of 0.05.
* The latest version has also changed the drawing display of ksfigure (-kf) to make it more beautiful.


================================================
FILE: wgdi.egg-info/SOURCES.txt
================================================
LICENSE
README.md
setup.py
wgdi/__init__.py
wgdi/align_dotplot.py
wgdi/ancestral_karyotype.py
wgdi/ancestral_karyotype_repertoire.py
wgdi/base.py
wgdi/block_correspondence.py
wgdi/block_info.py
wgdi/block_ks.py
wgdi/circos.py
wgdi/collinearity.py
wgdi/dotplot.py
wgdi/fusion_positions_database.py
wgdi/fusions_detection.py
wgdi/karyotype.py
wgdi/karyotype_mapping.py
wgdi/ks.py
wgdi/ks_peaks.py
wgdi/ksfigure.py
wgdi/peaksfit.py
wgdi/pindex.py
wgdi/polyploidy_classification.py
wgdi/retain.py
wgdi/run.py
wgdi/run_colliearity.py
wgdi/shared_fusion.py
wgdi/trees.py
wgdi.egg-info/PKG-INFO
wgdi.egg-info/SOURCES.txt
wgdi.egg-info/dependency_links.txt
wgdi.egg-info/entry_points.txt
wgdi.egg-info/requires.txt
wgdi.egg-info/top_level.txt
wgdi.egg-info/zip-safe
wgdi/example/__init__.py
wgdi/example/align.conf
wgdi/example/alignmenttrees.conf
wgdi/example/ancestral_karyotype.conf
wgdi/example/ancestral_karyotype_repertoire.conf
wgdi/example/blockinfo.conf
wgdi/example/blockks.conf
wgdi/example/circos.conf
wgdi/example/collinearity.conf
wgdi/example/conf.ini
wgdi/example/corr.conf
wgdi/example/dotplot.conf
wgdi/example/fusion_positions_database.conf
wgdi/example/fusions_detection.conf
wgdi/example/karyotype.conf
wgdi/example/karyotype_mapping.conf
wgdi/example/ks.conf
wgdi/example/ks_fit_result.csv
wgdi/example/ksfigure.conf
wgdi/example/kspeaks.conf
wgdi/example/peaksfit.conf
wgdi/example/pindex.conf
wgdi/example/polyploidy_classification.conf
wgdi/example/retain.conf
wgdi/example/shared_fusion.conf

================================================
FILE: wgdi.egg-info/dependency_links.txt
================================================


================================================
FILE: wgdi.egg-info/entry_points.txt
================================================
[console_scripts]
wgdi = wgdi.run:main


================================================
FILE: wgdi.egg-info/requires.txt
================================================
pandas>=1.1.0
numpy
biopython
matplotlib
scipy
tabulate


================================================
FILE: wgdi.egg-info/top_level.txt
================================================
wgdi


================================================
FILE: wgdi.egg-info/zip-safe
================================================