[
  {
    "path": "LICENSE",
    "content": "Copyright (c) 2018-2018, Pengchuan Sun\n\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without modification,\nare permitted provided that the following conditions are met:\n\nRedistributions of source code must retain the above copyright notice, this list\nof conditions and the following disclaimer.\n\nRedistributions in binary form must reproduce the above copyright notice, this\nlist of conditions and the following disclaimer in the documentation and/or\nother materials provided with the distribution.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND\nANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\nWARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\nDISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR\nANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\n(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\nLOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON\nANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\n(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\nSOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE."
  },
  {
    "path": "README.md",
    "content": "# WGDI\n\n![Latest PyPI version](https://img.shields.io/pypi/v/wgdi.svg) [![Downloads](https://pepy.tech/badge/wgdi/month)](https://pepy.tech/project/wgdi) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/wgdi/README.html)\n\n| | |\n| --- | --- |\n| Author  | Pengchuan Sun ([sunpengchuan](https://github.com/sunpengchuan)) |\n| Email   | <sunpengchuan@gmail.com> |\n| License | [BSD](http://creativecommons.org/licenses/BSD/) |\n\n## Description\n\n**WGDI (Whole-Genome Duplication Integrated analysis)** is a Python-based command-line tool designed to simplify the analysis of whole-genome duplications (WGD) and cross-species genome alignments. It offers three main workflows for detecting and studying WGD events.\n\n## Key Features\n\n### 1. Polyploid Inference\n- Identifies and confirms polyploid events with high accuracy.\n\n### 2. Genomic Homology Inference\n- Traces the evolutionary history of duplicated regions across species, with a focus on distinguishing subgenomes.\n\n### 3. Ancestral Karyotyping\n- Reconstructs protochromosomes and traces common chromosomal rearrangements to understand chromosome evolution.\n\n\n## Installation\n\nWGDI is distributed as a Python package with a command-line interface (CLI) for the analysis of whole-genome duplications. It runs on Windows, Linux, and macOS and can be installed with pip or conda.\n\n#### Bioconda\n\n```\nconda install -c bioconda wgdi\n```\n\n#### PyPI\n\n```\npip3 install wgdi\n```\n\nDocumentation for installation along with a user tutorial, a default parameter file, and test data are provided. 
Please consult the docs at <http://wgdi.readthedocs.io/en/latest/>.\n\n## Tips\n\nHere are some videos with simple examples of WGDI.\n\n###### [Basic usage of WGDI, part 1 (WGDI的简单使用（一）)](https://www.bilibili.com/video/BV1qK4y1U7eK) or https://youtu.be/k-S6FVcBIQw\n\n###### [Basic usage of WGDI, part 2 (WGDI的简单使用（二）)](https://www.bilibili.com/video/BV195411P7L1) or https://youtu.be/QiZYFYGclyE\n\nQQ chat group: 966612552\n\n## Citing WGDI\n\nIf you use WGDI in your work, please cite:\n\n> Sun P., Jiao B., Yang Y., Shan L., Li T., Li X., Xi Z., Wang X., and Liu J. (2022). WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant. doi: https://doi.org/10.1016/j.molp.2022.10.018.\n\n## News\n\n## 0.75\n* Fixed some issues (-fpd, -km).\n* Introduced a threads parameter for the iqtree command within alignmenttrees (-at).\n\n## 0.74\n* Improved the fusion positions dataset (-fpd).\n* Fixed some issues (-pc).\n\n## 0.7.1\n* Added extraction of the fusion positions dataset (-fpd).\n* Added detection of whether these fusion events occur in other genomes (-fd).\n* Improved the karyotype_mapping (-km) effect.\n* Fixed a Python version compatibility problem; WGDI is now compatible with Python 3.12.\n\n\n## 0.6.5\n* Fixed some issues (-sf).\n* Added new tips to avoid some errors.\n\n## 0.6.4\n* Fixed a Python version compatibility problem; WGDI is now compatible with Python 3.11.3.\n\n## 0.6.3\n* Fixed some issues (-ks, -sf).\n\n## 0.6.2\n* Added detection of shared fusions between species (-sf).\n\n## 0.6.1\n\n* Fixed issue with alignment (-a). Only version 0.6.0 has this bug.\n\n## 0.6.0\n\n* Fixed issue with improved collinearity (-icl).\n* Added a parameter 'tandem_ratio' to blockinfo (-bi).\n\n## 0.5.9\n\n* Updated the improved collinearity (-icl). 
Faster than before, but still slower than MCScanX and JCVI.\n* Fixed issue with ancestral karyotype repertoire (-akr).\n\n## 0.5.8\n\n* Fixed issue with gene names (-ks).\n\n## 0.5.7\n* Fixed issue with chromosome order (-ak).\n* Fixed issue with gene names (-ks). The issue is not fully fixed in this version; please install the latest version.\n\n## 0.5.5 and 0.5.6\n* Added ancestral karyotype (-ak).\n* Added ancestral karyotype repertoire (-akr).\n\n## 0.5.4\n* Improved the karyotype_mapping (-km) effect.\n* Minor change (-at).\n\n## 0.5.3\n* Fixed a legend issue (-kf).\n* Fixed a Ks calculation issue (-ks).\n* Improved the karyotype_mapping (-km) effect.\n* Improved the alignmenttrees (-at) effect.\n\n## 0.5.2\n* Fixed some bugs.\n\n## 0.5.1\n* Fixed the error of the command (-conf).\n* Improved the karyotype_mapping (-km) effect.\n* Added support for low-copy data sets in alignmenttrees (-at), for example single-copy_groups.tsv from the sonicparanoid2 software.\n\n## 0.4.9\n* Added karyotype_mapping (-km) and karyotype (-k) display.\n* Changed how the pvalue is extracted from collinearity (-icl), making this parameter more sensitive. Setting it to 0.2 instead of 0.05 is therefore recommended.\n* Changed the drawing style of ksfigure (-kf) to improve its appearance.\n"
  },
  {
    "path": "__init__.py",
    "content": ""
  },
  {
    "path": "build/lib/wgdi/__init__.py",
    "content": ""
  },
  {
    "path": "build/lib/wgdi/align_dotplot.py",
    "content": "import re\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\nclass align_dotplot:\n    def __init__(self, options):\n        # Default values\n        self.position = 'order'\n        self.figsize = 'default'\n        self.classid = 'class1'\n\n        # Initialize from options\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f'{k} = {v}')\n        \n        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]\n        self.colors = [str(k) for k in getattr(self, 'colors', 'red,blue,green,black,orange').split(',')]\n        self.ancestor_top = None if getattr(self, 'ancestor_top', 'none') == 'none' else self.ancestor_top\n        self.ancestor_left = None if getattr(self, 'ancestor_left', 'none') == 'none' else self.ancestor_left\n\n        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)\n\n    def pair_position(self, alignment, loc1, loc2, colors):\n        alignment.index = alignment.index.map(loc1)\n        data = []\n        for i, k in enumerate(alignment.columns):\n            df = alignment[k].map(loc2).dropna()\n            for idx, row in df.items():\n                data.append([idx, row, colors[i]])\n        return pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])\n\n    def run(self):\n        axis = [0, 1, 1, 0]\n\n        # Lens generation and figure size\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        \n        if re.search(r'\\d', self.figsize):\n            self.figsize = [float(k) for k in self.figsize.split(',')]\n        else:\n            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10\n            \n        plt.rcParams['ytick.major.pad'] = 0\n\n        # Create plot\n        fig, ax = plt.subplots(figsize=self.figsize)\n        ax.xaxis.set_ticks_position('top')\n        step1, step2 = 1 / 
float(lens1.sum()), 1 / float(lens2.sum())\n\n        # Process Ancestor Data\n        if self.ancestor_left:\n            axis[0] = -0.02\n            lens_ancestor_left = self.process_ancestor(self.ancestor_left, lens1.index)\n\n        if self.ancestor_top:\n            axis[3] = -0.02\n            lens_ancestor_top = self.process_ancestor(self.ancestor_top, lens2.index)\n\n        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, \n                           self.genome1_name, self.genome2_name, [0, 1])\n\n        # Process GFF files\n        gff1, gff2 = base.newgff(self.gff1), base.newgff(self.gff2)\n        gff1 = base.gene_location(gff1, lens1, step1, self.position)\n        gff2 = base.gene_location(gff2, lens2, step2, self.position)\n\n        if self.ancestor_top:\n            self.ancestor_position(ax, gff2, lens_ancestor_top, 'top')\n\n        if self.ancestor_left:\n            self.ancestor_position(ax, gff1, lens_ancestor_left, 'left')\n\n        # Process block info and alignment\n        bkinfo = self.process_blockinfo(lens1,lens2)\n        align = self.alignment(gff1, gff2, bkinfo)\n        alignment = align[gff1.columns[-len(bkinfo[self.classid].drop_duplicates()):]]\n        alignment.to_csv(self.savefile, header=False)\n\n        # Create scatter plot\n        df = self.pair_position(alignment, gff1['loc'], gff2['loc'], self.colors)\n        plt.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'], \n                    alpha=0.5, edgecolors=None, linewidths=0, marker='o')\n\n        ax.axis(axis)\n        plt.subplots_adjust(left=0.07, right=0.97, top=0.93, bottom=0.03)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n\n    def process_ancestor(self, ancestor_file, lens_index):\n        df = pd.read_csv(ancestor_file, sep=\"\\t\", header=None)\n        df[0] = df[0].astype(str)\n        df[3] = df[3].astype(str)\n        df[4] = df[4].astype(int)\n        df[4] = df[4] / df[4].max()\n        return 
df[df[0].isin(lens_index)]\n\n    def process_blockinfo(self, lens1, lens2):\n        bkinfo = pd.read_csv(self.blockinfo, index_col='id')\n        if self.blockinfo_reverse ==  True:\n            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]\n            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        bkinfo[self.classid] = bkinfo[self.classid].astype(str)\n        return bkinfo[bkinfo['chr1'].isin(lens1.index) & (bkinfo['chr2'].isin(lens2.index))]\n\n    def alignment(self, gff1, gff2, bkinfo):\n        gff1['uid'] = gff1['chr'] + 'g' + gff1['order'].astype(str)\n        gff2['uid'] = gff2['chr'] + 'g' + gff2['order'].astype(str)\n        gff1['id'] = gff1.index\n        gff2['id'] = gff2.index\n        \n        for cl, group in bkinfo.groupby(self.classid):\n            name = f'l{cl}'\n            gff1[name] = ''\n            group = group.sort_values(by=['length'], ascending=True)\n\n            for _, row in group.iterrows():\n                block = self.create_block_dataframe(row)\n                if block.empty:\n                    continue\n                block1_min, block1_max = block['block1'].agg(['min', 'max'])\n                area = gff1[(gff1['chr'] == row['chr1']) & \n                            (gff1['order'] >= block1_min) & \n                            (gff1['order'] <= block1_max)].index\n                \n                block['id1'] = (row['chr1'] + 'g' + block['block1'].astype(str)).map(\n                    dict(zip(gff1['uid'], gff1.index)))\n                block['id2'] = (row['chr2'] + 'g' + block['block2'].astype(str)).map(\n                    dict(zip(gff2['uid'], gff2.index)))\n\n                gff1.loc[block['id1'].values, name] = block['id2'].values\n                gff1.loc[gff1.index.isin(area) & gff1[name].eq(''), name] = '.'\n        return gff1\n\n    def create_block_dataframe(self, 
row):\n        b1, b2, ks = row['block1'].split('_'), row['block2'].split('_'), row['ks'].split('_')\n        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))\n        block = pd.DataFrame(np.array([b1, b2, ks]).T, columns=['block1', 'block2', 'ks'])\n        block['block1'] = block['block1'].astype(int)\n        block['block2'] = block['block2'].astype(int)\n        block['ks'] = block['ks'].astype(float)\n        return block[(block['ks'] <= self.ks_area[1]) & \n                     (block['ks'] >= self.ks_area[0])].drop_duplicates(subset=['block1'], keep='first')\n\n    def ancestor_position(self, ax, gff, lens, mark):\n        for _, row in lens.iterrows():\n            loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index\n            loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2]))].index\n            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']\n            if mark == 'top':\n                width = abs(loc1-loc2)\n                loc = [min(loc1, loc2), 0]\n                height = -0.02\n            if mark == 'left':\n                height = abs(loc1-loc2)\n                loc = [-0.02, min(loc1, loc2), ]\n                width = 0.02\n            base.Rectangle(ax, loc, height, width, row[3], row[4])"
  },
  {
    "path": "build/lib/wgdi/ancestral_karyotype.py",
    "content": "import pandas as pd\nfrom Bio import SeqIO\nimport wgdi.base as base\n\n\nclass ancestral_karyotype:\n    def __init__(self, options):\n        self.mark = 'aak'\n        \n        # Set attributes from options\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n\n    def run(self):\n        # Load and filter data\n        gff = base.newgff(self.gff)\n        ancestor = base.read_classification(self.ancestor)\n        gff = gff[gff['chr'].isin(ancestor[0].values.tolist())]\n\n        # Create new gff copy and initialize required variables\n        newgff = gff.copy()\n        data, num = [], 1\n\n        # Create dictionary mapping chromosome to order\n        chr_arr = ancestor[3].drop_duplicates().to_list()\n        chr_dict = {chr: idx + 1 for idx, chr in enumerate(chr_arr)}\n        ancestor['order'] = ancestor[3].map(chr_dict)\n\n        dict1, dict2 = {}, {}\n\n        # Process ancestor and gff information\n        for (cla, order), group in ancestor.groupby([4, 'order'], sort=[False, False]):\n            for index, row in group.iterrows():\n                index1 = gff[(gff['chr'] == row[0]) & (gff['order'] >= row[1]) & (gff['order'] <= row[2])].index\n                newgff.loc[index1, 'chr'] = str(num)\n                \n                # Store results in data\n                for k in index1:\n                    data.append(newgff.loc[k, :].values.tolist() + [k])\n\n            dict1[str(num)] = cla\n            dict2[str(num)] = group[3].values[0]\n            num += 1\n\n        # Create dataframe from the data collected\n        df = pd.DataFrame(data)\n\n        # Filter based on peptide file\n        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, \"fasta\"))\n        df = df[df[6].isin(pep.keys())]\n\n        # Assign new names and order\n        for name, group in df.groupby(0):\n            df.loc[group.index, 'order'] = range(1, len(group) + 1)\n            df.loc[group.index, 
'newname'] = [f\"{self.mark}{name}g{i:05d}\" for i in range(1, len(group) + 1)]\n\n        # Set data types and sort\n        df['order'] = df['order'].astype(int)\n        df = df[[0, 'newname', 1, 2, 3, 'order', 6]].sort_values(by=[0, 'order'])\n\n        # Save output files\n        df.to_csv(self.ancestor_gff, sep=\"\\t\", index=False, header=None)\n        lens = df.groupby(0).max()[[2, 'order']]\n        lens.to_csv(self.ancestor_lens, sep=\"\\t\", header=None)\n\n        # Add extra columns and save final results\n        lens[1] = 1\n        lens['color'] = lens.index.map(dict2)\n        lens['class'] = lens.index.map(dict1)\n        lens[[1, 'order', 'color', 'class']].to_csv(self.ancestor_file, sep=\"\\t\", header=None)\n\n        # Update peptide sequences with new IDs and save\n        id_dict = df.set_index(6).to_dict()['newname']\n        seqs = []\n\n        for seq_record in SeqIO.parse(self.pep_file, \"fasta\"):\n            if seq_record.id in id_dict:\n                seq_record.id = id_dict[seq_record.id]\n                seqs.append(seq_record)\n\n        SeqIO.write(seqs, self.ancestor_pep, \"fasta\")\n"
  },
  {
    "path": "build/lib/wgdi/ancestral_karyotype_repertoire.py",
    "content": "\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\n\nimport wgdi.base as base\n\nclass ancestral_karyotype_repertoire():\n    def __init__(self, options):\n        self.gap = 5\n        self.direction = 0.01\n        self.mark = 'aak1s'\n        self.blockinfo_reverse = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n        self.blockinfo_reverse =  base.str_to_bool(self.blockinfo_reverse)\n\n    def run(self):\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n        bkinfo = pd.read_csv(self.blockinfo, index_col='id')\n        if self.blockinfo_reverse == True:\n            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]\n            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]\n        for index, row in bkinfo.iterrows():\n            block1, block2 = row['block1'].split('_'), row['block2'].split('_')\n            block1, block2 = [int(k) for k in block1], [int(k) for k in block2]\n            if int(block1[1])-int(block1[0]) < 0:\n                self.direction = -0.01\n            for i in range(1, len(block2)):\n                if abs(block1[i]-block1[i-1]) == 1 and abs(block2[i]-block2[i-1]) < int(self.gap):\n                    gff1_id = gff1[(gff1['chr'] == str(row['chr1'])) & (\n                        gff1['order'] == block1[i])].index[0]\n                    order = gff1.loc[gff1_id, 'order']\n                    gff1_row = gff1.loc[gff1_id, :].copy()\n                    for num in range(block2[i-1], block2[i]):\n                        order = order + self.direction\n                        id = gff2[(gff2['chr'] == str(row['chr2']))\n                                  & (gff2['order'] == num)].index[0]\n                        gff1_row['order'] = order\n                        gff1.loc[id, :] = gff1_row\n        df = gff1.copy()\n        df = df.sort_values(by=['chr', 'order'])\n        for name, group in 
df.groupby(['chr']):\n            df.loc[group.index, 'order'] = list(range(1, len(group)+1))\n            df.loc[group.index, 'newname'] = list(\n                [str(self.mark)+str(name)+'g'+str(i).zfill(5) for i in range(1, len(group)+1)])\n        df['order'] = df['order'].astype(int)\n        df['oldname'] = df.index\n        columns = ['chr', 'newname', 'start',\n                   'end', 'strand', 'order', 'oldname']\n        df[columns].to_csv(self.ancestor_gff, sep=\"\\t\",\n                           index=False, header=None)\n        lens = df.groupby('chr').max()[['end', 'order']]\n        lens['end'] = lens['end'].astype(np.int64)\n        lens.to_csv(self.ancestor_lens, sep=\"\\t\", header=None)\n        ancestor = base.read_classification(self.ancestor)\n        for index, row in ancestor.iterrows():\n            ancestor.at[index, 1] = 1\n            ancestor.at[index, 2] = lens.at[str(row[0]),'order']\n        ancestor.to_csv(self.ancestor_new, sep=\"\\t\", index=False, header=None)\n        id_dict = df['newname'].to_dict()\n        seqs = []\n        for seq_record in SeqIO.parse(self.ancestor_pep, \"fasta\"):\n            if seq_record.id in id_dict:\n                seq_record.id = id_dict[seq_record.id]\n            else:\n                continue\n            seq_record.description = ''\n            seqs.append(seq_record)\n        SeqIO.write(seqs, self.ancestor_pep_new, \"fasta\")\n"
  },
  {
    "path": "build/lib/wgdi/base.py",
    "content": "import configparser\nimport hashlib\nimport os\nimport re\n\nimport matplotlib\nimport matplotlib.patches as mpatches\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\n\nimport wgdi\n\n\ndef gen_md5_id(item):\n    \"\"\"Generate MD5 hash for the given item.\"\"\"\n    return hashlib.md5(item.encode('utf-8')).hexdigest()\n\n\ndef config():\n    \"\"\"Read configuration from the example conf.ini file.\"\"\"\n    conf = configparser.ConfigParser()\n    conf.read(os.path.join(wgdi.__path__[0], 'example/conf.ini'))\n    return conf.items('ini')\n\n\ndef load_conf(file, section):\n    \"\"\"Load configuration items from the specified section.\"\"\"\n    conf = configparser.ConfigParser()\n    conf.read(file)\n    return conf.items(section)\n\n\ndef rewrite(file, section):\n    \"\"\"Rewrite the configuration file to keep only the specified section.\"\"\"\n    conf = configparser.ConfigParser()\n    conf.read(file)\n    if conf.has_section(section):\n        for k in conf.sections():\n            if k != section:\n                conf.remove_section(k)\n        conf.write(open(os.path.join(wgdi.__path__[0], 'example/conf.ini'), 'w'))\n        print('Option ini has been modified')\n    else:\n        print('Option ini no change')\n\n\ndef read_colinearscan(file):\n    \"\"\"Read colinearscan output and parse into data structure.\"\"\"\n    data, b, flag, num = [], [], 0, 1\n    with open(file) as f:\n        for line in f:\n            line = line.strip()\n            if re.match(r\"the\", line):\n                num = re.search(r'\\d+', line).group()\n                b = []\n                flag = 1\n                continue\n            if re.match(r\"\\>LOCALE\", line):\n                flag = 0\n                p = re.split(':', line)\n                if b:\n                    data.append([num, b, p[1]])\n                b = []\n                continue\n            if flag == 1:\n                a = re.split(r\"\\s\", line)\n            
    b.append(a)\n    if b:\n        data.append([num, b, p[1]])\n    return data\n\n\ndef read_mcscanx(fn):\n    \"\"\"Read mcscanx output and parse into data structure.\"\"\"\n    with open(fn) as f1:\n        data, b = [], []\n        flag, num = 0, 0\n        for line in f1:\n            line = line.strip()\n            if re.match(r\"## Alignment\", line):\n                flag = 1\n                if not b:\n                    arr = re.findall(r\"[\\d+\\.]+\", line)[0]\n                    continue\n                data.append([num, b, 0])\n                b = []\n                num = re.findall(r\"\\d+\", line)[0]\n                continue\n            if flag == 0:\n                continue\n            a = re.split(r\"\\:\", line)\n            c = re.split(r\"\\s+\", a[1])\n            b.append([c[1], c[1], c[2], c[2]])\n        if b:\n            data.append([num, b, 0])\n    return data\n\n\ndef read_jcvi(fn):\n    \"\"\"Read jcvi output and parse into data structure.\"\"\"\n    with open(fn) as f1:\n        data, b = [], []\n        num = 1\n        for line in f1:\n            line = line.strip()\n            if re.match(r\"###\", line):\n                if b:\n                    data.append([num, b, 0])\n                    b = []\n                num += 1\n                continue\n            a = re.split(r\"\\t\", line)\n            b.append([a[0], a[0], a[1], a[1]])\n        if b:\n            data.append([num, b, 0])\n    return data\n\n\ndef read_collinearity(fn):\n    \"\"\"Read collinearity output and parse into data structure.\"\"\"\n    with open(fn) as f1:\n        data, b = [], []\n        flag, arr = 0, []\n        for line in f1:\n            line = line.strip()\n            if re.match(r\"# Alignment\", line):\n                flag = 1\n                if not b:\n                    arr = re.findall(r'[\\.\\d+]+', line)\n                    continue\n                data.append([arr[0], b, arr[2]])\n                b = []\n            
    arr = re.findall(r'[\\.\\d+]+', line)\n                continue\n            if flag == 0:\n                continue\n            b.append(re.split(r\"\\s\", line))\n        if b:\n            data.append([arr[0], b, arr[2]])\n    return data\n\n\ndef read_ks(file, col):\n    \"\"\"Read KS values from file and select specified column.\"\"\"\n    ks = pd.read_csv(file, sep='\\t')\n    ks.drop_duplicates(subset=['id1', 'id2'], keep='first', inplace=True)\n    ks[col] = ks[col].astype(float)\n    ks = ks[ks[col] >= 0]\n    ks.index = ks['id1'] + ',' + ks['id2']\n    return ks[col]\n\n\ndef get_median(data):\n    \"\"\"Calculate the median of the data list.\"\"\"\n    if not data:\n        return 0\n    data_sorted = sorted(data)\n    half = len(data_sorted) // 2\n    return (data_sorted[half] + data_sorted[-(half + 1)]) / 2\n\n\ndef cds_to_pep(cds_file, pep_file, fmt='fasta'):\n    \"\"\"Translate CDS sequences to peptide sequences and write to file.\"\"\"\n    records = list(SeqIO.parse(cds_file, fmt))\n    for rec in records:\n        rec.seq = rec.seq.translate()\n    SeqIO.write(records, pep_file, 'fasta')\n    return True\n\n\ndef newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):\n    \"\"\"Filter BLAST results based on score, evalue, and gene locations.\"\"\"\n    blast = pd.read_csv(file, sep=\"\\t\", header=None)\n    \n    if reverse == 'true':\n        blast[[0, 1]] = blast[[1, 0]]\n    blast = blast[(blast[11] >= score) & (blast[10] < evalue) & (blast[1] != blast[0])]\n    blast = blast[(blast[0].isin(gene_loc1.index)) & (blast[1].isin(gene_loc2.index))]\n    blast.drop_duplicates(subset=[0, 1], keep='first', inplace=True)\n    blast[0] = blast[0].astype(str)\n    blast[1] = blast[1].astype(str)\n    return blast\n\n\ndef newgff(file):\n    \"\"\"Read GFF file and rename columns with appropriate data types.\"\"\"\n    gff = pd.read_csv(file, sep=\"\\t\", header=None, index_col=1)\n    gff.rename(columns={0: 'chr', 2: 'start', 3: 'end', 4: 
'strand', 5: 'order'}, inplace=True)\n    gff['chr'] = gff['chr'].astype(str)\n    gff['start'] = gff['start'].astype(np.int64)\n    gff['end'] = gff['end'].astype(np.int64)\n    gff['strand'] = gff['strand'].astype(str)\n    gff['order'] = gff['order'].astype(int)\n    return gff\n\n\ndef newlens(file, position):\n    \"\"\"Read lens file and select position based on 'order' or 'end'.\"\"\"\n    lens = pd.read_csv(file, sep=\"\\t\", header=None, index_col=0)\n    lens.index = lens.index.astype(str)\n    if position == 'order':\n        lens = lens[2]\n    elif position == 'end':\n        lens = lens[1]\n    return lens\n\n\ndef read_classification(file):\n    \"\"\"Read classification data and convert columns to appropriate types.\"\"\"\n    classification = pd.read_csv(file, sep=\"\\t\", header=None)\n    classification[0] = classification[0].astype(str)\n    classification[1] = classification[1].astype(int)\n    classification[2] = classification[2].astype(int)\n    classification[3] = classification[3].astype(str)\n    classification[4] = classification[4].astype(int)\n    return classification\n\n\ndef gene_location(gff, lens, step, position):\n    \"\"\"Calculate gene locations based on lens and step.\"\"\"\n    gff = gff[gff['chr'].isin(lens.index)].copy()\n    if gff.empty:\n        print('Stopped! 
\\n\\nChromosomes in gff file and lens file do not correspond.')\n        exit(1)\n    dict_chr = dict(zip(lens.index, np.append(np.array([0]), lens.cumsum()[:-1].values)))\n    gff['loc'] = ''\n    for name, group in gff.groupby('chr'):\n        gff.loc[group.index, 'loc'] = (dict_chr[name] + group[position]) * step\n    return gff\n\n\ndef dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, genome2_name, arr, pad = 0):\n    \"\"\"Set up the dotplot frame with grid lines and labels.\"\"\"\n    for k in lens1.cumsum()[:-1] * step1:\n        ax.axhline(y=k, alpha=0.8, color='black', lw=0.5)\n    for k in lens2.cumsum()[:-1] * step2:\n        ax.axvline(x=k, alpha=0.8, color='black', lw=0.5)\n    align = dict(family='DejaVu Sans', style='italic', horizontalalignment=\"center\", verticalalignment=\"center\")\n    yticks = lens1.cumsum() * step1 - 0.5 * lens1 * step1\n    ax.set_yticks(yticks)\n    ax.set_yticklabels(lens1.index, fontsize = 13, family='DejaVu Sans', style='normal')\n    ax.tick_params(axis='y', which='major', pad = pad)\n    ax.tick_params(axis='x', which='major', pad = pad)\n    xticks = lens2.cumsum() * step2 - 0.5 * lens2 * step2\n    ax.set_xticks(xticks)\n    ax.set_xticklabels(lens2.index, fontsize = 13, family='DejaVu Sans', style='normal')\n    ax.xaxis.set_ticks_position('none')\n    ax.yaxis.set_ticks_position('none')\n    if arr[0] <= 0:\n        ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)\n    else:\n        ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)\n    if arr[1] < 0:\n        ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)\n    else:\n        ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)\n\ndef Bezier3(plist, t):\n    \"\"\"Calculate a quadratic Bezier curve (three control points).\"\"\"\n    p0, p1, p2 = plist\n    return p0 * (1 - t) ** 2 + 2 * p1 * t * (1 - t) + p2 * t ** 2\n\n\ndef Bezier4(plist, 
t):\n    \"\"\"Calculate Bezier curve of degree 4.\"\"\"\n    p0, p1, p2, p3, p4 = plist\n    return p0 * (1 - t) ** 4 + 4 * p1 * t * (1 - t) ** 3 + 6 * p2 * t ** 2 * (1 - t) ** 2 + 4 * p3 * (1 - t) * t ** 3 + p4 * t ** 4\n\n\ndef Rectangle(ax, loc, height, width, color, alpha):\n    \"\"\"Draw a rectangle on the axes with specified properties.\"\"\"\n    p = mpatches.Rectangle(loc, width, height, edgecolor=None, facecolor=color, alpha=alpha)\n    ax.add_patch(p)\n\ndef str_to_bool(s):\n    if isinstance(s, bool):\n        return s \n    return str(s).strip().lower() == 'true'"
  },
  {
    "path": "build/lib/wgdi/block_correspondence.py",
    "content": "import re\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\nclass block_correspondence():\n    def __init__(self, options):\n        # Default values\n        self.tandem = True\n        self.pvalue = 0.2\n        self.position = 'order'\n        self.block_length = 5\n        self.tandem_length = 200\n        self.tandem_ratio = 1\n        self.ks_hit = 0.5\n\n        # Set user-defined options\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n\n        # Parse ks_area and homo if present\n        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]\n        self.homo = [float(k) for k in self.homo.split(',')]\n        self.tandem_ratio = float(self.tandem_ratio)\n        self.tandem = base.str_to_bool(self.tandem)\n\n    def run(self):\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        \n        # Load block information from CSV\n        bkinfo = pd.read_csv(self.blockinfo)\n        bkinfo = self.preprocess_blockinfo(bkinfo, lens1, lens2)\n        \n        # Initialize correspondence DataFrame\n        cor = self.initialize_correspondence(lens1, lens2)\n        \n        # If no tandem allowed, remove tandem regions\n        if not self.tandem:\n            bkinfo = self.remove_tandem(bkinfo)\n        \n        # Remove low KS hits\n        bkinfo = self.remove_ks_hit(bkinfo)\n\n        # Find collinearity regions and save results\n        collinear_indices = self.collinearity_region(cor, bkinfo, lens1)\n        bkinfo.loc[bkinfo.index.isin(collinear_indices), :].to_csv(self.savefile, index=False)\n\n    def preprocess_blockinfo(self, bkinfo, lens1, lens2):\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        \n        # Filter by length, chromosome indices, and p-value\n        bkinfo = bkinfo[(bkinfo['length'] >= 
int(self.block_length)) & \n                        (bkinfo['chr1'].isin(lens1.index)) & \n                        (bkinfo['chr2'].isin(lens2.index)) & \n                        (bkinfo['pvalue'] <= float(self.pvalue))]\n        \n        # Filter by tandem ratio if the column exists\n        if 'tandem_ratio' in bkinfo.columns:\n            bkinfo = bkinfo[bkinfo['tandem_ratio'] <= self.tandem_ratio]\n        \n        return bkinfo\n\n    def initialize_correspondence(self, lens1, lens2):\n        # Create correspondence DataFrame with initial values\n        cor = [[k, i, 0, lens1[i], j, 0, lens2[j], float(self.homo[0]), float(self.homo[1])] \n               for k in range(1, int(self.multiple) + 1) \n               for i in lens1.index \n               for j in lens2.index]\n        \n        cor = pd.DataFrame(cor, columns=['sub', 'chr1', 'start1', 'end1', 'chr2', 'start2', 'end2', 'homo1', 'homo2'])\n        cor['chr1'] = cor['chr1'].astype(str)\n        cor['chr2'] = cor['chr2'].astype(str)\n        \n        return cor\n\n    def remove_tandem(self, bkinfo):\n        # Remove tandem regions from the DataFrame\n        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()\n        group['start'] = group['start1'] - group['start2']\n        group['end'] = group['end1'] - group['end2']\n        tandem_condition = (group['start'].abs() <= int(self.tandem_length)) | (group['end'].abs() <= int(self.tandem_length))\n        index_to_remove = group[tandem_condition].index\n        return bkinfo.drop(index_to_remove)\n\n    def remove_ks_hit(self, bkinfo):\n        # Remove records with insufficient KS hits\n        for index, row in bkinfo.iterrows():\n            ks = self.get_ks_value(row['ks'])\n            ks_ratio = len([k for k in ks if self.ks_area[0] <= k <= self.ks_area[1]]) / len(ks)\n            if ks_ratio < self.ks_hit:\n                bkinfo.drop(index, inplace=True)\n        return bkinfo\n\n    def get_ks_value(self, ks_str):\n        # Extract 
and return KS values as floats\n        ks = ks_str.split('_')\n        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))\n        return ks\n\n    def collinearity_region(self, cor, bkinfo, lens):\n        collinear_indices = []\n        for (chr1, chr2), group in bkinfo.groupby(['chr1', 'chr2']):\n            group = group.sort_values(by=['length'], ascending=False)\n            df = pd.Series(0, index=range(1, int(lens[str(chr1)]) + 1))\n            for index, row in group.iterrows():\n                # Check homology conditions\n                if not self.is_valid_homo(row):\n                    continue\n                # Mark the block's positions and compute the fraction that is newly covered\n                b1 = [int(k) for k in row['block1'].split('_')]\n                df1 = df.copy()\n                df1[b1] += 1\n                ratio = (len(df1[df1 > 0]) - len(df[df > 0])) / len(b1)\n                if ratio < 0.5:\n                    continue\n                df[b1] += 1\n                collinear_indices.append(index)\n        \n        return collinear_indices\n\n    def is_valid_homo(self, row):\n        # Check if the homology value is within the specified range;\n        # `multiple` may be a str or an int, so cast before building the column name\n        return self.homo[0] <= row['homo' + str(self.multiple)] <= self.homo[1]\n"
  },
  {
    "path": "build/lib/wgdi/block_info.py",
    "content": "import numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass block_info:\n    def __init__(self, options):\n        self.repeat_number = 20\n        self.ks_col = 'ks_NG86'\n        self.blast_reverse = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n        \n        self.repeat_number = int(self.repeat_number)\n        self.blast_reverse = base.str_to_bool(self.blast_reverse)\n\n    def block_position(self, collinearity, blast, gff1, gff2, ks):\n        data = []\n        for block in collinearity:\n            blk_homo, blk_ks = [], []\n\n            # Skip blocks with missing gene coordinates in GFF files\n            if block[1][0][0] not in gff1.index or block[1][0][2] not in gff2.index:\n                continue\n            \n            # Extract chromosome info\n            chr1, chr2 = gff1.at[block[1][0][0], 'chr'], gff2.at[block[1][0][2], 'chr']\n            \n            # Extract start and end positions\n            array1, array2 = [float(i[1]) for i in block[1]], [float(i[3]) for i in block[1]]\n            start1, end1 = array1[0], array1[-1]\n            start2, end2 = array2[0], array2[-1]\n            \n            block1, block2 = [], []\n            for k in block[1]:\n                block1.append(int(float(k[1])))\n                block2.append(int(float(k[3])))\n                \n                # Check for KS values\n                pair_ks = self.get_ks_value(ks, k)\n                blk_ks.append(pair_ks)\n\n                # Retrieve blast homo data\n                if k[0]+\",\"+k[2] in blast.index:\n                    blk_homo.append(blast.loc[k[0]+\",\"+k[2], [f'homo{i}' for i in range(1, 6)]].values.tolist())\n            \n            ks_median, ks_average = self.calculate_ks_statistics(blk_ks)\n            homo = self.calculate_homo_statistics(blk_homo)\n\n            blkks = '_'.join([str(k) for k in blk_ks])\n            block1 = 
'_'.join([str(k) for k in block1])\n            block2 = '_'.join([str(k) for k in block2])\n            \n            # Calculate tandem ratio\n            tandem_ratio = self.tandem_ratio(blast, gff2, block[1])\n            \n            # Store the results\n            data.append([\n                block[0], chr1, chr2, start1, end1, start2, end2, block[2], len(block[1]), \n                ks_median, ks_average, *homo, block1, block2, blkks, tandem_ratio\n            ])\n        \n        # Create a DataFrame with the results\n        data_df = pd.DataFrame(data, columns=[\n            'id', 'chr1', 'chr2', 'start1', 'end1', 'start2', 'end2', 'pvalue', 'length', \n            'ks_median', 'ks_average', 'homo1', 'homo2', 'homo3', 'homo4', 'homo5', \n            'block1', 'block2', 'ks', 'tandem_ratio'\n        ])\n\n        # Calculate density\n        data_df['density1'] = data_df['length'] / ((data_df['end1'] - data_df['start1']).abs() + 1)\n        data_df['density2'] = data_df['length'] / ((data_df['end2'] - data_df['start2']).abs() + 1)\n\n        return data_df\n\n    def get_ks_value(self, ks, k):\n        \"\"\"Return KS value for the given pair of genes.\"\"\"\n        pair = f\"{k[0]},{k[2]}\"\n        if pair in ks.index:\n            return ks[pair]\n        pair_rev = f\"{k[2]},{k[0]}\"\n        if pair_rev in ks.index:\n            return ks[pair_rev]\n        return -1\n\n    def calculate_ks_statistics(self, blk_ks):\n        \"\"\"Calculate KS statistics: median and average.\"\"\"\n        ks_arr = [k for k in blk_ks if k >= 0]\n        if len(ks_arr) == 0:\n            return -1, -1\n        ks_median = base.get_median(ks_arr)\n        ks_average = sum(ks_arr) / len(ks_arr)\n        return ks_median, ks_average\n\n    def calculate_homo_statistics(self, blk_homo):\n        \"\"\"Calculate homo statistics by averaging across all blocks.\"\"\"\n        df = pd.DataFrame(blk_homo)\n        homo = df.mean().values if len(df) > 0 else [-1, -1, -1, 
-1, -1]\n        return homo\n\n    def blast_homo(self, blast, gff1, gff2, repeat_number):\n        \"\"\"Assign homo values based on blast data.\"\"\"\n        index = [group.sort_values(by=11, ascending=False)[:repeat_number].index.tolist() for name, group in blast.groupby([0])]\n        blast = blast.loc[np.concatenate([k[:repeat_number] for k in index], dtype=object), [0, 1]]\n        blast = blast.assign(homo1=np.nan, homo2=np.nan, homo3=np.nan, homo4=np.nan, homo5=np.nan)\n\n        # Assign homo values\n        for i in range(1, 6):\n            bluenum = i + 5\n            redindex = np.concatenate([k[:i] for k in index], dtype=object)\n            blueindex = np.concatenate([k[i:bluenum] for k in index], dtype=object)\n            grayindex = np.concatenate([k[bluenum:repeat_number] for k in index], dtype=object)\n            blast.loc[redindex, f'homo{i}'] = 1\n            blast.loc[blueindex, f'homo{i}'] = 0\n            blast.loc[grayindex, f'homo{i}'] = -1\n        \n        blast['chr1_order'] = blast[0].map(gff1['order'])\n        blast['chr2_order'] = blast[1].map(gff2['order'])\n        return blast\n\n    def tandem_ratio(self, blast, gff2, block):\n        \"\"\"Calculate tandem ratio for a block.\"\"\"\n        block = pd.DataFrame(block)[[0, 2]].rename(columns={0: 'id1', 2: 'id2'})\n        block['order2'] = block['id2'].map(gff2['order'])\n\n        # Filter block_blast data\n        block_blast = blast[(blast[0].isin(block['id1'].values)) & (blast[1].isin(block['id2'].values))].copy()\n        block_blast = pd.merge(block_blast, block, left_on=0, right_on='id1', how='left')\n        block_blast['difference'] = (block_blast['chr2_order'] - block_blast['order2']).abs()\n\n        # Filter based on difference and calculate ratio\n        block_blast = block_blast[(block_blast['difference'] <= self.repeat_number) & (block_blast['difference'] > 0)]\n        return len(block_blast[0].unique()) / len(block) * len(block_blast) / (len(block) + 
len(block_blast))\n\n    def run(self):\n        \"\"\"Main function to run the analysis.\"\"\"\n        # Initialize required datasets\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n\n        # Filter GFF files based on chromosome indices\n        gff1 = gff1[gff1['chr'].isin(lens1.index)]\n        gff2 = gff2[gff2['chr'].isin(lens2.index)]\n\n        # Load blast data\n        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)\n        blast = self.blast_homo(blast, gff1, gff2, self.repeat_number)\n        blast.index = blast[0] + ',' + blast[1]\n\n        # Get collinearity data\n        collinearity = self.auto_file(gff1, gff2)\n\n        # Load ks data if a file was given; a missing, empty, or 'none' option means no Ks values\n        ks_path = getattr(self, 'ks', 'none')\n        ks = pd.Series([], dtype=float) if ks_path in ('none', '') else base.read_ks(ks_path, self.ks_col)\n\n        # Get the block position data\n        data = self.block_position(collinearity, blast, gff1, gff2, ks)\n        data['class1'] = 0\n        data['class2'] = 0\n\n        # Save results\n        data.to_csv(self.savefile, index=None)\n\n    def auto_file(self, gff1, gff2):\n        \"\"\"Auto-detect and read collinearity file.\"\"\"\n        with open(self.collinearity) as f:\n            p = ' '.join(f.readlines()[0:30])\n        \n        # Handle different file formats\n        if 'path length' in p or 'MAXIMUM GAP' in p:\n            return base.read_colinearscan(self.collinearity)\n        elif 'MATCH_SIZE' in p or '## Alignment' in p:\n            return self.process_mcscanx(gff1, gff2)\n        elif '# Alignment' in p:\n            return base.read_collinearity(self.collinearity)\n        elif '###' in p:\n            return self.process_jcvi(gff1, gff2)\n\n    def process_mcscanx(self, gff1, gff2):\n        \"\"\"Process MCScanX format collinearity 
data.\"\"\"\n        col = base.read_mcscanx(self.collinearity)\n        collinearity = []\n        for block in col:\n            newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]\n            if newblock:\n                for k in newblock:\n                    k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']\n                collinearity.append([block[0], newblock, block[2]])\n        return collinearity\n\n    def process_jcvi(self, gff1, gff2):\n        \"\"\"Process JCVI format collinearity data.\"\"\"\n        col = base.read_jcvi(self.collinearity)\n        collinearity = []\n        for block in col:\n            newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]\n            if newblock:\n                for k in newblock:\n                    k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']\n                collinearity.append([block[0], newblock, block[2]])\n        return collinearity\n"
  },
  {
    "path": "build/lib/wgdi/block_ks.py",
    "content": "import re\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass block_ks:\n    def __init__(self, options):\n        # Default parameters\n        self.markersize = 0.8\n        self.figsize = 'default'\n        self.tandem_length = 200\n        self.blockinfo_reverse = False\n        self.tandem = False\n        self.area = [0, 3]\n        self.position = 'order'\n        self.ks_col = 'ks_NG86'\n        self.pvalue = 0.01\n        \n        # Overriding default parameters with options\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n        \n        # Parsing area as a float list\n        self.area = [float(k) for k in str(self.area).split(',')]\n        self.markersize =  float(self.markersize)\n        self.tandem_length =  int(self.tandem_length)\n        \n        self.blockinfo_reverse =  base.str_to_bool(self.blockinfo_reverse)\n        self.remove_tandem =  base.str_to_bool(self.remove_tandem)\n\n    def block_position(self, bkinfo, lens1, lens2, step1, step2):\n        pos, pairs = [], []\n        \n        # Create mappings for chromosome positions\n        dict_y_chr = dict(zip(lens1.index, np.append([0], lens1.cumsum()[:-1].values)))\n        dict_x_chr = dict(zip(lens2.index, np.append([0], lens2.cumsum()[:-1].values)))\n        \n        # Iterate through block information\n        for _, row in bkinfo.iterrows():\n            block1 = row['block1'].split('_')\n            block2 = row['block2'].split('_')\n            ks = row['ks'].split('_')\n            \n            locy_median = (dict_y_chr[row['chr1']] + 0.5 * (row['end1'] + row['start1'])) * step1\n            locx_median = (dict_x_chr[row['chr2']] + 0.5 * (row['end2'] + row['start2'])) * step2\n            pos.append([locx_median, locy_median, row['ks_median']])\n            \n            # Ensure ks length matches block length\n            if len(block1) != 
len(ks):\n                ks = ks[1:]\n                \n            for i in range(len(block1)):\n                locy = (dict_y_chr[row['chr1']] + float(block1[i])) * step1\n                locx = (dict_x_chr[row['chr2']] + float(block2[i])) * step2\n                pairs.append([locx, locy, float(ks[i])])\n        \n        return pos, pairs\n\n    def remove_tandem(self, bkinfo):\n        # Filter for same-chromosome blocks\n        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()\n        \n        # Calculate block start and end differences\n        group['start'] = group['start1'] - group['start2']\n        group['end'] = group['end1'] - group['end2']\n        \n        # Remove tandems based on threshold\n        index = group[(group['start'].abs() <= self.tandem_length) |\n                      (group['end'].abs() <= self.tandem_length)].index\n        return bkinfo.drop(index)\n\n    def run(self):\n        # Initialize axis and chromosome lens\n        axis = [0, 1, 1, 0]\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        \n        # Parse figsize\n        if re.search(r'\\d', self.figsize):\n            self.figsize = [float(k) for k in self.figsize.split(',')]\n        else:\n            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10\n        \n        # Calculate step sizes\n        step1 = 1 / float(lens1.sum())\n        step2 = 1 / float(lens2.sum())\n        \n        # Create figure and axes\n        fig, ax = plt.subplots(figsize=self.figsize)\n        plt.rcParams['ytick.major.pad'] = 0\n        ax.xaxis.set_ticks_position('top')\n        \n        # Plot dotplot frame\n        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,\n                           self.genome1_name, self.genome2_name, [0, 1])\n        \n        # Load block information\n        bkinfo = pd.read_csv(self.blockinfo)\n        \n        # Handle reverse block 
information\n        if self.blockinfo_reverse == True:\n            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]\n            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]\n        \n        # Filter block information\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & \n                        (bkinfo['chr1'].isin(lens1.index)) & \n                        (bkinfo['chr2'].isin(lens2.index)) & \n                        (bkinfo['pvalue'] < float(self.pvalue))]\n        \n        # Remove tandem duplicates if required\n        if self.tandem == False:\n            bkinfo = self.remove_tandem(bkinfo)\n        \n        # Calculate positions and pairs\n        pos, pairs = self.block_position(bkinfo, lens1, lens2, step1, step2)\n        \n        # Filter pairs by ks value\n        df = pd.DataFrame(pairs, columns=['loc1', 'loc2', 'ks'])\n        df = df[(df['ks'] >= self.area[0]) & (df['ks'] <= self.area[1])]\n        df.drop_duplicates(inplace=True)\n        \n        # Plot scatter, colored by Ks\n        cm = plt.get_cmap('gist_rainbow')\n        sc = plt.scatter(df['loc1'], df['loc2'], s=self.markersize, c=df['ks'],\n                         alpha=0.9, edgecolors=None, linewidths=0, marker='o', \n                         vmin=self.area[0], vmax=self.area[1], cmap=cm)\n        \n        # Add colorbar\n        cbar = fig.colorbar(sc, shrink=0.5, pad=0.03, fraction=0.1)\n        align = dict(family='DejaVu Sans', style='normal',\n                     horizontalalignment=\"center\", verticalalignment=\"center\")\n        cbar.set_label('Ks', labelpad=12.5, fontsize=16, **align)\n        \n        # Set axis and save figure\n        ax.axis(axis)\n        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.03)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n"
  },
  {
    "path": "build/lib/wgdi/circos.py",
    "content": "import re\nimport sys\n\nimport matplotlib as mpl\nimport matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass circos():\n    def __init__(self, options):\n        self.figsize = '10,10'\n        self.position = 'order'\n        self.label_size = 9\n        self.label_radius = 0.015\n        self.column_names = [None]*100\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n        self.ring_width = float(self.ring_width)\n        if hasattr(self, 'legend_square'):\n            self.legend_square = [float(k)\n                                  for k in self.legend_square.split(',')]\n        else:\n            self.legend_square = 0.04, 0.04\n\n    def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, linestyle='-'):\n        for k in loc_chr:\n            start, end = loc_chr[k]\n            t = np.arange(start, end, 0.005)\n            x, y = (radius) * np.cos(t), (radius) * np.sin(t)\n            plt.plot(x, y, linestyle=linestyle,\n                     color=color, lw=lw, alpha=alpha)\n\n    def plot_labels(self, root, labels, loc_chr, radius, horizontalalignment=\"center\", verticalalignment=\"center\", fontsize=6,\n                    color='black'):\n        for k in loc_chr:\n            loc = sum(loc_chr[k]) * 0.5\n            x, y = radius * np.cos(loc), radius * np.sin(loc)\n            self.Wedge(root, (x, y), self.label_radius, 0,\n                       360, self.label_radius, 'white', 1)\n            if 1 * np.pi < loc < 2 * np.pi:\n                loc += np.pi\n            plt.text(x, y, labels[k], horizontalalignment=horizontalalignment, verticalalignment=verticalalignment,\n                     fontsize=fontsize, color=color, rotation=0)\n\n    def Wedge(self, ax, loc, radius, start, end, width, color, alpha):\n    
    p = mpatches.Wedge(loc, radius, start, end, width=width,\n                           edgecolor=None, facecolor=color, alpha=alpha)\n        ax.add_patch(p)\n\n    def plot_bar(self, df, radius, length, lw, color, alpha):\n        for k in df[df.columns[0]].drop_duplicates().values:\n            if str(k) not in color.keys():\n                color[str(k)] = 'black'\n            if k in ['', np.nan]:\n                continue\n            df_chr = df.groupby(df.columns[0]).get_group(k)\n            x1, y1 = radius * \\\n                np.cos(df_chr['rad']), radius * np.sin(df_chr['rad'])\n            x2, y2 = (radius + length) * \\\n                np.cos(df_chr['rad']), (radius + length) * \\\n                np.sin(df_chr['rad'])\n            x = np.array(\n                [x1.values, x2.values, [np.nan] * x1.size]).flatten('F')\n            y = np.array(\n                [y1.values, y2.values, [np.nan] * x1.size]).flatten('F')\n            plt.plot(x, y, linestyle='-',\n                     color=color[str(k)], lw=lw, alpha=alpha)\n\n    def chr_location(self, lens, angle_gap, angle):\n        start, end, loc_chr = 0, 0.2*angle_gap, {}\n        for k in lens.index:\n            end += angle_gap + angle * (float(lens[k]))\n            start = end - angle * (float(lens[k]))\n            loc_chr[k] = [float(start), float(end)]\n        return loc_chr\n\n    def deal_alignment(self, alignment, gff, lens, loc_chr, angle):\n        alignment.replace('\\s+', '', inplace=True)\n        alignment.replace('.', '', inplace=True)\n        print(alignment.dropna(subset=[2, 3],how='all'))\n        # exit(0)\n        newalignment = alignment.copy()\n        for i in range(len(alignment.columns)):\n            alignment[i] = alignment[i].astype(str)\n            newalignment[i] = alignment[i].map(gff['chr'].to_dict())\n        newalignment['loc'] = alignment[0].map(gff[self.position].to_dict())\n        newalignment[0] = newalignment[0].astype('str')\n        
newalignment['loc'] = newalignment['loc'].astype('float')\n        newalignment = newalignment[newalignment[0].isin(lens.index) == True]\n        newalignment['rad'] = np.nan\n        for name, group in newalignment.groupby(0):\n            if str(name) not in loc_chr:\n                continue\n            newalignment.loc[group.index, 'rad'] = loc_chr[str(\n                name)][0]+angle * group['loc']\n        return newalignment\n\n    def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al):\n        # Strip whitespace inside cells (regex match) and treat '.' cells as missing\n        alignment.replace(r'\\s+', '', regex=True, inplace=True)\n        alignment.replace('.', np.nan, inplace=True)\n        newalignment = pd.merge(alignment, gff, left_on=0, right_on=gff.index)\n        newalignment['rad'] = np.nan\n        for name, group in newalignment.groupby('chr'):\n            if str(name) not in loc_chr:\n                continue\n            newalignment.loc[group.index, 'rad'] = loc_chr[str(\n                name)][0]+angle * group[self.position]\n        newalignment.index = newalignment[0]\n        newalignment[0] = newalignment[0].map(newalignment['rad'].to_dict())\n        data = []\n        for index_al, row_al in al.iterrows():\n            for k in alignment.columns[1:]:\n                alignment[k] = alignment[k].astype(str)\n                group = newalignment[(newalignment['chr'] == row_al['chr']) & (\n                    newalignment['order'] >= row_al['start']) & (newalignment['order'] <= row_al['end'])].copy()\n                group.loc[:, k] = group.loc[:, k].map(\n                    newalignment['rad']).values\n                group.dropna(subset=[k], inplace=True)\n                group.index = group.index.map(newalignment['rad'].to_dict())\n                group['color'] = row_al['color']\n                group = group[group[k].notnull()]\n                data += group[[0, k, 'color']].values.tolist()\n        df = pd.DataFrame(data, columns=['loc1', 'loc2', 
'color'])\n        return df\n\n    def plot_collinearity(self, data, radius, lw=0.02, alpha=1):\n        for name, group in data.groupby('color'):\n            x, y = np.array([]), np.array([])\n            for index, row in group.iterrows():\n                ex1x, ex1y = radius * \\\n                    np.cos(row['loc1']), radius*np.sin(row['loc1'])\n                ex2x, ex2y = radius * \\\n                    np.cos(row['loc2']), radius*np.sin(row['loc2'])\n                ex3x, ex3y = radius * (1-abs(row['loc1']-row['loc2'])/np.pi) * np.cos((row['loc1']+row['loc2'])*0.5), radius * (\n                    1-abs(row['loc1']-row['loc2'])/np.pi) * np.sin((row['loc1']+row['loc2'])*0.5)\n                x1 = [ex1x, 0.5*ex3x, ex2x]\n                y1 = [ex1y, 0.5*ex3y, ex2y]\n                step = .002\n                t = np.arange(0, 1+step, step)\n                xt = base.Bezier3(x1, t)\n                yt = base.Bezier3(y1, t)\n                x = np.hstack((x, xt, np.nan))\n                y = np.hstack((y, yt, np.nan))\n            plt.plot(x, y, color=name, lw=lw, alpha=alpha)\n\n    def plot_legend(self, ax, chr_color, width, height):\n        (x1, x2) = ax.get_xlim()\n        (y1, y2) = ax.get_ylim()\n        a = 1000\n        for k, v in enumerate(chr_color.keys(), 0):\n            h = y1-k//a*height*2\n            k = k % a\n            if x1 + width * k > x2-width:\n                a = k\n                h = y1-k//a*height*2\n                k = k % a\n            loc = [x1 + width * k, h]\n            base.Rectangle(ax, loc, height, width, chr_color[v], 1)\n            plt.text(loc[0] + width*0.382, h-0.618*height, v, fontsize=12)\n        ax.set_ylim(h-2*height, y2)\n\n    def run(self):\n        fig, ax = plt.subplots(figsize=self.figsize)\n        mpl.rcParams['agg.path.chunksize'] = 100000000\n        lens = base.newlens(self.lens, self.position)\n        radius, angle_gap = float(self.radius), float(self.angle_gap)\n        angle = (2 * np.pi - 
(int(len(lens))+1.5)\n                 * angle_gap) / (int(lens.sum()))\n        loc_chr = self.chr_location(lens, angle_gap, angle)\n        list_colors = [str(k).strip() for k in re.split(',|:', self.colors)]\n        chr_color = dict(zip(list_colors[::2], list_colors[1::2]))\n        gff = base.newgff(self.gff)\n        if hasattr(self, 'ancestor'):\n            ancestor = pd.read_csv(self.ancestor, header=None)\n            al = pd.read_csv(self.ancestor_location, sep='\\t', header=None)\n            al.rename(columns={0: 'chr', 1: 'start',\n                               2: 'end', 3: 'color'}, inplace=True)\n            al['chr'] = al['chr'].astype(str)\n            data = self.deal_ancestor(ancestor, gff, lens, loc_chr, angle, al)\n            self.plot_collinearity(data, radius, lw=0.1, alpha=0.8)\n\n        if hasattr(self, 'alignment'):\n            alignment = pd.read_csv(self.alignment, header=None)\n            print(alignment)\n            newalignment = self.deal_alignment(\n                alignment, gff, lens, loc_chr, angle)\n            if ',' in self.column_names:\n                names = [str(k) for k in self.column_names.split(',')]\n            else:\n                names = [None]*len(newalignment.columns)\n            n = 0\n            align = dict(family='Arial', verticalalignment=\"center\",\n                         horizontalalignment=\"center\")\n            print(newalignment)\n            for k, v in enumerate(newalignment.columns[1:-2]):\n                r = radius + self.ring_width*(k+1)\n                print(k,v,r)\n                self.plot_circle(loc_chr, r, lw=0.5, alpha=1, color='grey')\n                self.plot_bar(newalignment[[v, 'rad']], r + self.ring_width *\n                              0.15, self.ring_width*0.7, 0.15, chr_color, 1)\n                if n % 2 == 0:\n                    loc = 0.05\n                    x, y = (r+self.ring_width*0.5) * \\\n                        np.cos(loc), (r+self.ring_width*0.5) * 
np.sin(loc)\n                    plt.text(x, y, names[n], rotation=loc *\n                             180 / np.pi, fontsize=self.label_size, **align)\n                else:\n                    loc = -0.08\n                    x, y = (r+self.ring_width*0.5) * \\\n                        np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)\n                    plt.text(x, y, names[n], fontsize=self.label_size,\n                             rotation=loc * 180 / np.pi, **align)\n                n += 1\n        if hasattr(self, 'ancestor'):\n            colors = al['color'].drop_duplicates().values.tolist()\n            ancestor_chr_color = dict(zip(range(1, len(colors)+1), colors))\n            self.plot_legend(ax, ancestor_chr_color,\n                             self.legend_square[0], self.legend_square[1])\n        if hasattr(self, 'alignment'):\n            del chr_color['nan']\n            self.plot_legend(\n                ax, chr_color, self.legend_square[0], self.legend_square[1])\n        labels = self.chr_label + lens.index\n        labels = dict(zip(lens.index, labels))\n        self.plot_labels(ax, labels, loc_chr, radius +\n                         self.ring_width*0.3, fontsize=self.label_size)\n\n        plt.axis('off')\n        a = (ax.get_ylim()[1]-ax.get_ylim()[0]) / \\\n            (ax.get_xlim()[1]-ax.get_xlim()[0])\n        fig.set_size_inches(self.figsize[0], self.figsize[0]*a, forward=True)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n        sys.exit(0)\n"
  },
  {
    "path": "build/lib/wgdi/collinearity.py",
    "content": "import numpy as np\nimport pandas as pd\n\n\nclass collinearity:\n    def __init__(self, options, points):\n        # Default values\n        self.gap_penalty = -1\n        self.over_length = 0\n        self.mg1 = 40\n        self.mg2 = 40\n        self.pvalue = 1\n        self.over_gap = 3\n        self.points = points\n        self.p_value = 0\n        self.coverage_ratio = 0.8\n        \n        # Set user-defined options\n        for k, v in options:\n            setattr(self, str(k), v)\n\n        # Initialize grading and mg values\n        self.grading = [50, 40, 25] if not hasattr(self, 'grading') else [int(k) for k in self.grading.split(',')]\n        self.mg1, self.mg2 = [40, 40] if not hasattr(self, 'mg') else [int(k) for k in self.mg.split(',')]\n\n        # Convert string values to floats\n        self.pvalue = float(self.pvalue)\n        self.coverage_ratio = float(self.coverage_ratio)\n\n    def get_matrix(self):\n        \"\"\"Initialize the matrix for the collinearity points.\"\"\"\n        self.points['usedtimes1'] = 0\n        self.points['usedtimes2'] = 0\n        self.points['times'] = 1\n        self.points['score1'] = self.points['grading']\n        self.points['score2'] = self.points['grading']\n        self.points['path1'] = self.points.index.to_numpy().reshape(len(self.points), 1).tolist()\n        self.points['path2'] = self.points['path1']\n        self.points_init = self.points.copy()\n        self.mat_points = self.points\n\n    def run(self):\n        \"\"\"Run the main collinearity processing.\"\"\"\n        self.get_matrix()\n        self.score_matrix()\n        data = []\n\n        # Process points for maxPath in the positive direction\n        points1 = self.points[['loc1', 'loc2', 'score1', 'path1', 'usedtimes1']].sort_values(by=['score1'], ascending=False)\n        points1.drop(index=points1[points1['usedtimes1'] < 1].index, inplace=True)\n        points1.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']\n  
      \n        while (self.over_length >= self.over_gap or len(points1) >= self.over_gap):\n            if self.max_path(points1):\n                if self.p_value > self.pvalue:\n                    continue\n                data.append([self.path, self.p_value, self.score])\n\n        # Process points for maxPath in the negative direction\n        points2 = self.points[['loc1', 'loc2', 'score2', 'path2', 'usedtimes2']].sort_values(by=['score2'], ascending=False)\n        points2.drop(index=points2[points2['usedtimes2'] < 1].index, inplace=True)\n        points2.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']\n\n        while (self.over_length >= self.over_gap) or (len(points2) >= self.over_gap):\n            if self.max_path(points2):\n                if self.p_value > self.pvalue:\n                    continue\n                data.append([self.path, self.p_value, self.score])\n\n        return data\n\n    def score_matrix(self):\n        \"\"\"Calculate the scoring matrix for the points.\"\"\"\n        for index, row, col in self.points[['loc1', 'loc2']].itertuples():\n            # Get points within a certain range\n            points = self.points[(self.points['loc1'] > row) & \n                                 (self.points['loc2'] > col) & \n                                 (self.points['loc1'] < row + self.mg1) & \n                                 (self.points['loc2'] < col + self.mg2)]\n            \n            row_i_old, gap = row, self.mg2\n            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():\n                if col_j - col > gap and row_i > row_i_old:\n                    break\n                score = grading + (row_i - row + col_j - col) * self.gap_penalty\n                score1 = score + self.points.at[index, 'score1']\n                if score > 0 and self.points.at[index_ij, 'score1'] < score1:\n                    self.points.at[index_ij, 'score1'] = score1\n                    
self.points.at[index, 'usedtimes1'] += 1\n                    self.points.at[index_ij, 'usedtimes1'] += 1\n                    self.points.at[index_ij, 'path1'] = self.points.at[index, 'path1'] + [index_ij]\n                    gap = min(col_j - col, gap)\n                    row_i_old = row_i\n\n        # Reverse processing to handle negative direction\n        points_reverse = self.points.sort_values(by=['loc1', 'loc2'], ascending=[False, True])\n        for index, row, col in points_reverse[['loc1', 'loc2']].itertuples():\n            points = points_reverse[(points_reverse['loc1'] < row) & \n                                    (points_reverse['loc2'] > col) & \n                                    (points_reverse['loc1'] > row - self.mg1) & \n                                    (points_reverse['loc2'] < col + self.mg2)]\n            \n            row_i_old, gap = row, self.mg2\n            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():\n                if col_j - col > gap and row_i < row_i_old:\n                    break\n                score = grading + (row - row_i + col_j - col) * self.gap_penalty\n                score2 = score + self.points.at[index, 'score2']\n                if score > 0 and self.points.at[index_ij, 'score2'] < score2:\n                    self.points.at[index_ij, 'score2'] = score2\n                    self.points.at[index, 'usedtimes2'] += 1\n                    self.points.at[index_ij, 'usedtimes2'] += 1\n                    self.points.at[index_ij, 'path2'] = self.points.at[index, 'path2'] + [index_ij]\n                    gap = min(col_j - col, gap)\n                    row_i_old = row_i\n\n    def max_path(self, points):\n        \"\"\"Find the maximum path for the given points.\"\"\"\n        if len(points) == 0:\n            self.over_length = 0\n            return False\n        \n        # Initialize path score and index\n        self.score, self.path_index = 
points.loc[points.index[0], ['score', 'path']]\n        self.path = points[points.index.isin(self.path_index)]\n        self.over_length = len(self.path_index)\n        \n        # Check if the block overlaps with other blocks\n        if self.over_length >= self.over_gap and len(self.path) / self.over_length > self.coverage_ratio:\n            points.drop(index=self.path.index, inplace=True)\n            [loc1_min, loc2_min], [loc1_max, loc2_max] = self.path[['loc1', 'loc2']].agg(['min', 'max']).to_numpy()\n\n            # Calculate p-value\n            gap_init = self.points_init[(loc1_min <= self.points_init['loc1']) & \n                                        (self.points_init['loc1'] <= loc1_max) & \n                                        (loc2_min <= self.points_init['loc2']) & \n                                        (self.points_init['loc2'] <= loc2_max)].copy()\n            \n            self.p_value = self.p_value_estimated(gap_init, loc1_max - loc1_min + 1, loc2_max - loc2_min + 1)\n            self.path = self.path.sort_values(by=['loc1'], ascending=[True])[['loc1', 'loc2']]\n            return True\n        else:\n            points.drop(index=points.index[0], inplace=True)\n        return False\n\n    def p_value_estimated(self, gap, L1, L2):\n        \"\"\"Estimate p-value based on the given gap and lengths.\"\"\"\n        N1 = gap['times'].sum()\n        N = len(gap)\n        self.points_init.loc[gap.index, 'times'] += 1\n        m = len(self.path)\n        a = (1 - self.score / m / self.grading[0]) * (N1 - m + 1) / N * (L1 - m + 1) * (L2 - m + 1) / L1 / L2\n        return round(a, 4)\n"
  },
  {
    "path": "build/lib/wgdi/dotplot.py",
    "content": "import re\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass dotplot():\n    def __init__(self, options):\n        self.multiple = 1\n        self.score = 100\n        self.evalue = 1e-5\n        self.repeat_number = 20\n        self.markersize = 0.5\n        self.figsize = 'default'\n        self.position = 'order'\n        self.ancestor_top = None\n        self.ancestor_left = None\n        self.blast_reverse = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n        if self.ancestor_top == 'none' or self.ancestor_top == '':\n            self.ancestor_top = None\n        if self.ancestor_left == 'none' or self.ancestor_left == '':\n            self.ancestor_left = None\n        base.str_to_bool(self.blast_reverse)\n\n    def pair_positon(self, blast, gff1, gff2, rednum, repeat_number):\n        blast['color'] = ''\n        blast['loc1'] = blast[0].map(gff1['loc'])\n        blast['loc2'] = blast[1].map(gff2['loc'])\n        bluenum = 5+rednum\n        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()\n                 for name, group in blast.groupby([0])]\n        reddata = np.array([k[:rednum] for k in index], dtype=object)\n        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)\n        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)\n        if len(reddata):\n            redindex = np.concatenate(reddata)\n        else:\n            redindex = []\n        if len(bluedata):\n            blueindex = np.concatenate(bluedata)\n        else:\n            blueindex = []\n        if len(graydata):\n            grayindex = np.concatenate(graydata)\n        else:\n            grayindex = []\n        blast.loc[redindex, 'color'] = 'red'\n        blast.loc[blueindex, 'color'] = 'blue'\n        blast.loc[grayindex, 'color'] = 'gray'\n        return 
blast[blast['color'].str.contains(r'\\w')]\n\n    def run(self):\n        axis = [0, 1, 1, 0]\n        left, right, top, bottom = 0.07, 0.97, 0.93, 0.03\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        step1 = 1 / float(lens1.sum())\n        step2 = 1 / float(lens2.sum())\n        if self.ancestor_left != None:\n            axis[0] = -0.02\n            lens_ancestor_left = pd.read_csv(\n                self.ancestor_left, sep=\"\\t\", header=None)\n            lens_ancestor_left[0] = lens_ancestor_left[0].astype(str)\n            lens_ancestor_left[3] = lens_ancestor_left[3].astype(str)\n            lens_ancestor_left[4] = lens_ancestor_left[4].astype(int)\n            lens_ancestor_left[4] = lens_ancestor_left[4] / lens_ancestor_left[4].max()\n            lens_ancestor_left = lens_ancestor_left[lens_ancestor_left[0].isin(\n                lens1.index)]\n        if self.ancestor_top != None:\n            axis[3] = -0.02\n            lens_ancestor_top = pd.read_csv(\n                self.ancestor_top, sep=\"\\t\", header=None)\n            lens_ancestor_top[0] = lens_ancestor_top[0].astype(str)\n            lens_ancestor_top[3] = lens_ancestor_top[3].astype(str)\n            lens_ancestor_top[4] = lens_ancestor_top[4].astype(int)\n            lens_ancestor_top[4] = lens_ancestor_top[4] / lens_ancestor_top[4].max()\n            lens_ancestor_top = lens_ancestor_top[lens_ancestor_top[0].isin(\n                lens2.index)]\n        if re.search(r'\\d', self.figsize):\n            self.figsize = [float(k) for k in self.figsize.split(',')]\n        else:\n            self.figsize = np.array(\n                [1, float(lens1.sum())/float(lens2.sum())])*10\n        plt.rcParams['ytick.major.pad'] = 0\n        fig, ax = plt.subplots(figsize=self.figsize)\n        ax.xaxis.set_ticks_position('top')\n        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,\n                           
self.genome1_name, self.genome2_name, [axis[0], axis[3]])\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n        gff1 = base.gene_location(gff1, lens1, step1, self.position)\n        gff2 = base.gene_location(gff2, lens2, step2, self.position)\n        if self.ancestor_top != None:\n            top = top\n            self.aree_left = self.ancestor_posion(ax, gff2, lens_ancestor_top, 'top')\n        if self.ancestor_left != None:\n            left = left\n            self.aree_top = self.ancestor_posion(ax, gff1, lens_ancestor_left, 'left')\n        print('read gffs')\n        blast = base.newblast(self.blast, int(self.score),\n                              float(self.evalue), gff1, gff2, self.blast_reverse)\n        if len(blast) ==0:\n            print('Stopped! \\n\\nThe gene IDs in the blast file do not correspond to gff1 and gff2.')\n            exit(0)\n        print('read blast')\n        df = self.pair_positon(blast, gff1, gff2,\n                               int(self.multiple), int(self.repeat_number))\n        print('deal blast')\n        ax.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'],\n                   alpha=0.5, edgecolors=None, linewidths=0, marker='o')\n        ax.axis(axis)\n        plt.subplots_adjust(left=left, right=right, top=top, bottom=bottom)\n        plt.savefig(self.savefig, dpi=300)\n        plt.show()\n\n    def ancestor_posion(self, ax, gff, lens, mark):\n        data = []\n        for index, row in lens.iterrows():\n            loc1 = gff[(gff['chr'] == row[0]) & (\n                gff['order'] == int(row[1]))].index\n            loc2 = gff[(gff['chr'] == row[0]) & (\n                gff['order'] == int(row[2])-1)].index\n            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']\n            if mark == 'top':\n                width = abs(loc1-loc2)\n                loc = [min(loc1, loc2), 0]\n                height = -0.02\n                base.Rectangle(ax, loc, height, width, 
row[3], row[4])\n            if mark == 'left':\n                height = abs(loc1-loc2)\n                loc = [-0.02, min(loc1, loc2), ]\n                width = 0.02\n                base.Rectangle(ax, loc, height, width, row[3], row[4])\n            data.append([loc, height, width, row[3], row[4]])\n        return data\n"
  },
  {
    "path": "build/lib/wgdi/example/__init__.py",
    "content": ""
  },
  {
    "path": "build/lib/wgdi/example/align.conf",
    "content": "[alignment]\nblockinfo = block information file (.csv)\nblockinfo_reverse = false\nclassid =  class1\ngff1 =  gff1 file\ngff2 =  gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name =  Genome1 name\ngenome2_name =  Genome2 name\nmarkersize = 0.5\nks_area = -1,3\nposition = order\ncolors = red,blue,green\nfigsize = 10,10\nsavefile = savefile(.csv)\nsavefig= save image(.png, .pdf, .svg)"
  },
  {
    "path": "build/lib/wgdi/example/alignmenttrees.conf",
    "content": "[alignmenttrees]\nalignment = alignment file (.csv)\ngff = gff file (reference genome, If alignment has no reference species, delete it)\nlens = lens file (If alignment has no reference species, delete it)\ndir = output folder\nsequence_file = sequence file (.fa)\ncds_file = cds file (.fa)\ncodon_positon = 1,2,3  (1,2 mean codon1&2; 1,2,3 mean no codon removed)\ntrees_file =  trees (.nwk)\nalign_software = (mafft,muscle)\ntree_software =  (iqtree,fasttree)\nthreads = 1 (Number,AUTO)\nmodel = MFP\ntrimming =  (trimal,divvier)\nminimum = 4\ndelete_detail = true\n"
  },
  {
    "path": "build/lib/wgdi/example/ancestral_karyotype.conf",
    "content": "[ancestral_karyotype]\ngff = gff file (cat the relevant 'gff' files into a file)\npep_file = pep file (cat the relevant 'pep.fa' files into a file)\nancestor = ancestor file  (this file requires you to provide)\nmark = aak \nancestor_gff =  result file\nancestor_lens =  result file\nancestor_pep =  result file\nancestor_file =  result file"
  },
  {
    "path": "build/lib/wgdi/example/ancestral_karyotype_repertoire.conf",
    "content": "[ancestral_karyotype_repertoire]\nblockinfo =  block information (*.csv)\n# blockinfo: processed *.csv\nblockinfo_reverse =  False\ngff1 =  gff1 file (ancestor's gff)\ngff2 =  gff2 file (the other species's gff)\ngap = 5\nmark = aak1s\nancestor = ancestor file \n#current ancestor file\nancestor_new =  result file\nancestor_pep =  ancestor pep file \n#cat all pep files together\nancestor_pep_new =  result file\nancestor_gff =  result file\nancestor_lens =  result file\n"
  },
  {
    "path": "build/lib/wgdi/example/blockinfo.conf",
    "content": "[blockinfo]\nblast = blast file\ngff1 =  gff1 file\ngff2 =  gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ncollinearity = collinearity file\nscore = 100\nevalue = 1e-5\nrepeat_number = 20\nposition = order\nks = ks file\nks_col = ks_NG86\nsavefile = block information (*.csv)\n"
  },
  {
    "path": "build/lib/wgdi/example/blockks.conf",
    "content": "[blockks]\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name =  Genome1 name\ngenome2_name =  Genome2 name\nblockinfo = block information (*.csv)\npvalue = 0.2\ntandem = true\ntandem_length = 200\nmarkersize = 1\narea = 0,2\nblock_length =  minimum length\nfigsize = 8,8\nsavefig = save image(.png, .pdf, .svg)\n"
  },
  {
    "path": "build/lib/wgdi/example/circos.conf",
    "content": "[circos]\ngff =  gff file\nlens =  lens file\nradius = 0.2\nangle_gap = 0.05\nring_width = 0.015\ncolors  = 1:c,2:m,3:blue,4:gold,5:red,6:lawngreen,7:darkgreen,8:k,9:darkred,10:gray\nalignment = alignment file \nchr_label = chr\nancestor = ancestor alignment file \nancestor_location = ancestor file \nfigsize = 10,10\nlabel_size = 9\nposition = order\nlegend_square = 0.04, 0.04\ncolumn_names = 1,2,3,4,5\nsavefig = result(.png, .pdf, .svg)\n"
  },
  {
    "path": "build/lib/wgdi/example/collinearity.conf",
    "content": "[collinearity]\ngff1 = gff1 file\ngff2 = gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\nblast = blast file\nblast_reverse = false\ncomparison = genomes\nmultiple  = 1\nprocess = 8\nevalue = 1e-5\nscore = 100\ngrading = 50,30,25\nmg = 25,25\npvalue = 1\nrepeat_number = 20\npositon = order\nsavefile = collinearity file\n"
  },
  {
    "path": "build/lib/wgdi/example/conf.ini",
    "content": "[ini]\nmafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft\npal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2nal.pl\nyn00_path = /home/sunpc/micromamba/envs/wgdi/bin/yn00\nmuscle_path = /home/sunpc/micromamba/envs/wgdi/bin/muscle\niqtree_path =  /home/sunpc/micromamba/envs/wgdi/bin/iqtree\ntrimal_path = /home/sunpc/micromamba/envs/wgdi/bin/trimal\nfasttree_path = /home/sunpc/micromamba/envs/wgdi/bin/fasttree\ndivvier_path = /home/sunpc/micromamba/envs/wgdi/bin/divvier\n"
  },
  {
    "path": "build/lib/wgdi/example/corr.conf",
    "content": "[correspondence]\nblockinfo =  blockinfo file(.csv) \nlens1 = lens1 file\nlens2 = lens2 file\ntandem = true\ntandem_length = 200\npvalue = 0.2\nblock_length = 5\ntandem_ratio = 0.5\nmultiple  = 1\nhomo = -1,1\nsavefile = savefile(.csv)\n"
  },
  {
    "path": "build/lib/wgdi/example/dotplot.conf",
    "content": "[dotplot]\nblast = blast file\ngff1 =  gff1 file\ngff2 =  gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name =  Genome1 name\ngenome2_name =  Genome2 name\nmultiple  = 1\nscore = 100\nevalue = 1e-5\nrepeat_number = 10\nposition = order\nblast_reverse = false\nancestor_left = ancestor file or none\nancestor_top = ancestor file or none\nmarkersize = 0.5\nfigsize = 10,10\nsavefig = savefile(.png, .pdf, .svg)\n"
  },
  {
    "path": "build/lib/wgdi/example/fusion_positions_database.conf",
    "content": "[fusion_positions_database]\npep = pep file\ngff = gff file\nfusion_positions = fusion_positions file\n# Number of gene sets on each side of the breakpoint\nancestor_gff =  result file\nancestor_lens =  result file\nancestor_pep =  result file\nancestor_file =  result file\n"
  },
  {
    "path": "build/lib/wgdi/example/fusions_detection.conf",
    "content": "[fusions_detection]\nblockinfo = block information (*.csv)\nancestor = ancestor file\n#The number of genes spanned by a synteny block on both sides of a breakpoint.\nmin_genes_per_side = 5\ndensity = 0.3\nfiltered_blockinfo = result blockinfo (.csv)\n"
  },
  {
    "path": "build/lib/wgdi/example/karyotype.conf",
    "content": "[karyotype]\nancestor = ancestor chromosome file\nwidth = 0.5\nfigsize = 10,6.18\nsavefig = save image(.png, .pdf, .svg)"
  },
  {
    "path": "build/lib/wgdi/example/karyotype_mapping.conf",
    "content": "[karyotype_mapping]\nblast = blast file\nblast_reverse = false\ngff1 = gff1 file\ngff2 = gff2 file \nscore = 100\nevalue = 1e-5\nrepeat_number = 5\nancestor_left = ancestor location file (Only one of ('left', 'top') can be reserved)\nancestor_top = ancestor location file\nthe_other_lens = the other lens file\nblockinfo = block information (*.csv)\nblockinfo_reverse = false\nlimit_length = 5\nthe_other_ancestor_file =  result file "
  },
  {
    "path": "build/lib/wgdi/example/ks.conf",
    "content": "[ks]\ncds_file = \tcds file \n#cat all cds files together\npep_file = \tpep file\n#cat all pep files together\nalign_software = muscle\npairs_file = gene pairs file\nks_file = ks result"
  },
  {
    "path": "build/lib/wgdi/example/ks_fit_result.csv",
    "content": ",color,linewidth,linestyle,,,,,,\ncsa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862\nvvi_vvi,blue,2,-,3.00367275,1.288717936,0.177816426,,,\nvvi_oin_gamma,orange,2,-,1.910418336,1.328469514,0.262257112,,,\nvvi_oin,orange,2,--,4.948194212,0.882608858,0.10426873,,,\nvvi_csa,green,2,--,2.470770292464022,1.4131842495219498,0.21391959288821544,,,\n"
  },
  {
    "path": "build/lib/wgdi/example/ksfigure.conf",
    "content": "[ksfigure]\nksfit = ksfit result(*.csv)\nlabelfontsize = 15\nlegendfontsize = 15\nxlabel = none            \nylabel = none            \ntitle = none\narea = 0,2\nfigsize = 10,6.18\nshadow = true (true/false)\nsavefig =  save image(.png, .pdf, .svg)\n"
  },
  {
    "path": "build/lib/wgdi/example/kspeaks.conf",
    "content": "[kspeaks]\nblockinfo = block information (*.csv)\npvalue = 0.2\ntandem = true\nblock_length = int number\nks_area = 0,10\nmultiple  = 1\nhomo = 0,1\nfontsize = 9\narea = 0,3\nfigsize = 10,6.18\nsavefig = saving image(.png,.pdf)\nsavefile = ks medain savefile\n"
  },
  {
    "path": "build/lib/wgdi/example/peaksfit.conf",
    "content": "[peaksfit]\nblockinfo = block information (*.csv)\nmode = median\nbins_number = 200\nks_area = 0,10\nfontsize = 9\narea = 0,3\nfigsize = 10,6.18\nshadow = true \nsavefig = saving image(.png,.pdf,.svg)"
  },
  {
    "path": "build/lib/wgdi/example/pindex.conf",
    "content": "[pindex]\nalignment = alignment file (.csv)\ngff = gff file\nlens =lens file\ngap = 50\nretention = 0.05\ndiff = 0.05\nremove_delta = (true/false)\nsavefile = result file(.csv)\n"
  },
  {
    "path": "build/lib/wgdi/example/polyploidy_classification.conf",
    "content": "[polyploidy classification]\nblockinfo = block information (*.csv)\nancestor_left = ancestor file\nancestor_top = ancestor file\nclassid = class1,class2\nsame_protochromosome =  False\nsame_subgenome =  False\nsavefile = result file(.csv)"
  },
  {
    "path": "build/lib/wgdi/example/retain.conf",
    "content": "[retain]\nalignment = alignment file\ngff = gff file\nlens = lens file\ncolors = red,blue,green\nrefgenome = shorthand\nfigsize = 10,12\nstep = 50\nylabel = y label\nsavefile = retain file (result)\nsavefig = result(.png, .pdf, .svg)\n"
  },
  {
    "path": "build/lib/wgdi/example/shared_fusion.conf",
    "content": "[shared_fusion]\nblockinfo = block information (*.csv)\n# The new lens file is the output filtered by lens file.\nlens1 = lens file, new lens file\nlens2 =  lens file,  new lens file\nancestor_left = ancestor file\nancestor_top = ancestor file\nclassid = class1,class2\nlimit_length = 5\nfiltered_blockinfo = result blockinfo (.csv)"
  },
  {
    "path": "build/lib/wgdi/fusion_positions_database.py",
    "content": "import pandas as pd\nimport os\nfrom Bio import SeqIO\n\nclass fusion_positions_database:\n    def __init__(self, options):\n        for k, v in options:\n            setattr(self, k, v)\n            print(f'{k} = {v}')\n\n    def run(self):\n        # Load and remove duplicates from data\n        gff = pd.read_csv(self.gff, sep=\"\\t\", header=None, dtype={0: str, 5: int}).drop_duplicates()\n        pep = SeqIO.to_dict(SeqIO.parse(self.pep, \"fasta\"))\n        df = pd.read_csv(self.fusion_positions, sep=\"\\t\", header=None, dtype={0: str, 1: int, 2:int, 3:str}).drop_duplicates()\n        \n        # Load ancestral sequence file if it exists\n        seqs = SeqIO.to_dict(SeqIO.parse(self.ancestor_pep, \"fasta\")) if os.path.exists(self.ancestor_pep) else {}\n\n        sf_gff, sf_lens = [], []\n\n        # Process fusion positions\n        for _, row in df.iterrows():\n            newchr = row[3]\n            newgff = gff[(gff[0] == row[0]) & \n                         (gff[5] >= row[1] - row[2]) & \n                         (gff[5] < row[1] + row[2])].copy()\n            newgff['id'] = [f\"{newchr}s{str(row[0]).zfill(2)}g{str(i).zfill(3)}\" for i in range(1, len(newgff) + 1)]\n\n            sf_position = row[1] - newgff.iloc[0, 5]\n            sf_lens.append([newchr, sf_position, len(newgff)])\n            \n            # For each gene in the filtered GFF region\n            for _, gff_row in newgff.iterrows():\n                if gff_row[1] in pep and gff_row['id'] not in seqs:\n                    gene = pep[gff_row[1]][:]\n                    gene.id, gene.description = gff_row['id'], ''\n                    seqs[gff_row['id']] = gene\n                    # Collect data for the final GFF output\n                    sf_gff.append([gff_row['id'], newchr, sf_position, gff_row[2], gff_row[3], gff_row[4], gff_row[1]])\n\n        # Write sequences to FASTA file\n        SeqIO.write(seqs.values(), self.ancestor_pep, 'fasta')\n\n        # Save filtered 
GFF data\n        if sf_gff:\n            sf_gff = pd.DataFrame(sf_gff)\n            sf_gff.rename(columns={3: 'start', 4: 'end', 5: 'strand'}, inplace=True)\n            sf_gff['order'] = sf_gff[0].str[-3:].astype(int)\n            sf_gff[[1, 0, 'start', 'end', 'strand', 'order', 6]].to_csv(self.ancestor_gff, sep=\"\\t\", mode='a', index=False, header=None)\n            sf_lens = pd.DataFrame(sf_lens).drop_duplicates()\n            sf_lens.to_csv(self.ancestor_lens, sep=\"\\t\", mode='a', index=False, header=None)\n\n            # Generate ancestral sequence data\n            ancestor = []\n            for _, row in sf_lens.iterrows():\n                ancestor.append([row[0], 1, row[1], 'red', 1])\n                ancestor.append([row[0], row[1] + 1, row[2], 'blue', 1])\n            pd.DataFrame(ancestor).to_csv(self.ancestor_file, sep=\"\\t\", mode='a', index=False, header=None)\n\n        # Remove duplicates from the output files\n        for file in [self.ancestor_gff, self.ancestor_lens, self.ancestor_file]:\n            df = pd.read_csv(file, header=None).drop_duplicates().to_csv(file, index=False, header=None)\n"
  },
  {
    "path": "build/lib/wgdi/fusions_detection.py",
    "content": "import pandas as pd\nfrom tabulate import tabulate\n\nclass fusions_detection:\n    def __init__(self, options):\n        self.min_genes_per_side = 5\n        self.density = 0.3\n        for k, v in options:\n            setattr(self, k, v)\n            print(f\"{k} = {v}\")\n        self.min_genes_per_side = int(self.min_genes_per_side)\n        self.density = float(self.density)\n\n    def run(self):\n        # Load the ancestor file and process the positions\n        ancestor = pd.read_csv(self.ancestor, sep='\\t', header=None)\n        position = ancestor.groupby(0)[2].unique().apply(pd.Series)\n        bkinfo = pd.read_csv(self.blockinfo)\n        newbkinfo = bkinfo.head(0)\n        \n        # Iterate over each row in the position dataframe\n        for index, row in position.iterrows():\n            # Filter the bkinfo dataframe based on chr2 and density\n            filtered_group = bkinfo[(bkinfo['chr2'] == index) & (bkinfo['density2'] >= self.density)].copy()\n            # Split the block2 column and stack the resulting series\n            df = filtered_group['block2'].str.split('_', expand=True).stack().astype(int)\n            # Count the number of genes greater and less than the current position\n            filtered_group['greater'] = (df > row[0]).groupby(level=0).sum()\n            filtered_group['less'] = (df < row[0]).groupby(level=0).sum()\n            # Filter the group based on the minimum number of genes per side\n            filtered_group = filtered_group[(filtered_group['greater'] >= self.min_genes_per_side) & (filtered_group['less'] >= self.min_genes_per_side)]\n            # Concatenate the filtered group with the newbkinfo dataframe\n            newbkinfo = pd.concat([newbkinfo, filtered_group])\n        if len(newbkinfo) ==0:\n            print(\"\\nNo shared fusion breakpoints detected\")\n            exit(0)\n\n        # Get and print the shared fusion positions\n        newbkinfo.to_csv(self.filtered_blockinfo, 
header=True, index=False)\n        non_overlap_counts = newbkinfo.groupby('chr2').apply(self.count_non_overlapping)\n        data = [(chr2, count) for chr2, count in non_overlap_counts.items()]\n        print(\"\\nThe following are the shared fusion breakpoints and counts:\")\n        print(tabulate(data, headers=[\"Fusion Breakpoint\", \"Count\"], tablefmt=\"github\"))\n\n    def count_non_overlapping(self, group):\n        if len(group) == 1:\n            return 1\n        grouped = group.groupby('chr1')\n        total_count = 0\n        for chr1, chr_group in grouped:\n            chr_group = chr_group.sort_values(by='start1').reset_index(drop=True)\n            count = 0\n            current_end = -1 \n            for _, row in chr_group.iterrows():\n                start1, end1 = row['start1'], row['end1']\n                if start1 > current_end:\n                    count += 1\n                    current_end = end1 \n            total_count += count\n        return total_count"
  },
  {
    "path": "build/lib/wgdi/karyotype.py",
    "content": "import matplotlib.pyplot as plt\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass karyotype():\n    def __init__(self, options):\n        self.width = 0.5\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(str(k), ' = ', v)\n        if hasattr(self, 'figsize'):\n            self.figsize = [float(k) for k in self.figsize.split(',')]\n        else:\n            self.figsize = 10, 6.18\n        if hasattr(self, 'width'):\n            self.width = float(self.width)\n        else:\n            self.width = 0.5\n\n    def run(self):\n        fig, ax = plt.subplots(figsize=self.figsize)\n        ancestor_lens = pd.read_csv(\n            self.ancestor, sep=\"\\t\", header=None)\n        ancestor_lens[0] = ancestor_lens[0].astype(str)\n        ancestor_lens[3] = ancestor_lens[3].astype(str)\n        ancestor_lens[4] = ancestor_lens[4].astype(int)\n        ancestor_lens[4] = ancestor_lens[4] / ancestor_lens[4].max()\n        chrs = ancestor_lens[0].drop_duplicates().to_list()\n        ax.bar(chrs, 10, color='white', alpha=0)\n        for index, row in ancestor_lens.iterrows():\n            base.Rectangle(ax, [chrs.index(row[0])-self.width*0.5,\n                                row[1]], row[2]-row[1], self.width, row[3], row[4])\n        ax.tick_params(labelsize=15)\n        ax.spines['top'].set_visible(False)\n        ax.spines['right'].set_visible(False)\n        ax.spines['left'].set_visible(False)\n        ax.spines['bottom'].set_visible(False)\n        ax.set_xticks([])\n        ax.set_yticks([])\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n"
  },
  {
    "path": "build/lib/wgdi/karyotype_mapping.py",
    "content": "import numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass karyotype_mapping:\n    def __init__(self, options):\n        # Initialize default attributes\n        self.blast_reverse = False\n        self.blockinfo_reverse = False\n        self.position = 'order'\n        self.block_length = 5\n        self.limit_length = 5\n        self.repeat_number = 20\n        self.score = 100\n        self.evalue = 1e-5\n\n        # Update attributes with provided keyword arguments and print them\n        for k, v in options:\n            setattr(self, k, v)\n            print(f\"{k} = {v}\")\n        \n        self.blast_reverse = base.str_to_bool(self.blast_reverse)\n        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)\n        self.limit_length = int(self.limit_length)\n\n    def karyotype_left(self, pairs, ancestor, gff1, gff2):\n        # Loop through each row in ancestor to set color and classification in gff1\n        for _, row in ancestor.iterrows():\n            loc_min, loc_max = sorted([row[1], row[2]])\n            index1 = gff1[(gff1['chr'] == row[0]) &\n                          (gff1['order'] >= loc_min) &\n                          (gff1['order'] <= loc_max)].index\n            gff1.loc[index1, ['color', 'classification']] = row[3], row[4]\n\n        # Merge pairs with gff1 and update gff2 with color and classification\n        data = pd.merge(pairs, gff1, left_on=0, right_index=True, how='left')\n        data.drop_duplicates(subset=[1], inplace=True)\n        data.set_index(1, inplace=True)\n        gff2.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]\n        return gff2\n\n    def karyotype_top(self, pairs, ancestor, gff1, gff2):\n        # Loop through each row in ancestor to set color and classification in gff2\n        for _, row in ancestor.iterrows():\n            loc_min, loc_max = sorted([row[1], row[2]])\n            index1 = gff2[(gff2['chr'] == row[0]) &\n     
                     (gff2['order'] >= loc_min) &\n                          (gff2['order'] <= loc_max)].index\n            gff2.loc[index1, ['color', 'classification']] = row[3], row[4]\n\n        # Merge pairs with gff2 and update gff1 with color and classification\n        data = pd.merge(pairs, gff2, left_on=1, right_index=True, how='left')\n        data.drop_duplicates(subset=[0], inplace=True)\n        data.set_index(0, inplace=True)\n        gff1.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]\n        return gff1\n\n    def karyotype_map(self, gff, lens):\n        # Filter gff based on lens index and non-null color\n        gff = gff[gff['chr'].isin(lens.index) & gff['color'].notnull()]\n        ancestor = []\n        # Group by chromosome and process each group to create ancestor records\n        for chr, group in gff.groupby('chr'):\n            color, class_id, arr = '', 1, []\n            for _, row in group.iterrows():\n                if color ==  row['color'] and class_id == row['classification']:\n                    arr.append(row['order'])\n                else:\n                    if len(arr) >= self.limit_length:\n                        ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])\n                    color, class_id = row['color'], row['classification']\n                    arr = []\n                    if len(ancestor) >= 1 and color == ancestor[-1][3] and class_id == ancestor[-1][4] and chr == ancestor[-1][0]:\n                        arr.append(ancestor[-1][1])\n                        arr += np.random.randint(ancestor[-1][1], ancestor[-1][2], size=ancestor[-1][5]-1).tolist()\n                        ancestor.pop()\n                    arr.append(row['order'])\n            if len(arr) >= self.limit_length:\n                ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])\n\n        ancestor = pd.DataFrame(ancestor)\n        # Adjust min and max positions for 
each chromosome group\n        for chr, group in ancestor.groupby(0):\n            ancestor.loc[group.index[0], 1] = 1\n            ancestor.loc[group.index[-1], 2] = lens[chr]\n        ancestor[4] = ancestor[4].astype(int)\n        return ancestor[[0, 1, 2, 3, 4, 5]]\n\n    def colinear_gene_pairs(self, bkinfo, gff1, gff2):\n        gff1 = gff1.reset_index()\n        gff2 = gff2.reset_index()\n        \n        gff1_indexed = gff1.set_index(['chr', 'order'])\n        gff2_indexed = gff2.set_index(['chr', 'order'])\n        \n        data = []\n        for _, row in bkinfo.iterrows():\n            b1 = list(map(int, row['block1'].split('_')))\n            b2 = list(map(int, row['block2'].split('_')))\n\n            for order1, order2 in zip(b1, b2):\n                a = gff1_indexed.loc[(row['chr1'], order1), 1]\n                b = gff2_indexed.loc[(row['chr2'], order2), 1]\n                data.append([a, b])\n        return pd.DataFrame(data)\n    \n    def new_ancestor(self, ancestor, gff1, gff2, blast):\n        # Iterate through ancestor rows to adjust positions based on neighboring rows\n        for i in range(1, len(ancestor)):\n            if ancestor.iloc[i, 0] == ancestor.iloc[i-1, 0]:\n                area = ancestor.iloc[i, 1] - ancestor.iloc[i-1, 2]\n                if area <= 5:\n                    ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1\n                else:\n                    index1 = gff1[(gff1['chr'] == ancestor.iloc[i, 0]) &\n                                (gff1['order'] >= ancestor.iloc[i-1, 2]+1) &\n                                (gff1['order'] <= ancestor.iloc[i, 1]-1)].index\n                    index2 = gff2[gff2['color'] == ancestor.iloc[i-1, 3]].index\n                    index3 = gff2[gff2['color'] == ancestor.iloc[i, 3]].index\n\n                    newblast1 = blast[(blast[0].isin(index1)) & (blast[1].isin(index2))]\n                    newblast2 = blast[(blast[0].isin(index1)) & (blast[1].isin(index3))]\n\n               
     if len(newblast1) >= len(newblast2):\n                        ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1\n                    else:\n                        ancestor.iloc[i, 1] = ancestor.iloc[i-1, 2] + 1\n        for chr, group in ancestor.groupby(0):\n            if len(group) == 1:\n                continue\n            newgff1 = gff1[gff1['chr'] == chr]\n            for i in range(1, len(group)):\n                if group.iloc[i, 5] > 200:\n                    continue\n\n                index_left = newgff1[(newgff1['order'] >= group.iloc[i, 1]) &\n                                (newgff1['order'] <= group.iloc[i, 2])].index\n                blast_left = blast[blast[0].isin(index_left)]\n\n                index_prev = gff2[gff2['color'] == group.iloc[i-1, 3]].index\n                blast_prev = blast_left[blast_left[1].isin(index_prev)]\n\n                index_curr = gff2[gff2['color'] == group.iloc[i, 3]].index\n                blast_curr = blast_left[blast_left[1].isin(index_curr)]\n\n                if len(blast_curr) <= len(blast_prev):\n                    ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]-1,3]\n\n                if i < len(group)-1:\n                    index_next = gff2[gff2['color'] == group.iloc[i+1, 3]].index\n                    blast_next = blast_left[blast_left[1].isin(index_next)]\n                    if len(blast_next) > max(len(blast_prev),len(blast_curr)):\n                        ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]+1,3]\n        \n        ancestor['group'] = (ancestor[0].shift(1) != ancestor[0]) | (ancestor[3].shift(1) != ancestor[3]) | (ancestor[4].shift(1) != ancestor[4])\n        ancestor['group'] = ancestor['group'].cumsum()\n        result = ancestor.groupby('group').agg({\n            0: 'first',\n            1: 'min',\n            2: 'max',\n            3: 'first',\n            4: 'first',\n        }).reset_index(drop=True)\n\n        return result\n\n    def run(self):\n   
     # Read and process block information\n        bkinfo = pd.read_csv(self.blockinfo, index_col='id')\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        if self.blockinfo_reverse == True:\n            bkinfo[['chr1', 'chr2']] =  bkinfo[['chr2', 'chr1']]\n            bkinfo[['block1', 'block2']] =  bkinfo[['block2', 'block1']]\n        bkinfo = bkinfo[bkinfo['length'] > int(self.block_length)]\n\n        # Read GFF and lens data\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n        lens = base.newlens(self.the_other_lens, self.position)\n        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)\n        # blast.drop_duplicates(subset=[0], keep='first', inplace=True)\n\n        # Find colinear gene pairs\n        pairs = self.colinear_gene_pairs(bkinfo, gff1, gff2)\n\n        # Depending on available attributes, call either karyotype_top or karyotype_left\n        if hasattr(self, 'ancestor_top'):\n            ancestor = base.read_classification(self.ancestor_top)\n            data = self.karyotype_top(pairs, ancestor, gff1, gff2)\n        elif hasattr(self, 'ancestor_left'):\n            ancestor = base.read_classification(self.ancestor_left)\n            data = self.karyotype_left(pairs, ancestor, gff1, gff2)\n            gff1, gff2 = gff2, gff1\n            blast.iloc[:, :2] = blast.iloc[:, [1, 0]].to_numpy()\n        else:\n            print('Missing ancestor file.')\n            exit(0)\n\n        # Map the data and create the final ancestor file\n        the_other_ancestor_file = self.karyotype_map(data, lens)\n        the_other_ancestor_file = self.new_ancestor(the_other_ancestor_file, gff1, gff2, blast)\n        the_other_ancestor_file.to_csv(self.the_other_ancestor_file, sep='\\t', header=False, index=False)"
  },
  {
    "path": "build/lib/wgdi/ks.py",
    "content": "import os\nimport sys\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\nimport subprocess\nfrom Bio.Phylo.PAML import yn00\nimport wgdi.base as base\n\n\nclass ks:\n    def __init__(self, options):\n        base_conf = base.config()\n        self.pair_pep_file = 'pair.pep'\n        self.pair_cds_file = 'pair.cds'\n        self.prot_align_file = 'prot.aln'\n        self.mrtrans = 'pair.mrtrans'\n        self.pair_yn = 'pair.yn'\n\n        for k, v in base_conf:\n            setattr(self, str(k), v)\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f'{str(k)} = {v}')\n\n    def auto_file(self):\n        pairs = []\n        with open(self.pairs_file) as f:\n            p = ' '.join(f.readlines()[:30])\n\n        # Detect file format and process accordingly\n        if 'path length' in p or 'MAXIMUM GAP' in p:\n            collinearity = base.read_colinearscan(self.pairs_file)\n            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]\n        elif 'MATCH_SIZE' in p or '## Alignment' in p:\n            collinearity = base.read_mcscanx(self.pairs_file)\n            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]\n        elif '# Alignment' in p:\n            collinearity = base.read_collinearity(self.pairs_file)\n            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]\n        elif '###' in p:\n            collinearity = base.read_jcvi(self.pairs_file)\n            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]\n        elif ',' in p:\n            collinearity = pd.read_csv(self.pairs_file, header=None)\n            pairs = collinearity.values.tolist()\n        else:\n            collinearity = pd.read_csv(self.pairs_file, header=None, sep='\\t')\n            pairs = collinearity.values.tolist()\n\n        df = pd.DataFrame(pairs).drop_duplicates()\n        df[0] = df[0].astype(str)\n        df[1] = df[1].astype(str)\n        df.index = df[0] + ',' + 
df[1]\n        return df\n\n    def run(self):\n        # Load sequence data\n        cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, \"fasta\"))\n        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, \"fasta\"))\n        df_pairs = self.auto_file()\n\n        # Check if ks file exists and load it, otherwise create a new one\n        if os.path.exists(self.ks_file):\n            ks = pd.read_csv(self.ks_file, sep='\\t').drop_duplicates()\n            kscopy = ks.copy()\n            names = ks.columns.tolist()\n            names[0], names[1] = names[1], names[0]\n            kscopy.columns = names\n            ks = pd.concat([ks, kscopy])\n            ks['id'] = ks['id1'] + ',' + ks['id2']\n            df_pairs.drop(np.intersect1d(df_pairs.index, ks['id'].to_numpy()), inplace=True)\n            ks_file = open(self.ks_file, 'a+')\n        else:\n            ks_file = open(self.ks_file, 'w')\n            ks_file.write('\\t'.join(['id1', 'id2', 'ka_NG86', 'ks_NG86', 'ka_YN00', 'ks_YN00']) + '\\n')\n\n        # Filter valid pairs based on sequence data\n        df_pairs = df_pairs[\n            (df_pairs[0].isin(cds.keys())) & (df_pairs[1].isin(cds.keys())) &\n            (df_pairs[0].isin(pep.keys())) & (df_pairs[1].isin(pep.keys()))\n        ]\n\n        pairs = df_pairs[[0, 1]].to_numpy()\n\n        if len(pairs) > 0 and pairs[0][0][:3] == pairs[0][1][:3]:\n            allpairs = []\n            pair_hash = {}\n            for k in pairs:\n                if k[0] + ',' + k[1] in pair_hash or k[1] + ',' + k[0] in pair_hash:\n                    continue\n                else:\n                    pair_hash[k[0] + ',' + k[1]] = 1\n                    pair_hash[k[1] + ',' + k[0]] = 1\n                    allpairs.append(k)\n            pairs = allpairs\n\n        for k in pairs:\n            cds_gene1, cds_gene2 = cds[k[0]], cds[k[1]]\n            cds_gene1.id, cds_gene2.id = 'gene1', 'gene2'\n            pep_gene1, pep_gene2 = pep[k[0]], pep[k[1]]\n            
pep_gene1.id, pep_gene2.id = 'gene1', 'gene2'\n\n            # Write sequences to files\n            SeqIO.write([cds[k[0]], cds[k[1]]], self.pair_cds_file, \"fasta\")\n            SeqIO.write([pep[k[0]], pep[k[1]]], self.pair_pep_file, \"fasta\")\n\n            # Compute Ka/Ks values\n            kaks = self.pair_kaks(['gene1', 'gene2'])\n            if kaks is None:\n                continue\n\n            ks_file.write('\\t'.join([str(i) for i in list(k) + list(kaks)]) + '\\n')\n\n        ks_file.close()\n\n        # Clean up temporary files\n        for file in [\n            self.pair_pep_file, self.pair_cds_file, self.mrtrans, self.pair_yn,\n            self.prot_align_file, '2YN.dN', '2YN.dS', '2YN.t', 'rst', 'rst1', 'yn00.ctl', 'rub'\n        ]:\n            try:\n                os.remove(file)\n            except OSError:\n                pass\n\n    def pair_kaks(self, k):\n        self.align()\n        pal = self.pal2nal()\n        if not pal:\n            return []\n\n        kaks = self.run_yn00()\n        if kaks is None:\n            return []\n\n        kaks_new = [\n            kaks[k[0]][k[1]]['NG86']['dN'], kaks[k[0]][k[1]]['NG86']['dS'],\n            kaks[k[0]][k[1]]['YN00']['dN'], kaks[k[0]][k[1]]['YN00']['dS']\n        ]\n        return kaks_new\n\n    def align(self):\n        if self.align_software == 'mafft':\n            try:\n                command = [self.mafft_path, '--quiet', self.pair_pep_file, '>', self.prot_align_file]\n                subprocess.run(\" \".join(command), shell=True, check=True)\n            except subprocess.CalledProcessError as e:\n                print(f\"Error while running MAFFT: {e}\")\n\n        elif self.align_software == 'muscle':\n            try:\n                command = [self.muscle_path, '-align', self.pair_pep_file, '-output', self.prot_align_file, '-quiet']\n                subprocess.run(\" \".join(command), shell=True, check=True)\n            except subprocess.CalledProcessError as e:\n         
       print(f\"Error while running Muscle: {e}\")\n\n    def pal2nal(self):\n        args = ['perl', self.pal2nal_path, self.prot_align_file, self.pair_cds_file, '-output paml -nogap', '>' + self.mrtrans]\n        command = ' '.join(args)\n        try:\n            os.system(command)\n        except:\n            return False\n        return True\n\n    def run_yn00(self):\n        yn = yn00.Yn00()\n        yn.alignment = self.mrtrans\n        yn.out_file = self.pair_yn\n        yn.set_options(icode=0, commonf3x4=0, weighting=0, verbose=1)\n\n        try:\n            run_result = yn.run(command=self.yn00_path)\n        except:\n            run_result = None\n        return run_result\n"
  },
  {
    "path": "build/lib/wgdi/ks_peaks.py",
    "content": "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.stats.kde import gaussian_kde\n\nimport wgdi.base as base\n\nclass kspeaks:\n    def __init__(self, options):\n        # Default values\n        self.tandem_length = 200\n        self.figsize = 10, 6.18\n        self.fontsize = 9\n        self.block_length = 3\n        self.area = 0, 3\n        self.tandem =  True\n\n        # Set options passed in\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f'{str(k)} = {v}')\n\n        # Convert string values to lists of floats\n        self.homo = [float(k) for k in self.homo.split(',')]\n        self.ks_area = [float(k) for k in self.ks_area.split(',')]\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n        self.area = [float(k) for k in self.area.split(',')]\n        self.pvalue = float(self.pvalue)\n        self.block_length = int(self.block_length)\n        self.tandem = base.str_to_bool(self.tandem)\n\n    def remove_tandem(self, bkinfo):\n        \"\"\"\n        Remove tandem duplications based on start and end position differences.\n        \"\"\"\n        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()\n        group.loc[:, 'start'] = group.loc[:, 'start1'] - group.loc[:, 'start2']\n        group.loc[:, 'end'] = group.loc[:, 'end1'] - group.loc[:, 'end2']\n        \n        # Drop rows where start or end difference is within tandem length\n        index = group[(group['start'].abs() <= self.tandem_length) | \n                      (group['end'].abs() <= self.tandem_length)].index\n        bkinfo = bkinfo.drop(index)\n        return bkinfo\n\n    def ks_kde(self, df):\n        \"\"\"\n        Perform kernel density estimation (KDE) on Ks data.\n        \"\"\"\n        # Clean up 'ks' column by removing leading underscores\n        df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:]\n        \n        ks = 
df['ks'].str.split('_')\n        arr = []\n        ks_ave = []\n        \n        # Collect individual Ks values and calculate average Ks per row\n        for v in ks.values:\n            v = [float(k) for k in v if float(k) >= 0]\n            if len(v) == 0:\n                continue\n            arr.extend(v)\n            ks_ave.append(sum(v) / len(v))  # Mean of each row's Ks values\n        \n        # KDE for three distributions: median, average, total\n        kdemedian = gaussian_kde(df['ks_median'].values)\n        kdemedian.set_bandwidth(bw_method=kdemedian.factor / 3.)\n        \n        kdeaverage = gaussian_kde(ks_ave)\n        kdeaverage.set_bandwidth(bw_method=kdeaverage.factor / 3.)\n        \n        kdetotal = gaussian_kde(arr)\n        kdetotal.set_bandwidth(bw_method=kdetotal.factor / 3.)\n\n        return [kdemedian, kdeaverage, kdetotal]\n\n    def run(self):\n        \"\"\"\n        Main method to process the data, perform KDE, and generate the plot.\n        \"\"\"\n        plt.rcParams['ytick.major.pad'] = 0\n        fig, ax = plt.subplots(figsize=self.figsize)\n\n        # Read the block info file\n        bkinfo = pd.read_csv(self.blockinfo)\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        bkinfo['length'] = bkinfo['length'].astype(int)\n\n        # Filter based on block length and p-value\n        bkinfo = bkinfo[(bkinfo['length'] > self.block_length) &\n                        (bkinfo['pvalue'] < self.pvalue)]\n\n        # Remove tandem duplications if needed\n        if not self.tandem:\n            bkinfo = self.remove_tandem(bkinfo)\n\n        # Further filtering on the homo<multiple> score range and the median Ks range\n        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] >= self.homo[0]]\n        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] <= self.homo[1]]\n        bkinfo = bkinfo[bkinfo['ks_median'] >= self.ks_area[0]]\n        bkinfo = bkinfo[bkinfo['ks_median'] <= 
self.ks_area[1]]\n\n        # Perform KDE on the Ks data\n        kdemedian, kdeaverage, kdetotal = self.ks_kde(bkinfo)\n\n        # Define the range for the x-axis (Ks values)\n        dist_space = np.linspace(self.area[0], self.area[1], 500)\n\n        # Plot the KDE results\n        ax.plot(dist_space, kdemedian(dist_space), color='red', label='block median')\n        ax.plot(dist_space, kdeaverage(dist_space), color='black', label='block average')\n        ax.plot(dist_space, kdetotal(dist_space), color='blue', label='all pairs')\n\n        # Set plot labels, grid, and limits\n        ax.grid()\n        ax.set_xlabel(r'${K_{s}}$', fontsize=20)\n        ax.set_ylabel('Frequency', fontsize=20)\n        ax.tick_params(labelsize=18)\n        ax.set_xlim(self.area)\n        ax.legend(fontsize=20)\n\n        # Adjust layout for better display\n        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)\n\n        # Save the figure\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n\n        # Save the filtered data to CSV\n        bkinfo.to_csv(self.savefile, index=False)"
  },
  {
    "path": "build/lib/wgdi/ksfigure.py",
    "content": "import re\nimport sys\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\nfrom scipy import stats\n\n\nclass ksfigure():\n    def __init__(self, options):\n        self.figsize = 10, 6.18\n        self.legendfontsize = 30\n        self.labelfontsize = 9\n        self.area = 0, 3\n        self.shadow = True\n        self.mode = 'median'\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(str(k), ' = ', v)\n        if self.xlabel == 'none' or self.xlabel == '':\n            self.xlabel = r'Synonymous nucleotide subsititution (${K_{s}}$)'\n        if self.ylabel == 'none' or self.ylabel == '':\n            self.ylabel = 'kernel density of syntenic blocks'\n        if self.title == 'none' or self.title == '':\n            self.title = ''\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n        self.area = [float(k) for k in self.area.split(',')]\n        self.shadow = base.str_to_bool(self.shadow)\n\n    def Gaussian_distribution(self, t, k):\n        y = np.zeros(len(t))\n        for i in range(0, int((len(k) - 1) / 3)+1):\n            if np.isnan(k[3 * i + 2]):\n                continue\n            k[3 * i + 2] = float(k[3 * i + 2])/np.sqrt(2)\n            k[3 * i + 0] = float(k[3 * i + 0]) * \\\n                np.sqrt(2*np.pi)*float(k[3 * i + 2])\n            y1 = stats.norm.pdf(\n                t, float(k[3 * i + 1]), float(k[3 * i + 2])) * float(k[3 * i + 0])\n            y = y+y1\n        return y\n\n    def run(self):\n        plt.rcParams['ytick.major.pad'] = 0\n        fig, ax = plt.subplots(figsize=self.figsize)\n        ksfit = pd.read_csv(self.ksfit, index_col=0)\n        t = np.arange(self.area[0], self.area[1], 0.0005)\n        col = [k for k in ksfit.columns if re.match('Unnamed:', k)]\n        for index, row in ksfit.iterrows():\n            ax.plot(t, self.Gaussian_distribution(\n                t, row[col].values), 
linestyle=row['linestyle'], color=row['color'], alpha=0.8, label=index, linewidth=row['linewidth'])\n            if self.shadow:\n                # Shade the area under each fitted curve\n                ax.fill_between(t, 0, self.Gaussian_distribution(t, row[col].values), color=row['color'], alpha=0.15, interpolate=True, edgecolor=None, label=index)\n        align = dict(family='Arial', verticalalignment=\"center\",\n                     horizontalalignment=\"center\")\n        ax.set_xlabel(self.xlabel, fontsize=self.labelfontsize,\n                      labelpad=20, **align)\n        ax.set_ylabel(self.ylabel, fontsize=self.labelfontsize,\n                      labelpad=20, **align)\n        ax.set_title(self.title, weight='bold',\n                     fontsize=self.labelfontsize, **align)\n        plt.tick_params(labelsize=10)\n        # Keep one legend entry per label, since a curve and its shaded fill share a label\n        handles, labels = ax.get_legend_handles_labels()\n        df = pd.DataFrame({'handles': handles, 'labels': labels})\n        df.drop_duplicates(subset='labels', keep='first', inplace=True)\n        handles, labels = df['handles'].tolist(), df['labels'].tolist()\n        plt.legend(handles=handles, labels=labels, loc='upper right', prop={\n                   'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})\n        plt.gca().spines['top'].set_visible(False)\n        plt.gca().spines['right'].set_visible(False)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n        sys.exit(0)\n"
  },
  {
    "path": "build/lib/wgdi/peaksfit.py",
    "content": "import re\nimport sys\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.optimize import curve_fit\nfrom scipy.stats import gaussian_kde, linregress\n\nimport wgdi.base as base\n\n\nclass peaksfit():\n    def __init__(self, options):\n        self.figsize = 10, 6.18\n        self.fontsize = 9\n        self.area = 0, 3\n        self.mode = 'median'\n        self.histogram_only = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(str(k), ' = ', v)\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n        self.area = [float(k) for k in self.area.split(',')]\n        self.bins_number = int(self.bins_number)\n        self.peaks = 1\n        self.histogram_only = base.str_to_bool(self.histogram_only)\n\n    def ks_values(self, df):\n        df.loc[df['ks'].str.startswith('_'),'ks']= df.loc[df['ks'].str.startswith('_'),'ks'].str[1:]\n        ks = df['ks'].str.split('_')\n        ks_total = []\n        ks_average = []\n        for v in ks.values:\n            ks_total.extend([float(k) for k in v])\n        ks_average = df['ks_average'].values\n        ks_median = df['ks_median'].values\n        return [ks_median, ks_average, ks_total]\n\n    def gaussian_fuc(self, x, *params):\n        y = np.zeros_like(x)\n        for i in range(0, len(params), 3):\n            amp = float(params[i])\n            ctr = float(params[i+1])\n            wid = float(params[i+2])\n            y = y + amp * np.exp(-((x - ctr)/wid)**2)\n        return y\n\n    def kde_fit(self, data, x):\n        kde = gaussian_kde(data)\n        kde.set_bandwidth(bw_method=kde.factor/3.)\n        p = kde(x)\n        guess = [1,1, 1]*self.peaks\n        popt, pcov = curve_fit(self.gaussian_fuc, x, p, guess, maxfev = 80000)\n        popt = [abs(k) for k in popt]\n        data = []\n        y = self.gaussian_fuc(x, *popt)\n        for i in range(0, len(popt), 3):\n            array = [popt[i], 
popt[i+1], popt[i+2]]\n            data.append(self.gaussian_fuc(x, *array))\n        slope, intercept, r_value, p_value, std_err = linregress(p, y)\n        print(\"\\nR-square: \"+str(r_value**2))\n        print(\"The gaussian fitting curve parameters are :\")\n        print('  |  '.join([str(k) for k in popt]))\n        return y, data\n\n    def run(self):\n        plt.rcParams['ytick.major.pad'] = 0\n        fig, ax = plt.subplots(figsize=self.figsize)\n        bkinfo = pd.read_csv(self.blockinfo)\n        ks_median, ks_average, ks_total = self.ks_values(bkinfo)\n        # Select the Ks series named by self.mode ('median', 'average' or 'total')\n        data = {'median': ks_median, 'average': ks_average, 'total': ks_total}[self.mode]\n        data = [k for k in data if self.area[0] <= k <= self.area[1]]\n        x = np.linspace(self.area[0], self.area[1], self.bins_number)\n        n, bins, patches = ax.hist(data, int(\n            self.bins_number), density=1, facecolor='blue', alpha=0.3, label='Histogram')\n        # Overlay the Gaussian fit unless only the histogram was requested\n        if not self.histogram_only:\n            y, fit = self.kde_fit(data, x)\n            ax.plot(x, y, color='black', linestyle='-', label='Gaussian fitting')\n        ax.grid()\n        align = dict(family='Arial', verticalalignment=\"center\",\n                     horizontalalignment=\"center\")\n        ax.set_xlabel(r'${K_{s}}$', fontsize=20)\n        ax.set_ylabel('Frequency', fontsize=20)\n        ax.tick_params(labelsize=18)\n        ax.legend(fontsize=20)\n        ax.set_xlim(self.area)\n        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n        sys.exit(0)\n"
  },
  {
    "path": "build/lib/wgdi/pindex.py",
    "content": "import os\nimport sys\n\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass pindex():\n    def __init__(self, options):\n        self.remove_delta = True\n        self.position = 'order'\n        self.retention = 0.05\n        self.diff = 0.05\n        self.gap = 50\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n        self.gap = int(self.gap)\n        self.retention = float(self.retention)\n        self.diff = float(self.diff)\n\n    def Pindex(self, sub1, sub2):\n        r1 = self.retain(sub1)\n        r2 = self.retain(sub2)\n        r = []\n        for i in range(len(r2)):\n            if(r1[i] < self.retention or r2[i] < self.retention):\n                r.append(0)\n                continue\n            d = (r1[i]-r2[i])/(r1[i]+r2[i])*0.5\n            if d > self.diff:\n                r.append(1)\n            elif -d > self.diff:\n                r.append(-1)\n            else:\n                r.append(0)\n        a, b, c = len([i for i in r if i == 1]), len(\n            [i for i in r if i == -1]), len([i for i in r if i == 0])\n        return [a, -b, c, len(r)]\n\n    def retain(self, arr):\n        a = []\n        for i in range(0, len(arr), 2*self.gap):\n            start, end = i-self.gap, i+self.gap\n            genenum, retainnum = 0, 0\n            for j in range(start, end):\n                if((j >= int(len(arr))) or (j < 0)):\n                    continue\n                else:\n                    retainnum += arr[j]\n                    genenum += 1\n            a.append(float(retainnum/genenum))\n        return a\n\n    def run(self):\n        alignment = pd.read_csv(self.alignment, header=None, index_col=0)\n        alignment.replace(r'\\w+', 1, regex=True, inplace=True)\n        alignment.replace('.', 0, inplace=True)\n        alignment.fillna(0, inplace=True)\n        gff = base.newgff(self.gff)\n        lens = base.newlens(self.lens, 
self.position)\n        gff = gff[gff['chr'].isin(lens.index)]\n        alignment = alignment.join(gff[['chr', self.position]], how='left')\n        alignment.dropna(axis=0, how='any', inplace=True)\n        p = self.cal_pindex(alignment)\n        print('Polyploidy-index: ', p)\n        sys.exit(0)\n\n    def cal_pindex(self, alignment):\n        data, df = [], []\n        columns = alignment.columns[:-2].tolist()\n        for i in range(len(columns)-1):\n            for j in range(i+1, len(columns)):\n                b = []\n                for chr, group in alignment.groupby('chr'):\n                    sub1 = group.loc[:, columns[i]].tolist()\n                    sub2 = group.loc[:, columns[j]].tolist()\n                    p = self.Pindex(sub1, sub2)\n                    b.append(p)\n                    df.append([i, j, chr]+p)\n                sub_diver = sum([abs(k[0]+k[1]) for k in b])\n                if self.remove_delta == True:\n                    sub_total = sum([abs(k[1])+abs(k[0]) for k in b])\n                    if sub_total == 0:\n                        c = 0\n                    else:\n                        c = sub_diver/sub_total\n                else:\n                    sub_total = sum([abs(k[1])+abs(k[0])+abs(k[2]) for k in b])\n                    c = sub_diver/sub_total\n                data.append(c)\n        df = pd.DataFrame(df, columns=[\n                          'sub1', 'sub2', 'chr', 'sub1_high', 'sub2_high', 'No_diff', 'Total'])\n        df['sub2_high'] = df['sub2_high'].abs()\n        self.infomation(df)\n        print('\\nPolyploidy-index between subgenomes are ', data)\n        return sum(data)/len(data)\n\n    def turn_percentage(self, x):\n        return '(%.2f%%)' % (x * 100)\n\n    def infomation(self, df):\n        data = []\n        for names, group in df.groupby(['sub1', 'sub2']):\n            newgroup = pd.concat([group.head(1), group],\n                                 axis=0, ignore_index=True)\n            cols = 
['sub1_high', 'sub2_high', 'No_diff', 'Total']\n            newgroup.loc[0, cols] = group.loc[:, cols].sum()\n            group1 = newgroup.copy()\n            group1[cols] = group1[cols].astype(str)\n            newgroup['sub1_high'] = (\n                newgroup['sub1_high'] / newgroup['Total']).apply(self.turn_percentage)\n            newgroup['sub2_high'] = (\n                newgroup['sub2_high'] / newgroup['Total']).apply(self.turn_percentage)\n            newgroup['No_diff'] = (\n                newgroup['No_diff'] / newgroup['Total']).apply(self.turn_percentage)\n            newgroup['Total'] = (\n                newgroup['Total'] / group['Total'].sum()).apply(self.turn_percentage)\n            newgroup[cols] = group1[cols]+newgroup[cols]\n            group_list = []\n            a = newgroup[['chr']+cols].columns.to_numpy()\n            a[0] = 'Chromosome'\n            a[1], a[2] = 'Sub_'+str(names[0]+1), 'Sub_'+str(names[1]+1)\n            group_list.append(a)\n            b = newgroup[['chr']+cols].to_numpy()\n            b[0][0] = 'Total'\n            for k in b:\n                group_list.append(k)\n            group_list = np.array(group_list).T\n            for k in group_list:\n                data.append(k)\n        data = pd.DataFrame(data)\n        data.to_csv(self.savefile, header=None, index=None)\n"
  },
  {
    "path": "build/lib/wgdi/polyploidy_classification.py",
    "content": "import pandas as pd\nimport wgdi.base as base\n\n\nclass polyploidy_classification:\n    def __init__(self, options):\n        self.same_protochromosome = False\n        self.same_subgenome = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n        \n        self.same_protochromosome = base.str_to_bool(self.same_protochromosome)\n        self.same_subgenome = base.str_to_bool(self.same_subgenome)\n        \n        # Initialize classid with a default value if not provided\n        self.classid = [str(k) for k in getattr(self, 'classid', 'class1,class2').split(',')]\n\n    def run(self):\n        # Read input files\n        ancestor_left = base.read_classification(self.ancestor_left)\n        ancestor_top = base.read_classification(self.ancestor_top)\n        bkinfo = pd.read_csv(self.blockinfo)\n\n        # Ensure chr1 and chr2 are treated as strings\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n\n        # Filter rows where chr1 and chr2 match ancestor values\n        bkinfo = bkinfo[bkinfo['chr1'].isin(ancestor_left[0].values) & bkinfo['chr2'].isin(ancestor_top[0].values)]\n\n        # Initialize additional columns\n        bkinfo[self.classid[0]] = 0\n        bkinfo[self.classid[1]] = 0\n        bkinfo[self.classid[0] + '_color'] = ''\n        bkinfo[self.classid[1] + '_color'] = ''\n        bkinfo['diff'] = 0.0\n\n        # Processing the first classification (ancestor_left vs chr1)\n        for name, group in bkinfo.groupby('chr1'):\n            d1 = ancestor_left[ancestor_left[0] == name]\n            for index1, row1 in group.iterrows():\n                a, b = sorted([row1['start1'], row1['end1']])\n                a, b = int(a), int(b)\n                for index2, row2 in d1.iterrows():\n                    c, d = sorted([row2[1], row2[2]])\n                    h = len([k for k in range(a, b) if k in range(c, d)]) / (b - 
a)\n                    if h > bkinfo.loc[index1, 'diff']:\n                        bkinfo.loc[index1, 'diff'] = float(h)\n                        bkinfo.loc[index1, self.classid[0]] = row2[4]\n                        bkinfo.loc[index1, self.classid[0] + '_color'] = row2[3]\n\n        # Reset 'diff' and process the second classification (ancestor_top vs chr2)\n        bkinfo['diff'] = 0.0\n        for name, group in bkinfo.groupby('chr2'):\n            d2 = ancestor_top[ancestor_top[0] == name]\n            for index1, row1 in group.iterrows():\n                a, b = sorted([row1['start2'], row1['end2']])\n                a, b = int(a), int(b)\n                for index2, row2 in d2.iterrows():\n                    c, d = sorted([row2[1], row2[2]])\n                    # Fraction of the block range [a, b) covered by the ancestral region [c, d)\n                    h = len([k for k in range(a, b) if k in range(c, d)]) / (b - a)\n                    if h > bkinfo.loc[index1, 'diff']:\n                        bkinfo.loc[index1, 'diff'] = float(h)\n                        bkinfo.loc[index1, self.classid[1]] = row2[4]\n                        bkinfo.loc[index1, self.classid[1] + '_color'] = row2[3]\n\n        # Optionally keep only blocks whose two sides share a protochromosome color\n        # and/or a subgenome class\n        if self.same_protochromosome:\n            bkinfo = bkinfo[bkinfo[self.classid[1] + '_color'] == bkinfo[self.classid[0] + '_color']]\n        if self.same_subgenome:\n            bkinfo = bkinfo[bkinfo[self.classid[1]] == bkinfo[self.classid[0]]]\n\n        # Save the result to a CSV file\n        bkinfo.to_csv(self.savefile, index=False)\n"
  },
  {
    "path": "build/lib/wgdi/retain.py",
    "content": "import matplotlib.pyplot as plt\nimport pandas as pd\nimport wgdi.base as base\n\nclass retain:\n    def __init__(self, options):\n        self.position = 'order'\n        \n        # Initialize the options by setting attributes dynamically\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{str(k)} = {v}\")\n\n        # Handle the ylim parameter, which defines the y-axis limits\n        self.ylim = [float(k) for k in self.ylim.split(',')] if hasattr(self, 'ylim') else [0, 1]\n        \n        # Handle the colors and figsize parameters\n        self.colors = [str(k) for k in self.colors.split(',')]\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n\n    def run(self):\n        # Load GFF and lens data\n        gff = base.newgff(self.gff)\n        lens = base.newlens(self.lens, self.position)\n        \n        # Filter GFF data based on lens chromosome index\n        gff = gff[gff['chr'].isin(lens.index)]\n        \n        # Load alignment data and join with GFF\n        alignment = pd.read_csv(self.alignment, header=None, index_col=0)\n        alignment = alignment.join(gff[['chr', self.position]], how='left')\n        \n        # Perform alignment processing\n        self.retain = self.align_chr(alignment)\n        \n        # Save the processed data to a file\n        self.retain[self.retain.columns[:-2]].to_csv(self.savefile, sep='\\t', header=None)\n        \n        # Create a figure for plotting\n        fig, axs = plt.subplots(len(lens), 1, sharex=True, sharey=True, figsize=tuple(self.figsize))\n        fig.add_subplot(111, frameon=False)\n        \n        align = dict(family='DejaVu Sans', verticalalignment=\"center\", horizontalalignment=\"center\")\n\n        \n        # Hide all the spines and ticks on the plot\n        for spine in plt.gca().spines.values():\n            spine.set_visible(False)\n        plt.tick_params(top=False, bottom=False, left=False, right=False, 
labelleft=False, labelbottom=False)\n        \n        # Group the retain data by chromosome and plot each chromosome's data\n        groups = self.retain.groupby('chr')\n        for i, chr_name in enumerate(lens.index):\n            group = groups.get_group(chr_name)\n\n            if len(lens) == 1:\n                for j, col in enumerate(self.retain.columns[:-2]):\n                    axs.plot(group['order'].values, group[col].values,\n                                linestyle='-', color=self.colors[j], linewidth=1)\n                axs.spines['right'].set_visible(False)\n                axs.spines['top'].set_visible(False)\n                axs.set_ylim(self.ylim)\n                axs.tick_params(labelsize=12)                \n            else:\n                # Plot each column's data for the current chromosome\n                for j, col in enumerate(self.retain.columns[:-2]):\n                    axs[i].plot(group['order'].values, group[col].values,\n                                linestyle='-', color=self.colors[j], linewidth=1)\n            \n                # Hide the right and top spines for each subplot\n                axs[i].spines['right'].set_visible(False)\n                axs[i].spines['top'].set_visible(False)\n                axs[i].set_ylim(self.ylim)\n                axs[i].tick_params(labelsize=12)\n\n        for i, chr_name in enumerate(lens.index):\n            if len(lens) == 1:\n                x, y = axs.get_xlim()[1] * 0.90, axs.get_ylim()[1] * 0.8\n                axs.text(x, y, f\"{self.refgenome} {chr_name}\", fontsize=14, **align)\n            else:\n                # Add a label for the reference genome and chromosome\n                x, y = axs[i].get_xlim()[1] * 0.90, axs[i].get_ylim()[1] * 0.8\n                axs[i].text(x, y, f\"{self.refgenome} {chr_name}\", fontsize=14, **align)\n        \n        # Adjust layout and save the figure as an image\n        plt.ylabel(f\"{self.ylabel}\\n\\n\\n\\n\", fontsize=18, **align)\n     
   plt.subplots_adjust(left=0.1, right=0.95, top=0.95, bottom=0.05)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n\n    def align_chr(self, alignment):\n        \"\"\"\n        Perform the alignment processing for each chromosome by updating the values.\n        \"\"\"\n        for i in alignment.columns[:-2]:\n            # Update values: set '1' for valid values, '0' for invalid, and fill NaN with 0\n            alignment.loc[alignment[i].str.contains(r'\\w', na=False), i] = 1\n            alignment.loc[alignment[i] == '.', i] = 0\n            alignment.loc[alignment[i] == ' ', i] = 0\n            alignment[i] = alignment[i].astype('float64').fillna(0)\n            \n            # Apply the moving average function to each group by chromosome\n            for chr_name, group in alignment.groupby(['chr']):\n                a = self.moving_average(group[i].values.tolist())\n                alignment.loc[group.index, i] = a\n        return alignment\n\n    def moving_average(self, arr):\n        \"\"\"\n        Calculate a moving average over a specified window size.\n        This function smooths the input array using a sliding window.\n        \"\"\"\n        a = []\n        for i in range(len(arr)):\n            # Define the window range\n            start, end = max(0, i - int(self.step)), min(len(arr), i + int(self.step))\n            ave = sum(arr[start:end]) / (end - start)\n            a.append(ave)\n        return a\n"
  },
  {
    "path": "build/lib/wgdi/run.py",
    "content": "import argparse\nimport os\nimport shutil\nimport sys\n\nimport wgdi\nimport wgdi.base as base\nfrom wgdi.align_dotplot import align_dotplot\nfrom wgdi.block_correspondence import block_correspondence\nfrom wgdi.block_info import block_info\nfrom wgdi.block_ks import block_ks\nfrom wgdi.circos import circos\nfrom wgdi.dotplot import dotplot\nfrom wgdi.karyotype import karyotype\nfrom wgdi.karyotype_mapping import karyotype_mapping\nfrom wgdi.ks import ks\nfrom wgdi.ks_peaks import kspeaks\nfrom wgdi.ksfigure import ksfigure\nfrom wgdi.peaksfit import peaksfit\nfrom wgdi.pindex import pindex\nfrom wgdi.polyploidy_classification import polyploidy_classification\nfrom wgdi.retain import retain\nfrom wgdi.run_colliearity import mycollinearity\nfrom wgdi.trees import trees\nfrom wgdi.ancestral_karyotype import ancestral_karyotype\nfrom wgdi.ancestral_karyotype_repertoire import ancestral_karyotype_repertoire\nfrom wgdi.shared_fusion import shared_fusion\nfrom wgdi.fusion_positions_database import fusion_positions_database\nfrom wgdi.fusions_detection import fusions_detection\n\n\n# Argument parser setup\nparser = argparse.ArgumentParser(\n    prog='wgdi', usage='%(prog)s [options]', epilog=\"\",\n    formatter_class=argparse.RawDescriptionHelpFormatter\n)\n\nparser.description = '''\\\nWGDI(Whole-Genome Duplication Integrated): A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes.\n\n    https://wgdi.readthedocs.io/en/latest/\n    -------------------------------------- \n'''\n\nparser.add_argument(\"-v\", \"--version\", action='version', version='0.75')\nparser.add_argument(\"-d\", dest=\"dotplot\", help=\"Show homologous gene dotplot\")\nparser.add_argument(\"-icl\", dest=\"improvedcollinearity\", help=\"Improved version of ColinearScan \")\nparser.add_argument(\"-ks\", dest=\"calks\", help=\"Calculate Ka/Ks for homologous gene pairs by YN00\")\nparser.add_argument(\"-bk\", dest=\"blockks\", help=\"Show 
Ks of blocks in a dotplot\")\nparser.add_argument(\"-bi\", dest=\"blockinfo\", help=\"Collinearity and Ks speculate whole genome duplication\")\nparser.add_argument(\"-c\", dest=\"correspondence\", help=\"Extract event-related genomic alignment\")\nparser.add_argument(\"-kp\", dest=\"kspeaks\", help=\"A simple way to get ks peaks\")\nparser.add_argument(\"-kf\", dest=\"ksfigure\", help=\"A simple way to draw ks distribution map\")\nparser.add_argument(\"-pf\", dest=\"peaksfit\", help=\"Gaussian fitting of ks distribution\")\nparser.add_argument(\"-pc\", dest=\"polyploidy_classification\", help=\"Polyploid distinguish among subgenomes\")\nparser.add_argument(\"-a\", dest=\"alignment\", help=\"Show event-related genomic alignment in a dotplot\")\nparser.add_argument(\"-k\", dest=\"karyotype\", help=\"Show genome evolution from reconstructed ancestors\")\nparser.add_argument(\"-ak\", dest=\"ancestral_karyotype\", help=\"Generation of ancestral karyotypes from chromosomes that retain same structures in genomes\")\nparser.add_argument(\"-akr\", dest=\"ancestral_karyotype_repertoire\", help=\"Incorporate genes from collinearity blocks into the ancestral karyotype repertoire\")\nparser.add_argument(\"-km\", dest=\"karyotype_mapping\", help=\"Mapping from the known karyotype result to this species\")\nparser.add_argument(\"-fpd\", dest=\"fusion_positions_database\", help=\"Extract the fusion positions dataset\")\nparser.add_argument(\"-fd\", dest=\"fusions_detection\", help=\"Determine whether these fusion events occur in other genomes\")\nparser.add_argument(\"-sf\", dest=\"shared_fusion\", help=\"Quickly find shared fusions between species\")\nparser.add_argument(\"-at\", dest=\"alignmenttrees\", help=\"Collinear genes construct phylogenetic trees\")\nparser.add_argument(\"-p\", dest=\"pindex\", help=\"Polyploidy-index characterize the degree of divergence between subgenomes of a polyploidy\")\nparser.add_argument(\"-r\", dest=\"retain\", help=\"Show subgenomes in gene 
retention or genome fractionation\")\nparser.add_argument(\"-ci\", dest=\"circos\", help=\"A simple way to run circos\")\nparser.add_argument(\"-conf\", dest=\"configure\", help=\"Display and modify the environment variable\")\n\nargs = parser.parse_args()\n\n# Function to run subprograms based on options\ndef run_subprogram(program, conf, name):\n    options = base.load_conf(conf, name)\n    r = program(options)\n    r.run()\n\n# Function to configure environment\ndef run_configure():\n    base.rewrite(args.configure, 'ini')\n\n# Main function to decide which module to run based on input arguments\ndef module_to_run(argument, conf):\n    switcher = {\n        'dotplot': (dotplot, conf, 'dotplot'),\n        'correspondence': (block_correspondence, conf, 'correspondence'),\n        'alignment': (align_dotplot, conf, 'alignment'),\n        'retain': (retain, conf, 'retain'),\n        'blockks': (block_ks, conf, 'blockks'),\n        'blockinfo': (block_info, conf, 'blockinfo'),\n        'calks': (ks, conf, 'ks'),\n        'circos': (circos, conf, 'circos'),\n        'kspeaks': (kspeaks, conf, 'kspeaks'),\n        'peaksfit': (peaksfit, conf, 'peaksfit'),\n        'ksfigure': (ksfigure, conf, 'ksfigure'),\n        'pindex': (pindex, conf, 'pindex'),\n        'alignmenttrees': (trees, conf, 'alignmenttrees'),\n        'improvedcollinearity': (mycollinearity, conf, 'collinearity'),\n        'configure': run_configure,\n        'polyploidy_classification': (polyploidy_classification, conf, 'polyploidy classification'),\n        'karyotype': (karyotype, conf, 'karyotype'),\n        'ancestral_karyotype': (ancestral_karyotype, conf, 'ancestral_karyotype'),\n        'karyotype_mapping': (karyotype_mapping, conf, 'karyotype_mapping'),\n        'ancestral_karyotype_repertoire': (ancestral_karyotype_repertoire, conf, 'ancestral_karyotype_repertoire'),\n        'shared_fusion': (shared_fusion, conf, 'shared_fusion'),\n        'fusion_positions_database': 
(fusion_positions_database, conf, 'fusion_positions_database'),\n        'fusions_detection': (fusions_detection, conf, 'fusions_detection'),\n    }\n    \n    if argument == 'configure':\n        run_configure()\n    else:\n        program, conf, name = switcher.get(argument)\n        if program:\n            run_subprogram(program, conf, name)\n\n\n# Main entry point\ndef main():\n    path = wgdi.__path__[0]\n    options = {\n        'dotplot': 'dotplot.conf',\n        'correspondence': 'corr.conf',\n        'alignment': 'align.conf',\n        'retain': 'retain.conf',\n        'blockks': 'blockks.conf',\n        'blockinfo': 'blockinfo.conf',\n        'calks': 'ks.conf',\n        'circos': 'circos.conf',\n        'kspeaks': 'kspeaks.conf',\n        'ksfigure': 'ksfigure.conf',\n        'pindex': 'pindex.conf',\n        'alignmenttrees': 'alignmenttrees.conf',\n        'peaksfit': 'peaksfit.conf',\n        'configure': 'conf.ini',\n        'improvedcollinearity': 'collinearity.conf',\n        'polyploidy_classification': 'polyploidy_classification.conf',\n        'karyotype': 'karyotype.conf',\n        'ancestral_karyotype': 'ancestral_karyotype.conf',\n        'ancestral_karyotype_repertoire': 'ancestral_karyotype_repertoire.conf',\n        'karyotype_mapping': 'karyotype_mapping.conf',\n        'shared_fusion': 'shared_fusion.conf',\n        'fusion_positions_database': 'fusion_positions_database.conf',\n        'fusions_detection': 'fusions_detection.conf',\n    }\n\n    for arg in vars(args):\n        value = getattr(args, arg)\n        if value is not None:\n            if value in ['?', 'help', 'example']:\n                with open(os.path.join(path, 'example', options[arg])) as f:\n                    print(f.read())\n                \n                if arg == 'ksfigure' and not os.path.exists('ks_fit_result.csv'):\n                    shutil.copy2(os.path.join(wgdi.__path__[0], 'example/ks_fit_result.csv'), os.getcwd())\n            elif not 
os.path.exists(value):\n                print(f'{value} does not exist')\n                sys.exit(1)\n            else:\n                module_to_run(arg, value)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "build/lib/wgdi/run_colliearity.py",
    "content": "import gc\nimport re\nimport sys\nfrom multiprocessing import Pool\n\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\nimport wgdi.collinearity as improvedcollinearity\n\n\nclass mycollinearity():\n    def __init__(self, options):\n        # Initialize parameters with default values\n        self.repeat_number = 10\n        self.multiple = 1\n        self.score = 100\n        self.evalue = 1e-5\n        self.blast_reverse = False\n        self.over_gap = 5\n        self.comparison = 'genomes'\n        self.options = options\n\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{str(k)} = {v}\")\n        self.position = 'order'\n        # Parse grading values\n        if hasattr(self, 'grading'):\n            self.grading = [int(k) for k in self.grading.split(',')]\n        else:\n            self.grading = [50, 40, 25]\n        # Ensure process is an integer\n        if hasattr(self, 'process'):\n            self.process = int(self.process)\n        else:\n            self.process = 4\n        self.over_gap = int(self.over_gap)\n        # Store the converted flag; the return value was previously discarded\n        self.blast_reverse = base.str_to_bool(self.blast_reverse)\n\n    def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):\n        bluenum = rednum\n        blast = blast.sort_values(by=[0, 11], ascending=[True, False])\n        def assign_grading(group):\n            group['cumcount'] = group.groupby(1).cumcount()\n            group = group[group['cumcount'] <= repeat_number]\n            group['grading'] = pd.cut(\n                group['cumcount'],\n                bins=[-1, 0, bluenum, repeat_number],\n                labels=self.grading,\n                right=True\n            )\n            return group\n        newblast = blast.groupby(['chr1', 'chr2']).apply(assign_grading).reset_index(drop=True)\n        newblast['grading'] = newblast['grading'].astype(int)\n        return newblast[newblast['grading'] > 0]\n    \n    def deal_blast_for_genomes(self, blast, 
rednum, repeat_number):\n        # Initialize the grading column\n        blast['grading'] = 0\n        \n        # Define the blue number as the sum of rednum and the predefined constant\n        bluenum = 4 + rednum\n        \n        # Get the indices for each group by sorting the 11th column in descending order\n        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()\n                for name, group in blast.groupby([0])]\n        \n        # Split the indices into red, blue, and gray groups\n        reddata = np.array([k[:rednum] for k in index], dtype=object)\n        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)\n        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)\n        \n        # Concatenate the results into flat lists\n        redindex = np.concatenate(reddata) if reddata.size else []\n        blueindex = np.concatenate(bluedata) if bluedata.size else []\n        grayindex = np.concatenate(graydata) if graydata.size else []\n\n        # Update the grading column based on the group indices\n        blast.loc[redindex, 'grading'] = self.grading[0]\n        blast.loc[blueindex, 'grading'] = self.grading[1]\n        blast.loc[grayindex, 'grading'] = self.grading[2]\n\n        # Return only the rows with non-zero grading\n        return blast[blast['grading'] > 0]\n\n    def run(self):\n        # Read and process lens files\n        lens1 = base.newlens(self.lens1, 'order')\n        lens2 = base.newlens(self.lens2, 'order')\n        # Read and process gff files\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n        # Filter gff data based on lens indices\n        gff1 = gff1[gff1['chr'].isin(lens1.index)]\n        gff2 = gff2[gff2['chr'].isin(lens2.index)]\n        # Process blast data\n\n        blast = base.newblast(self.blast, int(self.score), float(self.evalue),gff1, gff2, self.blast_reverse)\n\n        # Map positions and 
chromosome information\n        blast['loc1'] = blast[0].map(gff1[self.position])\n        blast['loc2'] = blast[1].map(gff2[self.position])\n        blast['chr1'] = blast[0].map(gff1['chr'])\n        blast['chr2'] = blast[1].map(gff2['chr'])\n        # Apply blast filtering and grading\n        if self.comparison.lower() == 'genomes':\n            blast = self.deal_blast_for_genomes(blast, int(self.multiple), int(self.repeat_number))\n        if self.comparison.lower() == 'chromosomes':\n            blast = self.deal_blast_for_chromosomes(blast, int(self.multiple), int(self.repeat_number))\n        print(f\"The filtered homologous gene pairs are {len(blast)}.\\n\")\n        if len(blast) < 1:\n            print(\"Stopped!\\n\\nIt may be that the id1 and id2 in the BLAST file do not match with (gff1, lens1) and (gff2, lens2).\")\n            sys.exit(1)\n        # Group blast data by 'chr1' and 'chr2'\n        total = []\n        for (chr1, chr2), group in blast.groupby(['chr1', 'chr2']):\n            total.append([chr1, chr2, group])\n        del blast, group\n        gc.collect()\n        # Determine chunk size for multiprocessing\n        n = int(np.ceil(len(total) / float(self.process)))\n        result, data = '', []\n        try:\n            # Initialize multiprocessing Pool\n            pool = Pool(self.process)\n            for i in range(0, len(total), n):\n                # Apply single_pool function asynchronously\n                data.append(pool.apply_async(\n                    self.single_pool, args=(total[i:i + n], gff1, gff2, lens1, lens2)\n                ))\n            pool.close()\n            pool.join()\n        except:\n            pool.terminate()\n        for k in data:\n            # Collect results from async tasks\n            text = k.get()\n            if text:\n                result += text\n        # Write final output to file\n        result = re.split('\\n', result)\n        fout = open(self.savefile, 'w')\n        num = 1\n     
   for line in result:\n            if re.match(r\"# Alignment\", line):\n                # Replace alignment number\n                s = f'# Alignment {num}:'\n                fout.write(s + line.split(':')[1] + '\\n')\n                num += 1\n                continue\n            if len(line) > 0:\n                fout.write(line + '\\n')\n        fout.close()\n        sys.exit(0)\n\n    def single_pool(self, group, gff1, gff2, lens1, lens2):\n        text = ''\n        for bk in group:\n            chr1, chr2 = str(bk[0]), str(bk[1])\n            print(f'Running {chr1} vs {chr2}')\n            # Extract and sort points\n            points = bk[2][['loc1', 'loc2', 'grading']].sort_values(\n                by=['loc1', 'loc2'], ascending=[True, True]\n            )\n            # Initialize collinearity analysis\n            collinearity = improvedcollinearity.collinearity(\n                self.options, points)\n            data = collinearity.run()\n            if not data:\n                continue\n            # Extract gene information\n            gf1 = gff1[gff1['chr'] == chr1].reset_index().set_index('order')[[1, 'strand']]\n            gf2 = gff2[gff2['chr'] == chr2].reset_index().set_index('order')[[1, 'strand']]\n            n = 1\n            for block, evalue, score in data:\n                if len(block) < self.over_gap:\n                    continue\n                # Map gene names and strands\n                block['name1'] = block['loc1'].map(gf1[1])\n                block['name2'] = block['loc2'].map(gf2[1])\n                block['strand1'] = block['loc1'].map(gf1['strand'])\n                block['strand2'] = block['loc2'].map(gf2['strand'])\n                block['strand'] = np.where(\n                    block['strand1'] == block['strand2'], '1', '-1'\n                )\n                # Prepare text output\n                block['text'] = block.apply(\n                    lambda x: f\"{x['name1']} {x['loc1']} {x['name2']} {x['loc2']} 
{x['strand']}\\n\",\n                    axis=1\n                )\n                # Determine alignment mark\n                a, b = block['loc2'].head(2).values\n                mark = 'plus' if a < b else 'minus'\n                # Append alignment information\n                text += f'# Alignment {n}: score={score} pvalue={evalue} N={len(block)} {chr1}&{chr2} {mark}\\n'\n                text += ''.join(block['text'].values)\n                n += 1\n        return text"
  },
  {
    "path": "build/lib/wgdi/shared_fusion.py",
    "content": "import pandas as pd\nimport wgdi.base as base\n\nclass shared_fusion:\n    def __init__(self, options):\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n        \n        # Handle classid and limit_length options\n        self.classid = [str(k) for k in self.classid.split(',')] if hasattr(self, 'classid') else ['class1', 'class2']\n        self.limit_length = int(self.limit_length) if hasattr(self, 'limit_length') else 20\n        \n        # Clean and split lens files\n        self.lens1 = self.lens1.replace(' ', '').split(',')\n        self.lens2 = self.lens2.replace(' ', '').split(',')\n\n    def run(self):\n        # Read classification files and block information\n        ancestor_left = base.read_classification(self.ancestor_left)\n        ancestor_top = base.read_classification(self.ancestor_top)\n        bkinfo = pd.read_csv(self.blockinfo)\n\n        # Preprocess blockinfo columns\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        bkinfo['start1'] = bkinfo['start1'].astype(int)\n        bkinfo['end1'] = bkinfo['end1'].astype(int)\n        bkinfo['start2'] = bkinfo['start2'].astype(int)\n        bkinfo['end2'] = bkinfo['end2'].astype(int)\n        \n        # Filter based on ancestor chromosomes\n        bkinfo = bkinfo[(bkinfo['chr1'].isin(ancestor_left[0].values)) & \n                        (bkinfo['chr2'].isin(ancestor_top[0].values))]\n\n        # Read lens files\n        lens1 = pd.read_csv(self.lens1[0], sep='\\t', header=None)\n        lens2 = pd.read_csv(self.lens2[0], sep='\\t', header=None)\n        lens1[0] = lens1[0].astype(str)\n        lens2[0] = lens2[0].astype(str)\n\n        # Perform block fusion analysis\n        blockinfoout = self.block_fusions(bkinfo, ancestor_left, ancestor_top)\n\n        # Apply filters based on breakpoints and length\n        blockinfoout = 
blockinfoout[(blockinfoout['breakpoints1'] == 1) & \n                                     (blockinfoout['breakpoints2'] == 1)]\n        blockinfoout = blockinfoout[(blockinfoout['break_length1'] >= self.limit_length) & \n                                     (blockinfoout['break_length2'] >= self.limit_length)]\n\n        # Save the filtered block info\n        blockinfoout.to_csv(self.filtered_blockinfo, index=False)\n\n        # Filter lens data based on the blockinfoout\n        lens1 = lens1[lens1[0].isin(blockinfoout['chr1'].values)]\n        lens2 = lens2[lens2[0].isin(blockinfoout['chr2'].values)]\n\n        # Save filtered lens data\n        lens1.to_csv(self.lens1[1], sep='\\t', index=False, header=False)\n        lens2.to_csv(self.lens2[1], sep='\\t', index=False, header=False)\n\n    def block_fusions(self, bkinfo, ancestor_left, ancestor_top):\n        # Initialize new columns in the bkinfo dataframe\n        bkinfo['breakpoints1'] = 0\n        bkinfo['breakpoints2'] = 0\n        bkinfo['break_length1'] = 0\n        bkinfo['break_length2'] = 0\n\n        for index, row in bkinfo.iterrows():\n            # Process species 1 (chr1)\n            a, b = sorted([row['start1'], row['end1']])\n            d1 = ancestor_left[(ancestor_left[0] == row['chr1']) & \n                               (ancestor_left[2] >= a) & (ancestor_left[1] <= b)]\n            if len(d1) > 1:\n                bkinfo.loc[index, 'breakpoints1'] = 1\n                breaklength_max = 0\n                for _, row2 in d1.iterrows():\n                    length_in = len([k for k in range(a, b) if k in range(row2[1], row2[2])])\n                    length_out = (b - a) - length_in\n                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)\n                bkinfo.loc[index, 'break_length1'] = breaklength_max\n\n            # Process species 2 (chr2)\n            c, d = sorted([row['start2'], row['end2']])\n            d2 = ancestor_top[(ancestor_top[0] == 
row['chr2']) & \n                              (ancestor_top[2] >= c) & (ancestor_top[1] <= d)]\n            if len(d2) > 1:\n                bkinfo.loc[index, 'breakpoints2'] = 1\n                breaklength_max = 0\n                for _, row2 in d2.iterrows():\n                    length_in = len([k for k in range(c, d) if k in range(row2[1], row2[2])])\n                    length_out = (d - c) - length_in\n                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)\n                bkinfo.loc[index, 'break_length2'] = breaklength_max\n\n        return bkinfo\n"
  },
  {
    "path": "build/lib/wgdi/trees.py",
    "content": "import os\nimport shutil\nfrom io import StringIO\n\nimport numpy as np\nimport pandas as pd\nfrom Bio import AlignIO, Seq, SeqIO, SeqRecord\nimport subprocess\n\nimport wgdi.base as base\n\n\nclass trees():\n    def __init__(self, options):\n        base_conf = base.config()\n        self.position = 'order'\n        self.alignfile = ''\n        self.align_trimming = ''\n        self.trimming = 'trimal'\n        self.threads = '1'\n        self.minimum = 4\n        self.tree_software = 'iqtree'\n        self.delete_detail = True\n        for k, v in base_conf:\n            setattr(self, str(k), v)\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(str(k), ' = ', v)\n        if hasattr(self, 'codon_position'):\n            self.codon_position = [\n                int(k)-1 for k in self.codon_position.split(',')]\n        else:\n            self.codon_position = [0, 1, 2]\n        self.delete_detail = base.str_to_bool(self.delete_detail)\n\n    def grouping(self, alignment):\n        data = []\n        indexs = []\n        if not os.path.exists(self.dir):\n            os.makedirs(self.dir)\n        sequence = SeqIO.to_dict(SeqIO.parse(self.sequence_file, \"fasta\"))\n        if hasattr(self, 'cds_file'):\n            seq_cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, \"fasta\"))\n        for index, row in alignment.iterrows():\n            file = base.gen_md5_id(str(row.values))\n            self.sequencefile = os.path.join(self.dir, file+'.fasta')\n            self.alignfile = os.path.join(self.dir, file+'.aln')\n            self.align_trimming = self.alignfile+'.trimming'\n            self.treefile = os.path.join(self.dir, file+'.aln.treefile')\n            if os.path.isfile(self.treefile) and os.path.isfile(self.alignfile):\n                data.append(self.treefile)\n                indexs.append(index)\n                continue\n            ids = []\n            ids_cds = []\n            for i in 
range(len(row)):\n                if type(row[i]) == float and np.isnan(row[i]):\n                    continue\n                gene_sequence = sequence[row[i]]\n                gene_sequence.id = str(int(i)+1)\n                gene_sequence.description = ''\n                ids.append(gene_sequence)\n            SeqIO.write(ids, self.sequencefile, \"fasta\")\n            self.align()\n            if hasattr(self, 'cds_file'):\n                self.seqcdsfile = os.path.join(self.dir, file+'.cds.fasta')\n                for i in range(len(row)):\n                    if type(row[i]) == float and np.isnan(row[i]):\n                        continue\n                    gene_cds = seq_cds[row[i]]\n                    gene_cds.id = str(int(i)+1)\n                    ids_cds.append(gene_cds)\n                SeqIO.write(ids_cds, self.seqcdsfile, \"fasta\")\n                self.pal2nal()\n                self.codon()\n            if self.trimming.upper() == 'TRIMAL':\n                self.trimal()\n            if self.trimming.upper() == 'DIVVIER':\n                self.divvier()\n            self.buildtrees()\n            if os.path.isfile(self.treefile):\n                data.append(self.treefile)\n        return data\n\n    def codon(self):\n        if self.codon_position == [0, 1, 2]:\n            shutil.move(self.alignfile+'.mrtrans', self.alignfile)\n            return True\n        records = list(SeqIO.parse(self.alignfile+'.mrtrans', 'fasta'))\n        if len(records) == 0:\n            return False\n        newrecords = []\n        def final_list(test_list, x, y): return [\n            test_list[i+j] for i in range(0, len(test_list), x) for j in y]\n        for k in records:\n            if len(k.seq) % 3 > 0:\n                return False\n            seq = final_list(k.seq, 3, self.codon_position)\n            k.seq = ''.join(seq)\n            newrecords.append(SeqRecord.SeqRecord(\n                Seq.Seq(k.seq), id=k.id, description=''))\n        
SeqIO.write(newrecords, self.alignfile, 'fasta')\n        return True\n\n    def pal2nal(self):\n        args = ['perl', self.pal2nal_path, self.alignfile,\n                self.seqcdsfile, '-output fasta', '>'+self.alignfile+'.mrtrans']\n        command = ' '.join(args)\n        try:\n            os.system(command)\n        except:\n            return False\n        return True\n\n    def align(self):\n        if self.align_software == 'mafft':\n            try:\n                command = [self.mafft_path,'--quiet', self.sequencefile, '>', self.alignfile]\n                subprocess.run(\" \".join(command), shell=True, check=True)\n            except subprocess.CalledProcessError as e:\n                print(f\"Error while running MAFFT: {e}\")\n\n        if self.align_software == 'muscle':\n            try:\n                command = [self.muscle_path,'-align', self.sequencefile, '-output', self.alignfile, '-quiet']\n                subprocess.run(\" \".join(command), shell=True, check=True)\n            except subprocess.CalledProcessError as e:\n                print(f\"Error while running Muscle: {e}\")\n\n    def trimal(self):\n        args = [self.trimal_path, '-in', self.alignfile,\n                '-out', self.align_trimming, '-automated1']\n        command = ' '.join(args)\n        try:\n            os.system(command)\n        except:\n            return False\n        return True\n\n    def divvier(self):\n        args = [self.divvier_path, '-mincol', '4', '-divvygap', self.alignfile]\n        command = ' '.join(args)\n        try:\n            os.system(command)\n            os.rename(self.alignfile+'.divvy.fas', self.align_trimming)\n        except:\n            return False\n        return True\n\n    def buildtrees(self):\n        try:\n            if self.tree_software.upper() == 'IQTREE':\n                args = [self.iqtree_path, '-s', self.align_trimming,\n                        '-m', self.model, '-T', self.threads, '--quiet']\n                
command = ' '.join(args)\n                os.system(command)\n                os.rename(self.align_trimming+'.treefile', self.treefile)\n            elif self.tree_software.upper() == 'FASTTREE':\n                args = [self.fasttree_path,\n                        self.align_trimming, '>', self.treefile]\n                command = ' '.join(args)\n                os.system(command)\n        except:\n            return False\n        if self.delete_detail == True:\n            for file in (self.sequencefile, self.align_trimming+'.bionj', self.align_trimming+'.iqtree', self.align_trimming+'.ckp.gz',\n                         self.align_trimming+'.log', self.align_trimming+'.mldist', self.align_trimming+'.model.gz'):\n                try:\n                    os.remove(file)\n                except OSError:\n                    pass\n        return True\n\n    def run(self):\n        alignment = pd.read_csv(self.alignment, header=None)\n        alignment.replace('.', np.nan, inplace=True)\n        alignment.dropna(thresh=int(self.minimum), inplace=True)\n        if hasattr(self, 'gff') and hasattr(self, 'lens'):\n            gff = base.newgff(self.gff)\n            lens = base.newlens(self.lens, self.position)\n            alignment = pd.merge(\n                alignment, gff[['chr', self.position]], left_on=0, right_on=gff.index, how='left')\n            alignment.dropna(subset=['chr', 'order'], inplace=True)\n            alignment['order'] = alignment['order'].astype(int)\n            alignment = alignment[alignment['chr'].isin(lens.index)]\n            alignment.drop(alignment.columns[-2:], axis=1, inplace=True)\n        data = self.grouping(alignment)\n        fout = open(self.trees_file, 'w')\n        fout.close()\n        for i in range(0, len(data), 100):\n            trees = ' '.join([str(k) for k in data[i:i+100]])\n            args = ['cat', trees, '>>', self.trees_file]\n            command = ' '.join([str(k) for k in args])\n            
os.system(command)\n        df = pd.read_csv(self.trees_file, header=None, sep='\\t')\n        df[0].to_csv(self.trees_file, index=None, sep='\\t', header=False)\n        print(\"done\")"
  },
  {
    "path": "command.txt",
    "content": "python setup.py sdist bdist_wheel\ntwine upload dist/*"
  },
  {
    "path": "setup.py",
    "content": "#!/usr/bin/env python\n# -*- coding: UTF-8 -*-\n\nfrom setuptools import find_packages, setup\n\nwith open(\"README.md\", \"r\", encoding='utf-8') as fh:\n    long_description = fh.read()\n\nrequired = ['pandas>=1.1.0', 'numpy', 'biopython', 'matplotlib', 'scipy', 'tabulate']\n\nsetup(\n    name=\"wgdi\",\n    version=\"0.75\",\n    author=\"Pengchuan Sun\",\n    author_email=\"sunpengchuan@gmail.com\",\n    description=\"A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes\",\n    license=\"BSD License\",\n    long_description=long_description,\n    long_description_content_type=\"text/markdown\",\n    url=\"https://github.com/SunPengChuan/wgdi\",\n    packages=find_packages(),\n    package_data={'': ['*.conf','*.ini', '*.csv']},\n    classifiers=[\n        \"Intended Audience :: Science/Research\",\n        \"Programming Language :: Python :: 3\",\n        \"License :: OSI Approved :: BSD License\",\n        \"Operating System :: OS Independent\",\n    ],\n    entry_points={\n        'console_scripts': [\n            'wgdi = wgdi.run:main',\n        ]\n    },\n    zip_safe=True,\n    install_requires=required\n)\n"
  },
  {
    "path": "wgdi/__init__.py",
    "content": ""
  },
  {
    "path": "wgdi/align_dotplot.py",
    "content": "import re\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\nclass align_dotplot:\n    def __init__(self, options):\n        # Default values\n        self.position = 'order'\n        self.figsize = 'default'\n        self.classid = 'class1'\n\n        # Initialize from options\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f'{k} = {v}')\n        \n        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]\n        self.colors = [str(k) for k in getattr(self, 'colors', 'red,blue,green,black,orange').split(',')]\n        self.ancestor_top = None if getattr(self, 'ancestor_top', 'none') == 'none' else self.ancestor_top\n        self.ancestor_left = None if getattr(self, 'ancestor_left', 'none') == 'none' else self.ancestor_left\n\n        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)\n\n    def pair_position(self, alignment, loc1, loc2, colors):\n        alignment.index = alignment.index.map(loc1)\n        data = []\n        for i, k in enumerate(alignment.columns):\n            df = alignment[k].map(loc2).dropna()\n            for idx, row in df.items():\n                data.append([idx, row, colors[i]])\n        return pd.DataFrame(data, columns=['loc1', 'loc2', 'color'])\n\n    def run(self):\n        axis = [0, 1, 1, 0]\n\n        # Lens generation and figure size\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        \n        if re.search(r'\\d', self.figsize):\n            self.figsize = [float(k) for k in self.figsize.split(',')]\n        else:\n            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10\n            \n        plt.rcParams['ytick.major.pad'] = 0\n\n        # Create plot\n        fig, ax = plt.subplots(figsize=self.figsize)\n        ax.xaxis.set_ticks_position('top')\n        step1, step2 = 1 / 
float(lens1.sum()), 1 / float(lens2.sum())\n\n        # Process Ancestor Data\n        if self.ancestor_left:\n            axis[0] = -0.02\n            lens_ancestor_left = self.process_ancestor(self.ancestor_left, lens1.index)\n\n        if self.ancestor_top:\n            axis[3] = -0.02\n            lens_ancestor_top = self.process_ancestor(self.ancestor_top, lens2.index)\n\n        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2, \n                           self.genome1_name, self.genome2_name, [0, 1])\n\n        # Process GFF files\n        gff1, gff2 = base.newgff(self.gff1), base.newgff(self.gff2)\n        gff1 = base.gene_location(gff1, lens1, step1, self.position)\n        gff2 = base.gene_location(gff2, lens2, step2, self.position)\n\n        if self.ancestor_top:\n            self.ancestor_position(ax, gff2, lens_ancestor_top, 'top')\n\n        if self.ancestor_left:\n            self.ancestor_position(ax, gff1, lens_ancestor_left, 'left')\n\n        # Process block info and alignment\n        bkinfo = self.process_blockinfo(lens1,lens2)\n        align = self.alignment(gff1, gff2, bkinfo)\n        alignment = align[gff1.columns[-len(bkinfo[self.classid].drop_duplicates()):]]\n        alignment.to_csv(self.savefile, header=False)\n\n        # Create scatter plot\n        df = self.pair_position(alignment, gff1['loc'], gff2['loc'], self.colors)\n        plt.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'], \n                    alpha=0.5, edgecolors=None, linewidths=0, marker='o')\n\n        ax.axis(axis)\n        plt.subplots_adjust(left=0.07, right=0.97, top=0.93, bottom=0.03)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n\n    def process_ancestor(self, ancestor_file, lens_index):\n        df = pd.read_csv(ancestor_file, sep=\"\\t\", header=None)\n        df[0] = df[0].astype(str)\n        df[3] = df[3].astype(str)\n        df[4] = df[4].astype(int)\n        df[4] = df[4] / df[4].max()\n        return 
df[df[0].isin(lens_index)]\n\n    def process_blockinfo(self, lens1, lens2):\n        bkinfo = pd.read_csv(self.blockinfo, index_col='id')\n        if self.blockinfo_reverse ==  True:\n            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]\n            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        bkinfo[self.classid] = bkinfo[self.classid].astype(str)\n        return bkinfo[bkinfo['chr1'].isin(lens1.index) & (bkinfo['chr2'].isin(lens2.index))]\n\n    def alignment(self, gff1, gff2, bkinfo):\n        gff1['uid'] = gff1['chr'] + 'g' + gff1['order'].astype(str)\n        gff2['uid'] = gff2['chr'] + 'g' + gff2['order'].astype(str)\n        gff1['id'] = gff1.index\n        gff2['id'] = gff2.index\n        \n        for cl, group in bkinfo.groupby(self.classid):\n            name = f'l{cl}'\n            gff1[name] = ''\n            group = group.sort_values(by=['length'], ascending=True)\n\n            for _, row in group.iterrows():\n                block = self.create_block_dataframe(row)\n                if block.empty:\n                    continue\n                block1_min, block1_max = block['block1'].agg(['min', 'max'])\n                area = gff1[(gff1['chr'] == row['chr1']) & \n                            (gff1['order'] >= block1_min) & \n                            (gff1['order'] <= block1_max)].index\n                \n                block['id1'] = (row['chr1'] + 'g' + block['block1'].astype(str)).map(\n                    dict(zip(gff1['uid'], gff1.index)))\n                block['id2'] = (row['chr2'] + 'g' + block['block2'].astype(str)).map(\n                    dict(zip(gff2['uid'], gff2.index)))\n\n                gff1.loc[block['id1'].values, name] = block['id2'].values\n                gff1.loc[gff1.index.isin(area) & gff1[name].eq(''), name] = '.'\n        return gff1\n\n    def create_block_dataframe(self, 
row):\n        b1, b2, ks = row['block1'].split('_'), row['block2'].split('_'), row['ks'].split('_')\n        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))\n        block = pd.DataFrame(np.array([b1, b2, ks]).T, columns=['block1', 'block2', 'ks'])\n        block['block1'] = block['block1'].astype(int)\n        block['block2'] = block['block2'].astype(int)\n        block['ks'] = block['ks'].astype(float)\n        return block[(block['ks'] <= self.ks_area[1]) & \n                     (block['ks'] >= self.ks_area[0])].drop_duplicates(subset=['block1'], keep='first')\n\n    def ancestor_position(self, ax, gff, lens, mark):\n        for _, row in lens.iterrows():\n            loc1 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[1]))].index\n            loc2 = gff[(gff['chr'] == row[0]) & (gff['order'] == int(row[2]))].index\n            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']\n            if mark == 'top':\n                width = abs(loc1-loc2)\n                loc = [min(loc1, loc2), 0]\n                height = -0.02\n            if mark == 'left':\n                height = abs(loc1-loc2)\n                loc = [-0.02, min(loc1, loc2), ]\n                width = 0.02\n            base.Rectangle(ax, loc, height, width, row[3], row[4])"
  },
  {
    "path": "wgdi/ancestral_karyotype.py",
    "content": "import pandas as pd\nfrom Bio import SeqIO\nimport wgdi.base as base\n\n\nclass ancestral_karyotype:\n    def __init__(self, options):\n        self.mark = 'aak'\n        \n        # Set attributes from options\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n\n    def run(self):\n        # Load and filter data\n        gff = base.newgff(self.gff)\n        ancestor = base.read_classification(self.ancestor)\n        gff = gff[gff['chr'].isin(ancestor[0].values.tolist())]\n\n        # Create new gff copy and initialize required variables\n        newgff = gff.copy()\n        data, num = [], 1\n\n        # Create dictionary mapping chromosome to order\n        chr_arr = ancestor[3].drop_duplicates().to_list()\n        chr_dict = {chr: idx + 1 for idx, chr in enumerate(chr_arr)}\n        ancestor['order'] = ancestor[3].map(chr_dict)\n\n        dict1, dict2 = {}, {}\n\n        # Process ancestor and gff information\n        for (cla, order), group in ancestor.groupby([4, 'order'], sort=[False, False]):\n            for index, row in group.iterrows():\n                index1 = gff[(gff['chr'] == row[0]) & (gff['order'] >= row[1]) & (gff['order'] <= row[2])].index\n                newgff.loc[index1, 'chr'] = str(num)\n                \n                # Store results in data\n                for k in index1:\n                    data.append(newgff.loc[k, :].values.tolist() + [k])\n\n            dict1[str(num)] = cla\n            dict2[str(num)] = group[3].values[0]\n            num += 1\n\n        # Create dataframe from the data collected\n        df = pd.DataFrame(data)\n\n        # Filter based on peptide file\n        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, \"fasta\"))\n        df = df[df[6].isin(pep.keys())]\n\n        # Assign new names and order\n        for name, group in df.groupby(0):\n            df.loc[group.index, 'order'] = range(1, len(group) + 1)\n            df.loc[group.index, 
'newname'] = [f\"{self.mark}{name}g{i:05d}\" for i in range(1, len(group) + 1)]\n\n        # Set data types and sort\n        df['order'] = df['order'].astype(int)\n        df = df[[0, 'newname', 1, 2, 3, 'order', 6]].sort_values(by=[0, 'order'])\n\n        # Save output files\n        df.to_csv(self.ancestor_gff, sep=\"\\t\", index=False, header=None)\n        lens = df.groupby(0).max()[[2, 'order']]\n        lens.to_csv(self.ancestor_lens, sep=\"\\t\", header=None)\n\n        # Add extra columns and save final results\n        lens[1] = 1\n        lens['color'] = lens.index.map(dict2)\n        lens['class'] = lens.index.map(dict1)\n        lens[[1, 'order', 'color', 'class']].to_csv(self.ancestor_file, sep=\"\\t\", header=None)\n\n        # Update peptide sequences with new IDs and save\n        id_dict = df.set_index(6).to_dict()['newname']\n        seqs = []\n\n        for seq_record in SeqIO.parse(self.pep_file, \"fasta\"):\n            if seq_record.id in id_dict:\n                seq_record.id = id_dict[seq_record.id]\n                seqs.append(seq_record)\n\n        SeqIO.write(seqs, self.ancestor_pep, \"fasta\")\n"
  },
  {
    "path": "wgdi/ancestral_karyotype_repertoire.py",
    "content": "\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\n\nimport wgdi.base as base\n\nclass ancestral_karyotype_repertoire():\n    def __init__(self, options):\n        self.gap = 5\n        self.direction = 0.01\n        self.mark = 'aak1s'\n        self.blockinfo_reverse = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n        self.blockinfo_reverse =  base.str_to_bool(self.blockinfo_reverse)\n\n    def run(self):\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n        bkinfo = pd.read_csv(self.blockinfo, index_col='id')\n        if self.blockinfo_reverse == True:\n            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]\n            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]\n        for index, row in bkinfo.iterrows():\n            block1, block2 = row['block1'].split('_'), row['block2'].split('_')\n            block1, block2 = [int(k) for k in block1], [int(k) for k in block2]\n            if int(block1[1])-int(block1[0]) < 0:\n                self.direction = -0.01\n            for i in range(1, len(block2)):\n                if abs(block1[i]-block1[i-1]) == 1 and abs(block2[i]-block2[i-1]) < int(self.gap):\n                    gff1_id = gff1[(gff1['chr'] == str(row['chr1'])) & (\n                        gff1['order'] == block1[i])].index[0]\n                    order = gff1.loc[gff1_id, 'order']\n                    gff1_row = gff1.loc[gff1_id, :].copy()\n                    for num in range(block2[i-1], block2[i]):\n                        order = order + self.direction\n                        id = gff2[(gff2['chr'] == str(row['chr2']))\n                                  & (gff2['order'] == num)].index[0]\n                        gff1_row['order'] = order\n                        gff1.loc[id, :] = gff1_row\n        df = gff1.copy()\n        df = df.sort_values(by=['chr', 'order'])\n        for name, group in 
df.groupby('chr'):\n            df.loc[group.index, 'order'] = list(range(1, len(group)+1))\n            df.loc[group.index, 'newname'] = list(\n                [str(self.mark)+str(name)+'g'+str(i).zfill(5) for i in range(1, len(group)+1)])\n        df['order'] = df['order'].astype(int)\n        df['oldname'] = df.index\n        columns = ['chr', 'newname', 'start',\n                   'end', 'strand', 'order', 'oldname']\n        df[columns].to_csv(self.ancestor_gff, sep=\"\\t\",\n                           index=False, header=None)\n        lens = df.groupby('chr').max()[['end', 'order']]\n        lens['end'] = lens['end'].astype(np.int64)\n        lens.to_csv(self.ancestor_lens, sep=\"\\t\", header=None)\n        ancestor = base.read_classification(self.ancestor)\n        for index, row in ancestor.iterrows():\n            ancestor.at[index, 1] = 1\n            ancestor.at[index, 2] = lens.at[str(row[0]),'order']\n        ancestor.to_csv(self.ancestor_new, sep=\"\\t\", index=False, header=None)\n        id_dict = df['newname'].to_dict()\n        seqs = []\n        for seq_record in SeqIO.parse(self.ancestor_pep, \"fasta\"):\n            if seq_record.id in id_dict:\n                seq_record.id = id_dict[seq_record.id]\n            else:\n                continue\n            seq_record.description = ''\n            seqs.append(seq_record)\n        SeqIO.write(seqs, self.ancestor_pep_new, \"fasta\")\n"
  },
  {
    "path": "wgdi/base.py",
    "content": "import configparser\nimport hashlib\nimport os\nimport re\n\nimport matplotlib\nimport matplotlib.patches as mpatches\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\n\nimport wgdi\n\n\ndef gen_md5_id(item):\n    \"\"\"Generate MD5 hash for the given item.\"\"\"\n    return hashlib.md5(item.encode('utf-8')).hexdigest()\n\n\ndef config():\n    \"\"\"Read configuration from the example conf.ini file.\"\"\"\n    conf = configparser.ConfigParser()\n    conf.read(os.path.join(wgdi.__path__[0], 'example/conf.ini'))\n    return conf.items('ini')\n\n\ndef load_conf(file, section):\n    \"\"\"Load configuration items from the specified section.\"\"\"\n    conf = configparser.ConfigParser()\n    conf.read(file)\n    return conf.items(section)\n\n\ndef rewrite(file, section):\n    \"\"\"Rewrite the configuration file to keep only the specified section.\"\"\"\n    conf = configparser.ConfigParser()\n    conf.read(file)\n    if conf.has_section(section):\n        for k in conf.sections():\n            if k != section:\n                conf.remove_section(k)\n        conf.write(open(os.path.join(wgdi.__path__[0], 'example/conf.ini'), 'w'))\n        print('Option ini has been modified')\n    else:\n        print('Option ini no change')\n\n\ndef read_colinearscan(file):\n    \"\"\"Read colinearscan output and parse into data structure.\"\"\"\n    data, b, flag, num = [], [], 0, 1\n    with open(file) as f:\n        for line in f:\n            line = line.strip()\n            if re.match(r\"the\", line):\n                num = re.search(r'\\d+', line).group()\n                b = []\n                flag = 1\n                continue\n            if re.match(r\"\\>LOCALE\", line):\n                flag = 0\n                p = re.split(':', line)\n                if b:\n                    data.append([num, b, p[1]])\n                b = []\n                continue\n            if flag == 1:\n                a = re.split(r\"\\s\", line)\n            
    b.append(a)\n    if b:\n        data.append([num, b, p[1]])\n    return data\n\n\ndef read_mcscanx(fn):\n    \"\"\"Read mcscanx output and parse into data structure.\"\"\"\n    with open(fn) as f1:\n        data, b = [], []\n        flag, num = 0, 0\n        for line in f1:\n            line = line.strip()\n            if re.match(r\"## Alignment\", line):\n                flag = 1\n                if not b:\n                    arr = re.findall(r\"[\\d+\\.]+\", line)[0]\n                    continue\n                data.append([num, b, 0])\n                b = []\n                num = re.findall(r\"\\d+\", line)[0]\n                continue\n            if flag == 0:\n                continue\n            a = re.split(r\"\\:\", line)\n            c = re.split(r\"\\s+\", a[1])\n            b.append([c[1], c[1], c[2], c[2]])\n        if b:\n            data.append([num, b, 0])\n    return data\n\n\ndef read_jcvi(fn):\n    \"\"\"Read jcvi output and parse into data structure.\"\"\"\n    with open(fn) as f1:\n        data, b = [], []\n        num = 1\n        for line in f1:\n            line = line.strip()\n            if re.match(r\"###\", line):\n                if b:\n                    data.append([num, b, 0])\n                    b = []\n                num += 1\n                continue\n            a = re.split(r\"\\t\", line)\n            b.append([a[0], a[0], a[1], a[1]])\n        if b:\n            data.append([num, b, 0])\n    return data\n\n\ndef read_collinearity(fn):\n    \"\"\"Read collinearity output and parse into data structure.\"\"\"\n    with open(fn) as f1:\n        data, b = [], []\n        flag, arr = 0, []\n        for line in f1:\n            line = line.strip()\n            if re.match(r\"# Alignment\", line):\n                flag = 1\n                if not b:\n                    arr = re.findall(r'[\\.\\d+]+', line)\n                    continue\n                data.append([arr[0], b, arr[2]])\n                b = []\n            
    arr = re.findall(r'[\\.\\d+]+', line)\n                continue\n            if flag == 0:\n                continue\n            b.append(re.split(r\"\\s\", line))\n        if b:\n            data.append([arr[0], b, arr[2]])\n    return data\n\n\ndef read_ks(file, col):\n    \"\"\"Read KS values from file and select specified column.\"\"\"\n    ks = pd.read_csv(file, sep='\\t')\n    ks.drop_duplicates(subset=['id1', 'id2'], keep='first', inplace=True)\n    ks[col] = ks[col].astype(float)\n    ks = ks[ks[col] >= 0]\n    ks.index = ks['id1'] + ',' + ks['id2']\n    return ks[col]\n\n\ndef get_median(data):\n    \"\"\"Calculate the median of the data list.\"\"\"\n    if not data:\n        return 0\n    data_sorted = sorted(data)\n    half = len(data_sorted) // 2\n    return (data_sorted[half] + data_sorted[-(half + 1)]) / 2\n\n\ndef cds_to_pep(cds_file, pep_file, fmt='fasta'):\n    \"\"\"Translate CDS sequences to peptide sequences and write to file.\"\"\"\n    records = list(SeqIO.parse(cds_file, fmt))\n    for rec in records:\n        rec.seq = rec.seq.translate()\n    SeqIO.write(records, pep_file, 'fasta')\n    return True\n\n\ndef newblast(file, score, evalue, gene_loc1, gene_loc2, reverse):\n    \"\"\"Filter BLAST results based on score, evalue, and gene locations.\"\"\"\n    blast = pd.read_csv(file, sep=\"\\t\", header=None)\n    \n    if reverse == 'true':\n        blast[[0, 1]] = blast[[1, 0]]\n    blast = blast[(blast[11] >= score) & (blast[10] < evalue) & (blast[1] != blast[0])]\n    blast = blast[(blast[0].isin(gene_loc1.index)) & (blast[1].isin(gene_loc2.index))]\n    blast.drop_duplicates(subset=[0, 1], keep='first', inplace=True)\n    blast[0] = blast[0].astype(str)\n    blast[1] = blast[1].astype(str)\n    return blast\n\n\ndef newgff(file):\n    \"\"\"Read GFF file and rename columns with appropriate data types.\"\"\"\n    gff = pd.read_csv(file, sep=\"\\t\", header=None, index_col=1)\n    gff.rename(columns={0: 'chr', 2: 'start', 3: 'end', 4: 
'strand', 5: 'order'}, inplace=True)\n    gff['chr'] = gff['chr'].astype(str)\n    gff['start'] = gff['start'].astype(np.int64)\n    gff['end'] = gff['end'].astype(np.int64)\n    gff['strand'] = gff['strand'].astype(str)\n    gff['order'] = gff['order'].astype(int)\n    return gff\n\n\ndef newlens(file, position):\n    \"\"\"Read lens file and select position based on 'order' or 'end'.\"\"\"\n    lens = pd.read_csv(file, sep=\"\\t\", header=None, index_col=0)\n    lens.index = lens.index.astype(str)\n    if position == 'order':\n        lens = lens[2]\n    elif position == 'end':\n        lens = lens[1]\n    return lens\n\n\ndef read_classification(file):\n    \"\"\"Read classification data and convert columns to appropriate types.\"\"\"\n    classification = pd.read_csv(file, sep=\"\\t\", header=None)\n    classification[0] = classification[0].astype(str)\n    classification[1] = classification[1].astype(int)\n    classification[2] = classification[2].astype(int)\n    classification[3] = classification[3].astype(str)\n    classification[4] = classification[4].astype(int)\n    return classification\n\n\ndef gene_location(gff, lens, step, position):\n    \"\"\"Calculate gene locations based on lens and step.\"\"\"\n    gff = gff[gff['chr'].isin(lens.index)].copy()\n    if gff.empty:\n        print('Stopped! 
\\n\\nChromosomes in gff file and lens file do not correspond.')\n        exit(0)\n    dict_chr = dict(zip(lens.index, np.append(np.array([0]), lens.cumsum()[:-1].values)))\n    gff['loc'] = ''\n    for name, group in gff.groupby('chr'):\n        gff.loc[group.index, 'loc'] = (dict_chr[name] + group[position]) * step\n    return gff\n\n\ndef dotplot_frame(fig, ax, lens1, lens2, step1, step2, genome1_name, genome2_name, arr, pad = 0):\n    \"\"\"Set up the dotplot frame with grid lines and labels.\"\"\"\n    for k in lens1.cumsum()[:-1] * step1:\n        ax.axhline(y=k, alpha=0.8, color='black', lw=0.5)\n    for k in lens2.cumsum()[:-1] * step2:\n        ax.axvline(x=k, alpha=0.8, color='black', lw=0.5)\n    align = dict(family='DejaVu Sans', style='italic', horizontalalignment=\"center\", verticalalignment=\"center\")\n    yticks = lens1.cumsum() * step1 - 0.5 * lens1 * step1\n    ax.set_yticks(yticks)\n    ax.set_yticklabels(lens1.index, fontsize = 13, family='DejaVu Sans', style='normal')\n    ax.tick_params(axis='y', which='major', pad = pad)\n    ax.tick_params(axis='x', which='major', pad = pad)\n    xticks = lens2.cumsum() * step2 - 0.5 * lens2 * step2\n    ax.set_xticks(xticks)\n    ax.set_xticklabels(lens2.index, fontsize = 13, family='DejaVu Sans', style='normal')\n    ax.xaxis.set_ticks_position('none')\n    ax.yaxis.set_ticks_position('none')\n    # Both branches on `arr` placed the labels identically, so the labels are drawn unconditionally.\n    ax.text(-0.06, 0.5, genome1_name, weight='semibold', fontsize=16, rotation=90, **align)\n    ax.text(0.5, -0.06, genome2_name, weight='semibold', fontsize=16, **align)\n\ndef Bezier3(plist, t):\n    \"\"\"Evaluate a quadratic Bezier curve (three control points) at parameter t.\"\"\"\n    p0, p1, p2 = plist\n    return p0 * (1 - t) ** 2 + 2 * p1 * t * (1 - t) + p2 * t ** 2\n\n\ndef Bezier4(plist, 
t):\n    \"\"\"Calculate Bezier curve of degree 4.\"\"\"\n    p0, p1, p2, p3, p4 = plist\n    return p0 * (1 - t) ** 4 + 4 * p1 * t * (1 - t) ** 3 + 6 * p2 * t ** 2 * (1 - t) ** 2 + 4 * p3 * (1 - t) * t ** 3 + p4 * t ** 4\n\n\ndef Rectangle(ax, loc, height, width, color, alpha):\n    \"\"\"Draw a rectangle on the axes with specified properties.\"\"\"\n    p = mpatches.Rectangle(loc, width, height, edgecolor=None, facecolor=color, alpha=alpha)\n    ax.add_patch(p)\n\ndef str_to_bool(s):\n    \"\"\"Return s unchanged if it is already a bool; otherwise True only for the string 'true' (case-insensitive, surrounding whitespace ignored).\"\"\"\n    if isinstance(s, bool):\n        return s\n    return str(s).strip().lower() == 'true'"
  },
  {
    "path": "wgdi/block_correspondence.py",
    "content": "import re\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\nclass block_correspondence():\n    def __init__(self, options):\n        # Default values\n        self.tandem = True\n        self.pvalue = 0.2\n        self.position = 'order'\n        self.block_length = 5\n        self.tandem_length = 200\n        self.tandem_ratio = 1\n        self.ks_hit = 0.5\n\n        # Set user-defined options\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n\n        # Parse ks_area and homo if present\n        self.ks_area = [float(k) for k in getattr(self, 'ks_area', '-1,3').split(',')]\n        self.homo = [float(k) for k in self.homo.split(',')]\n        self.tandem_ratio = float(self.tandem_ratio)\n        self.tandem = base.str_to_bool(self.tandem)\n\n    def run(self):\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        \n        # Load block information from CSV\n        bkinfo = pd.read_csv(self.blockinfo)\n        bkinfo = self.preprocess_blockinfo(bkinfo, lens1, lens2)\n        \n        # Initialize correspondence DataFrame\n        cor = self.initialize_correspondence(lens1, lens2)\n        \n        # If no tandem allowed, remove tandem regions\n        if not self.tandem:\n            bkinfo = self.remove_tandem(bkinfo)\n        \n        # Remove low KS hits\n        bkinfo = self.remove_ks_hit(bkinfo)\n\n        # Find collinearity regions and save results\n        collinear_indices = self.collinearity_region(cor, bkinfo, lens1)\n        bkinfo.loc[bkinfo.index.isin(collinear_indices), :].to_csv(self.savefile, index=False)\n\n    def preprocess_blockinfo(self, bkinfo, lens1, lens2):\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        \n        # Filter by length, chromosome indices, and p-value\n        bkinfo = bkinfo[(bkinfo['length'] >= 
int(self.block_length)) & \n                        (bkinfo['chr1'].isin(lens1.index)) & \n                        (bkinfo['chr2'].isin(lens2.index)) & \n                        (bkinfo['pvalue'] <= float(self.pvalue))]\n        \n        # Filter by tandem ratio if the column exists\n        if 'tandem_ratio' in bkinfo.columns:\n            bkinfo = bkinfo[bkinfo['tandem_ratio'] <= self.tandem_ratio]\n        \n        return bkinfo\n\n    def initialize_correspondence(self, lens1, lens2):\n        # Create correspondence DataFrame with initial values\n        cor = [[k, i, 0, lens1[i], j, 0, lens2[j], float(self.homo[0]), float(self.homo[1])] \n               for k in range(1, int(self.multiple) + 1) \n               for i in lens1.index \n               for j in lens2.index]\n        \n        cor = pd.DataFrame(cor, columns=['sub', 'chr1', 'start1', 'end1', 'chr2', 'start2', 'end2', 'homo1', 'homo2'])\n        cor['chr1'] = cor['chr1'].astype(str)\n        cor['chr2'] = cor['chr2'].astype(str)\n        \n        return cor\n\n    def remove_tandem(self, bkinfo):\n        # Remove tandem regions from the DataFrame\n        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()\n        group['start'] = group['start1'] - group['start2']\n        group['end'] = group['end1'] - group['end2']\n        tandem_condition = (group['start'].abs() <= int(self.tandem_length)) | (group['end'].abs() <= int(self.tandem_length))\n        index_to_remove = group[tandem_condition].index\n        return bkinfo.drop(index_to_remove)\n\n    def remove_ks_hit(self, bkinfo):\n        # Remove records with insufficient KS hits\n        for index, row in bkinfo.iterrows():\n            ks = self.get_ks_value(row['ks'])\n            ks_ratio = len([k for k in ks if self.ks_area[0] <= k <= self.ks_area[1]]) / len(ks)\n            if ks_ratio < self.ks_hit:\n                bkinfo.drop(index, inplace=True)\n        return bkinfo\n\n    def get_ks_value(self, ks_str):\n        # Extract 
and return KS values as floats\n        ks = ks_str.split('_')\n        ks = list(map(float, ks[1:])) if ks[0] == '' else list(map(float, ks))\n        return ks\n\n    def collinearity_region(self, cor, bkinfo, lens):\n        collinear_indices = []\n        for (chr1, chr2), group in bkinfo.groupby(['chr1', 'chr2']):\n            group = group.sort_values(by=['length'], ascending=False)\n            df = pd.Series(0, index=range(1, int(lens[str(chr1)]) + 1))\n            for index, row in group.iterrows():\n                # Check homology conditions\n                if not self.is_valid_homo(row):\n                    continue\n                # Update the block series and compute ratio\n                b1 = [int(k) for k in row['block1'].split('_')]\n                df1 = df.copy()\n                df1[b1] += 1\n                ratio = (len(df1[df1 > 0]) - len(df[df > 0])) / len(b1)\n                if ratio < 0.5:\n                    continue\n                df[b1] += 1\n                collinear_indices.append(index)\n        \n        return collinear_indices\n\n    def is_valid_homo(self, row):\n        # Check if the homology values are within the specified range\n        return self.homo[0] <= row['homo' + self.multiple] <= self.homo[1]\n"
  },
  {
    "path": "wgdi/block_info.py",
    "content": "import numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass block_info:\n    def __init__(self, options):\n        self.repeat_number = 20\n        self.ks_col = 'ks_NG86'\n        self.blast_reverse = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n        \n        self.repeat_number = int(self.repeat_number)\n        self.blast_reverse = base.str_to_bool(self.blast_reverse)\n\n    def block_position(self, collinearity, blast, gff1, gff2, ks):\n        data = []\n        for block in collinearity:\n            blk_homo, blk_ks = [], []\n\n            # Skip blocks with missing gene coordinates in GFF files\n            if block[1][0][0] not in gff1.index or block[1][0][2] not in gff2.index:\n                continue\n            \n            # Extract chromosome info\n            chr1, chr2 = gff1.at[block[1][0][0], 'chr'], gff2.at[block[1][0][2], 'chr']\n            \n            # Extract start and end positions\n            array1, array2 = [float(i[1]) for i in block[1]], [float(i[3]) for i in block[1]]\n            start1, end1 = array1[0], array1[-1]\n            start2, end2 = array2[0], array2[-1]\n            \n            block1, block2 = [], []\n            for k in block[1]:\n                block1.append(int(float(k[1])))\n                block2.append(int(float(k[3])))\n                \n                # Check for KS values\n                pair_ks = self.get_ks_value(ks, k)\n                blk_ks.append(pair_ks)\n\n                # Retrieve blast homo data\n                if k[0]+\",\"+k[2] in blast.index:\n                    blk_homo.append(blast.loc[k[0]+\",\"+k[2], [f'homo{i}' for i in range(1, 6)]].values.tolist())\n            \n            ks_median, ks_average = self.calculate_ks_statistics(blk_ks)\n            homo = self.calculate_homo_statistics(blk_homo)\n\n            blkks = '_'.join([str(k) for k in blk_ks])\n            block1 = 
'_'.join([str(k) for k in block1])\n            block2 = '_'.join([str(k) for k in block2])\n            \n            # Calculate tandem ratio\n            tandem_ratio = self.tandem_ratio(blast, gff2, block[1])\n            \n            # Store the results\n            data.append([\n                block[0], chr1, chr2, start1, end1, start2, end2, block[2], len(block[1]), \n                ks_median, ks_average, *homo, block1, block2, blkks, tandem_ratio\n            ])\n        \n        # Create a DataFrame with the results\n        data_df = pd.DataFrame(data, columns=[\n            'id', 'chr1', 'chr2', 'start1', 'end1', 'start2', 'end2', 'pvalue', 'length', \n            'ks_median', 'ks_average', 'homo1', 'homo2', 'homo3', 'homo4', 'homo5', \n            'block1', 'block2', 'ks', 'tandem_ratio'\n        ])\n\n        # Calculate density\n        data_df['density1'] = data_df['length'] / ((data_df['end1'] - data_df['start1']).abs() + 1)\n        data_df['density2'] = data_df['length'] / ((data_df['end2'] - data_df['start2']).abs() + 1)\n\n        return data_df\n\n    def get_ks_value(self, ks, k):\n        \"\"\"Return KS value for the given pair of genes.\"\"\"\n        pair = f\"{k[0]},{k[2]}\"\n        if pair in ks.index:\n            return ks[pair]\n        pair_rev = f\"{k[2]},{k[0]}\"\n        if pair_rev in ks.index:\n            return ks[pair_rev]\n        return -1\n\n    def calculate_ks_statistics(self, blk_ks):\n        \"\"\"Calculate KS statistics: median and average.\"\"\"\n        ks_arr = [k for k in blk_ks if k >= 0]\n        if len(ks_arr) == 0:\n            return -1, -1\n        ks_median = base.get_median(ks_arr)\n        ks_average = sum(ks_arr) / len(ks_arr)\n        return ks_median, ks_average\n\n    def calculate_homo_statistics(self, blk_homo):\n        \"\"\"Calculate homo statistics by averaging across all blocks.\"\"\"\n        df = pd.DataFrame(blk_homo)\n        homo = df.mean().values if len(df) > 0 else [-1, -1, -1, 
-1, -1]\n        return homo\n\n    def blast_homo(self, blast, gff1, gff2, repeat_number):\n        \"\"\"Assign homo values based on blast data.\"\"\"\n        index = [group.sort_values(by=11, ascending=False)[:repeat_number].index.tolist() for name, group in blast.groupby([0])]\n        blast = blast.loc[np.concatenate([k[:repeat_number] for k in index], dtype=object), [0, 1]]\n        blast = blast.assign(homo1=np.nan, homo2=np.nan, homo3=np.nan, homo4=np.nan, homo5=np.nan)\n\n        # Assign homo values\n        for i in range(1, 6):\n            bluenum = i + 5\n            redindex = np.concatenate([k[:i] for k in index], dtype=object)\n            blueindex = np.concatenate([k[i:bluenum] for k in index], dtype=object)\n            grayindex = np.concatenate([k[bluenum:repeat_number] for k in index], dtype=object)\n            blast.loc[redindex, f'homo{i}'] = 1\n            blast.loc[blueindex, f'homo{i}'] = 0\n            blast.loc[grayindex, f'homo{i}'] = -1\n        \n        blast['chr1_order'] = blast[0].map(gff1['order'])\n        blast['chr2_order'] = blast[1].map(gff2['order'])\n        return blast\n\n    def tandem_ratio(self, blast, gff2, block):\n        \"\"\"Calculate tandem ratio for a block.\"\"\"\n        block = pd.DataFrame(block)[[0, 2]].rename(columns={0: 'id1', 2: 'id2'})\n        block['order2'] = block['id2'].map(gff2['order'])\n\n        # Filter block_blast data\n        block_blast = blast[(blast[0].isin(block['id1'].values)) & (blast[1].isin(block['id2'].values))].copy()\n        block_blast = pd.merge(block_blast, block, left_on=0, right_on='id1', how='left')\n        block_blast['difference'] = (block_blast['chr2_order'] - block_blast['order2']).abs()\n\n        # Filter based on difference and calculate ratio\n        block_blast = block_blast[(block_blast['difference'] <= self.repeat_number) & (block_blast['difference'] > 0)]\n        return len(block_blast[0].unique()) / len(block) * len(block_blast) / (len(block) + 
len(block_blast))\n\n    def run(self):\n        \"\"\"Main function to run the analysis.\"\"\"\n        # Initialize required datasets\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n\n        # Filter GFF files based on chromosome indices\n        gff1 = gff1[gff1['chr'].isin(lens1.index)]\n        gff2 = gff2[gff2['chr'].isin(lens2.index)]\n\n        # Load blast data\n        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)\n        blast = self.blast_homo(blast, gff1, gff2, self.repeat_number)\n        blast.index = blast[0] + ',' + blast[1]\n\n        # Get collinearity data\n        collinearity = self.auto_file(gff1, gff2)\n\n        # Load ks data if provided; check for the attribute before reading it\n        ks = pd.Series([], dtype=float) if not hasattr(self, 'ks') or self.ks in ('none', '') else base.read_ks(self.ks, self.ks_col)\n\n        # Get the block position data\n        data = self.block_position(collinearity, blast, gff1, gff2, ks)\n        data['class1'] = 0\n        data['class2'] = 0\n\n        # Save results\n        data.to_csv(self.savefile, index=None)\n\n    def auto_file(self, gff1, gff2):\n        \"\"\"Auto-detect and read collinearity file.\"\"\"\n        with open(self.collinearity) as f:\n            p = ' '.join(f.readlines()[0:30])\n        \n        # Handle different file formats\n        if 'path length' in p or 'MAXIMUM GAP' in p:\n            return base.read_colinearscan(self.collinearity)\n        elif 'MATCH_SIZE' in p or '## Alignment' in p:\n            return self.process_mcscanx(gff1, gff2)\n        elif '# Alignment' in p:\n            return base.read_collinearity(self.collinearity)\n        elif '###' in p:\n            return self.process_jcvi(gff1, gff2)\n\n    def process_mcscanx(self, gff1, gff2):\n        \"\"\"Process MCScanX format collinearity 
data.\"\"\"\n        col = base.read_mcscanx(self.collinearity)\n        collinearity = []\n        for block in col:\n            newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]\n            if newblock:\n                for k in newblock:\n                    k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']\n                collinearity.append([block[0], newblock, block[2]])\n        return collinearity\n\n    def process_jcvi(self, gff1, gff2):\n        \"\"\"Process JCVI format collinearity data.\"\"\"\n        col = base.read_jcvi(self.collinearity)\n        collinearity = []\n        for block in col:\n            newblock = [k for k in block[1] if k[0] in gff1.index and k[2] in gff2.index]\n            if newblock:\n                for k in newblock:\n                    k[1], k[3] = gff1.at[k[0], 'order'], gff2.at[k[2], 'order']\n                collinearity.append([block[0], newblock, block[2]])\n        return collinearity\n"
  },
  {
    "path": "wgdi/block_ks.py",
    "content": "import re\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass block_ks:\n    def __init__(self, options):\n        # Default parameters\n        self.markersize = 0.8\n        self.figsize = 'default'\n        self.tandem_length = 200\n        self.blockinfo_reverse = False\n        self.tandem = False\n        self.area = '0,3'\n        self.position = 'order'\n        self.ks_col = 'ks_NG86'\n        self.pvalue = 0.01\n        \n        # Overriding default parameters with options\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n        \n        # Parsing area as a float list\n        self.area = [float(k) for k in str(self.area).split(',')]\n        self.markersize = float(self.markersize)\n        self.tandem_length = int(self.tandem_length)\n        \n        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)\n        # Convert the 'tandem' flag; assigning to self.remove_tandem would shadow the method below\n        self.tandem = base.str_to_bool(self.tandem)\n\n    def block_position(self, bkinfo, lens1, lens2, step1, step2):\n        pos, pairs = [], []\n        \n        # Create mappings for chromosome positions\n        dict_y_chr = dict(zip(lens1.index, np.append([0], lens1.cumsum()[:-1].values)))\n        dict_x_chr = dict(zip(lens2.index, np.append([0], lens2.cumsum()[:-1].values)))\n        \n        # Iterate through block information\n        for _, row in bkinfo.iterrows():\n            block1 = row['block1'].split('_')\n            block2 = row['block2'].split('_')\n            ks = row['ks'].split('_')\n            \n            locy_median = (dict_y_chr[row['chr1']] + 0.5 * (row['end1'] + row['start1'])) * step1\n            locx_median = (dict_x_chr[row['chr2']] + 0.5 * (row['end2'] + row['start2'])) * step2\n            pos.append([locx_median, locy_median, row['ks_median']])\n            \n            # Ensure ks length matches block length\n            if len(block1) != 
len(ks):\n                ks = ks[1:]\n                \n            for i in range(len(block1)):\n                locy = (dict_y_chr[row['chr1']] + float(block1[i])) * step1\n                locx = (dict_x_chr[row['chr2']] + float(block2[i])) * step2\n                pairs.append([locx, locy, float(ks[i])])\n        \n        return pos, pairs\n\n    def remove_tandem(self, bkinfo):\n        # Filter for same-chromosome blocks\n        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()\n        \n        # Calculate block start and end differences\n        group['start'] = group['start1'] - group['start2']\n        group['end'] = group['end1'] - group['end2']\n        \n        # Remove tandems based on threshold\n        index = group[(group['start'].abs() <= self.tandem_length) |\n                      (group['end'].abs() <= self.tandem_length)].index\n        return bkinfo.drop(index)\n\n    def run(self):\n        # Initialize axis and chromosome lens\n        axis = [0, 1, 1, 0]\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        \n        # Parse figsize\n        if re.search(r'\\d', self.figsize):\n            self.figsize = [float(k) for k in self.figsize.split(',')]\n        else:\n            self.figsize = np.array([1, float(lens1.sum()) / float(lens2.sum())]) * 10\n        \n        # Calculate step sizes\n        step1 = 1 / float(lens1.sum())\n        step2 = 1 / float(lens2.sum())\n        \n        # Create figure and axes\n        fig, ax = plt.subplots(figsize=self.figsize)\n        plt.rcParams['ytick.major.pad'] = 0\n        ax.xaxis.set_ticks_position('top')\n        \n        # Plot dotplot frame\n        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,\n                           self.genome1_name, self.genome2_name, [0, 1])\n        \n        # Load block information\n        bkinfo = pd.read_csv(self.blockinfo)\n        \n        # Handle reverse block 
information\n        if self.blockinfo_reverse == True:\n            bkinfo[['chr1', 'chr2']] = bkinfo[['chr2', 'chr1']]\n            bkinfo[['block1', 'block2']] = bkinfo[['block2', 'block1']]\n        \n        # Filter block information\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        bkinfo = bkinfo[(bkinfo['length'] >= int(self.block_length)) & \n                        (bkinfo['chr1'].isin(lens1.index)) & \n                        (bkinfo['chr2'].isin(lens2.index)) & \n                        (bkinfo['pvalue'] < float(self.pvalue))]\n        \n        # Remove tandem duplicates if required\n        if self.tandem == False:\n            bkinfo = self.remove_tandem(bkinfo)\n        \n        # Calculate positions and pairs\n        pos, pairs = self.block_position(bkinfo, lens1, lens2, step1, step2)\n        \n        # Filter pairs by ks value\n        df = pd.DataFrame(pairs, columns=['loc1', 'loc2', 'ks'])\n        df = df[(df['ks'] >= self.area[0]) & (df['ks'] <= self.area[1])]\n        df.drop_duplicates(inplace=True)\n        \n        # Plot scatter (plt.cm.get_cmap is unavailable in recent Matplotlib releases)\n        cm = plt.get_cmap('gist_rainbow')\n        sc = plt.scatter(df['loc1'], df['loc2'], s=self.markersize, c=df['ks'],\n                         alpha=0.9, edgecolors=None, linewidths=0, marker='o', \n                         vmin=self.area[0], vmax=self.area[1], cmap=cm)\n        \n        # Add colorbar\n        cbar = fig.colorbar(sc, shrink=0.5, pad=0.03, fraction=0.1)\n        align = dict(family='DejaVu Sans', style='normal',\n                     horizontalalignment=\"center\", verticalalignment=\"center\")\n        cbar.set_label('Ks', labelpad=12.5, fontsize=16, **align)\n        \n        # Set axis and save figure\n        ax.axis(axis)\n        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.03)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n"
  },
  {
    "path": "wgdi/circos.py",
    "content": "import re\nimport sys\n\nimport matplotlib as mpl\nimport matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass circos():\n    def __init__(self, options):\n        self.figsize = '10,10'\n        self.position = 'order'\n        self.label_size = 9\n        self.label_radius = 0.015\n        self.column_names = [None]*100\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n        self.ring_width = float(self.ring_width)\n        if hasattr(self, 'legend_square'):\n            self.legend_square = [float(k)\n                                  for k in self.legend_square.split(',')]\n        else:\n            self.legend_square = 0.04, 0.04\n\n    def plot_circle(self, loc_chr, radius, color='black', lw=1, alpha=1, linestyle='-'):\n        for k in loc_chr:\n            start, end = loc_chr[k]\n            t = np.arange(start, end, 0.005)\n            x, y = (radius) * np.cos(t), (radius) * np.sin(t)\n            plt.plot(x, y, linestyle=linestyle,\n                     color=color, lw=lw, alpha=alpha)\n\n    def plot_labels(self, root, labels, loc_chr, radius, horizontalalignment=\"center\", verticalalignment=\"center\", fontsize=6,\n                    color='black'):\n        for k in loc_chr:\n            loc = sum(loc_chr[k]) * 0.5\n            x, y = radius * np.cos(loc), radius * np.sin(loc)\n            self.Wedge(root, (x, y), self.label_radius, 0,\n                       360, self.label_radius, 'white', 1)\n            if 1 * np.pi < loc < 2 * np.pi:\n                loc += np.pi\n            plt.text(x, y, labels[k], horizontalalignment=horizontalalignment, verticalalignment=verticalalignment,\n                     fontsize=fontsize, color=color, rotation=0)\n\n    def Wedge(self, ax, loc, radius, start, end, width, color, alpha):\n    
    p = mpatches.Wedge(loc, radius, start, end, width=width,\n                           edgecolor=None, facecolor=color, alpha=alpha)\n        ax.add_patch(p)\n\n    def plot_bar(self, df, radius, length, lw, color, alpha):\n        for k in df[df.columns[0]].drop_duplicates().values:\n            if str(k) not in color.keys():\n                color[str(k)] = 'black'\n            if k in ['', np.nan]:\n                continue\n            df_chr = df.groupby(df.columns[0]).get_group(k)\n            x1, y1 = radius * \\\n                np.cos(df_chr['rad']), radius * np.sin(df_chr['rad'])\n            x2, y2 = (radius + length) * \\\n                np.cos(df_chr['rad']), (radius + length) * \\\n                np.sin(df_chr['rad'])\n            x = np.array(\n                [x1.values, x2.values, [np.nan] * x1.size]).flatten('F')\n            y = np.array(\n                [y1.values, y2.values, [np.nan] * x1.size]).flatten('F')\n            plt.plot(x, y, linestyle='-',\n                     color=color[str(k)], lw=lw, alpha=alpha)\n\n    def chr_location(self, lens, angle_gap, angle):\n        start, end, loc_chr = 0, 0.2*angle_gap, {}\n        for k in lens.index:\n            end += angle_gap + angle * (float(lens[k]))\n            start = end - angle * (float(lens[k]))\n            loc_chr[k] = [float(start), float(end)]\n        return loc_chr\n\n    def deal_alignment(self, alignment, gff, lens, loc_chr, angle):\n        alignment.replace('\\s+', '', inplace=True)\n        alignment.replace('.', '', inplace=True)\n        print(alignment.dropna(subset=[2, 3],how='all'))\n        # exit(0)\n        newalignment = alignment.copy()\n        for i in range(len(alignment.columns)):\n            alignment[i] = alignment[i].astype(str)\n            newalignment[i] = alignment[i].map(gff['chr'].to_dict())\n        newalignment['loc'] = alignment[0].map(gff[self.position].to_dict())\n        newalignment[0] = newalignment[0].astype('str')\n        
newalignment['loc'] = newalignment['loc'].astype('float')\n        newalignment = newalignment[newalignment[0].isin(lens.index) == True]\n        newalignment['rad'] = np.nan\n        for name, group in newalignment.groupby(0):\n            if str(name) not in loc_chr:\n                continue\n            newalignment.loc[group.index, 'rad'] = loc_chr[str(\n                name)][0]+angle * group['loc']\n        print(newalignment.dropna(subset=[2, 3,4],how='all'))\n        return newalignment\n\n    def deal_ancestor(self, alignment, gff, lens, loc_chr, angle, al):\n        alignment.replace('\\s+', '', inplace=True)\n        alignment.replace('.', np.nan, inplace=True)\n        newalignment = pd.merge(alignment, gff, left_on=0, right_on=gff.index)\n        newalignment['rad'] = np.nan\n        for name, group in newalignment.groupby('chr'):\n            if str(name) not in loc_chr:\n                continue\n            newalignment.loc[group.index, 'rad'] = loc_chr[str(\n                name)][0]+angle * group[self.position]\n        newalignment.index = newalignment[0]\n        newalignment[0] = newalignment[0].map(newalignment['rad'].to_dict())\n        data = []\n        for index_al, row_al in al.iterrows():\n            for k in alignment.columns[1:]:\n                alignment[k] = alignment[k].astype(str)\n                group = newalignment[(newalignment['chr'] == row_al['chr']) & (\n                    newalignment['order'] >= row_al['start']) & (newalignment['order'] <= row_al['end'])].copy()\n                group.loc[:, k] = group.loc[:, k].map(\n                    newalignment['rad']).values\n                group.dropna(subset=[k], inplace=True)\n                group.index = group.index.map(newalignment['rad'].to_dict())\n                group['color'] = row_al['color']\n                group = group[group[k].notnull()]\n                data += group[[0, k, 'color']].values.tolist()\n        df = pd.DataFrame(data, columns=['loc1', 'loc2', 
'color'])\n        return df\n\n    def plot_collinearity(self, data, radius, lw=0.02, alpha=1):\n        for name, group in data.groupby('color'):\n            x, y = np.array([]), np.array([])\n            for index, row in group.iterrows():\n                ex1x, ex1y = radius * \\\n                    np.cos(row['loc1']), radius*np.sin(row['loc1'])\n                ex2x, ex2y = radius * \\\n                    np.cos(row['loc2']), radius*np.sin(row['loc2'])\n                ex3x, ex3y = radius * (1-abs(row['loc1']-row['loc2'])/np.pi) * np.cos((row['loc1']+row['loc2'])*0.5), radius * (\n                    1-abs(row['loc1']-row['loc2'])/np.pi) * np.sin((row['loc1']+row['loc2'])*0.5)\n                x1 = [ex1x, 0.5*ex3x, ex2x]\n                y1 = [ex1y, 0.5*ex3y, ex2y]\n                step = .002\n                t = np.arange(0, 1+step, step)\n                xt = base.Bezier3(x1, t)\n                yt = base.Bezier3(y1, t)\n                x = np.hstack((x, xt, np.nan))\n                y = np.hstack((y, yt, np.nan))\n            plt.plot(x, y, color=name, lw=lw, alpha=alpha)\n\n    def plot_legend(self, ax, chr_color, width, height):\n        (x1, x2) = ax.get_xlim()\n        (y1, y2) = ax.get_ylim()\n        a = 1000\n        for k, v in enumerate(chr_color.keys(), 0):\n            h = y1-k//a*height*2\n            k = k % a\n            if x1 + width * k > x2-width:\n                a = k\n                h = y1-k//a*height*2\n                k = k % a\n            loc = [x1 + width * k, h]\n            base.Rectangle(ax, loc, height, width, chr_color[v], 1)\n            plt.text(loc[0] + width*0.382, h-0.618*height, v, fontsize=12)\n        ax.set_ylim(h-2*height, y2)\n\n    def run(self):\n        fig, ax = plt.subplots(figsize=self.figsize)\n        mpl.rcParams['agg.path.chunksize'] = 100000000\n        lens = base.newlens(self.lens, self.position)\n        radius, angle_gap = float(self.radius), float(self.angle_gap)\n        angle = (2 * np.pi - 
(int(len(lens))+1.5)\n                 * angle_gap) / (int(lens.sum()))\n        loc_chr = self.chr_location(lens, angle_gap, angle)\n        list_colors = [str(k).strip() for k in re.split(',|:', self.colors)]\n        chr_color = dict(zip(list_colors[::2], list_colors[1::2]))\n        gff = base.newgff(self.gff)\n        if hasattr(self, 'ancestor'):\n            ancestor = pd.read_csv(self.ancestor, header=None)\n            al = pd.read_csv(self.ancestor_location, sep='\\t', header=None)\n            al.rename(columns={0: 'chr', 1: 'start',\n                               2: 'end', 3: 'color'}, inplace=True)\n            al['chr'] = al['chr'].astype(str)\n            data = self.deal_ancestor(ancestor, gff, lens, loc_chr, angle, al)\n            self.plot_collinearity(data, radius, lw=0.1, alpha=0.8)\n\n        if hasattr(self, 'alignment'):\n            alignment = pd.read_csv(self.alignment, header=None)\n            print(alignment)\n            newalignment = self.deal_alignment(\n                alignment, gff, lens, loc_chr, angle)\n            if ',' in self.column_names:\n                names = [str(k) for k in self.column_names.split(',')]\n            else:\n                names = [None]*len(newalignment.columns)\n            n = 0\n            align = dict(family='Arial', verticalalignment=\"center\",\n                         horizontalalignment=\"center\")\n            print(newalignment)\n            for k, v in enumerate(newalignment.columns[1:-2]):\n                r = radius + self.ring_width*(k+1)\n                print(k,v,r)\n                self.plot_circle(loc_chr, r, lw=0.5, alpha=1, color='grey')\n                self.plot_bar(newalignment[[v, 'rad']], r + self.ring_width *\n                              0.15, self.ring_width*0.7, 0.15, chr_color, 1)\n                if n % 2 == 0:\n                    loc = 0.05\n                    x, y = (r+self.ring_width*0.5) * \\\n                        np.cos(loc), (r+self.ring_width*0.5) * 
np.sin(loc)\n                    plt.text(x, y, names[n], rotation=loc *\n                             180 / np.pi, fontsize=self.label_size, **align)\n                else:\n                    loc = -0.08\n                    x, y = (r+self.ring_width*0.5) * \\\n                        np.cos(loc), (r+self.ring_width*0.5) * np.sin(loc)\n                    plt.text(x, y, names[n], fontsize=self.label_size,\n                             rotation=loc * 180 / np.pi, **align)\n                n += 1\n        if hasattr(self, 'ancestor'):\n            colors = al['color'].drop_duplicates().values.tolist()\n            ancestor_chr_color = dict(zip(range(1, len(colors)+1), colors))\n            self.plot_legend(ax, ancestor_chr_color,\n                             self.legend_square[0], self.legend_square[1])\n        if hasattr(self, 'alignment'):\n            del chr_color['nan']\n            self.plot_legend(\n                ax, chr_color, self.legend_square[0], self.legend_square[1])\n        labels = self.chr_label + lens.index\n        labels = dict(zip(lens.index, labels))\n        self.plot_labels(ax, labels, loc_chr, radius +\n                         self.ring_width*0.3, fontsize=self.label_size)\n\n        plt.axis('off')\n        a = (ax.get_ylim()[1]-ax.get_ylim()[0]) / \\\n            (ax.get_xlim()[1]-ax.get_xlim()[0])\n        fig.set_size_inches(self.figsize[0], self.figsize[0]*a, forward=True)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n        sys.exit(0)\n"
  },
  {
    "path": "wgdi/collinearity.py",
    "content": "import numpy as np\nimport pandas as pd\n\n\nclass collinearity:\n    def __init__(self, options, points):\n        # Default values\n        self.gap_penalty = -1\n        self.over_length = 0\n        self.mg1 = 40\n        self.mg2 = 40\n        self.pvalue = 1\n        self.over_gap = 3\n        self.points = points\n        self.p_value = 0\n        self.coverage_ratio = 0.8\n        \n        # Set user-defined options\n        for k, v in options:\n            setattr(self, str(k), v)\n\n        # Initialize grading and mg values\n        self.grading = [50, 40, 25] if not hasattr(self, 'grading') else [int(k) for k in self.grading.split(',')]\n        self.mg1, self.mg2 = [40, 40] if not hasattr(self, 'mg') else [int(k) for k in self.mg.split(',')]\n\n        # Convert string values to floats\n        self.pvalue = float(self.pvalue)\n        self.coverage_ratio = float(self.coverage_ratio)\n\n    def get_matrix(self):\n        \"\"\"Initialize the matrix for the collinearity points.\"\"\"\n        self.points['usedtimes1'] = 0\n        self.points['usedtimes2'] = 0\n        self.points['times'] = 1\n        self.points['score1'] = self.points['grading']\n        self.points['score2'] = self.points['grading']\n        self.points['path1'] = self.points.index.to_numpy().reshape(len(self.points), 1).tolist()\n        self.points['path2'] = self.points['path1']\n        self.points_init = self.points.copy()\n        self.mat_points = self.points\n\n    def run(self):\n        \"\"\"Run the main collinearity processing.\"\"\"\n        self.get_matrix()\n        self.score_matrix()\n        data = []\n\n        # Process points for maxPath in the positive direction\n        points1 = self.points[['loc1', 'loc2', 'score1', 'path1', 'usedtimes1']].sort_values(by=['score1'], ascending=False)\n        points1.drop(index=points1[points1['usedtimes1'] < 1].index, inplace=True)\n        points1.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']\n  
      \n        while (self.over_length >= self.over_gap or len(points1) >= self.over_gap):\n            if self.max_path(points1):\n                if self.p_value > self.pvalue:\n                    continue\n                data.append([self.path, self.p_value, self.score])\n\n        # Process points for maxPath in the negative direction\n        points2 = self.points[['loc1', 'loc2', 'score2', 'path2', 'usedtimes2']].sort_values(by=['score2'], ascending=False)\n        points2.drop(index=points2[points2['usedtimes2'] < 1].index, inplace=True)\n        points2.columns = ['loc1', 'loc2', 'score', 'path', 'usedtimes']\n\n        while (self.over_length >= self.over_gap) or (len(points2) >= self.over_gap):\n            if self.max_path(points2):\n                if self.p_value > self.pvalue:\n                    continue\n                data.append([self.path, self.p_value, self.score])\n\n        return data\n\n    def score_matrix(self):\n        \"\"\"Calculate the scoring matrix for the points.\"\"\"\n        for index, row, col in self.points[['loc1', 'loc2']].itertuples():\n            # Get points within a certain range\n            points = self.points[(self.points['loc1'] > row) & \n                                 (self.points['loc2'] > col) & \n                                 (self.points['loc1'] < row + self.mg1) & \n                                 (self.points['loc2'] < col + self.mg2)]\n            \n            row_i_old, gap = row, self.mg2\n            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():\n                if col_j - col > gap and row_i > row_i_old:\n                    break\n                score = grading + (row_i - row + col_j - col) * self.gap_penalty\n                score1 = score + self.points.at[index, 'score1']\n                if score > 0 and self.points.at[index_ij, 'score1'] < score1:\n                    self.points.at[index_ij, 'score1'] = score1\n                    
self.points.at[index, 'usedtimes1'] += 1\n                    self.points.at[index_ij, 'usedtimes1'] += 1\n                    self.points.at[index_ij, 'path1'] = self.points.at[index, 'path1'] + [index_ij]\n                    gap = min(col_j - col, gap)\n                    row_i_old = row_i\n\n        # Reverse processing to handle negative direction\n        points_reverse = self.points.sort_values(by=['loc1', 'loc2'], ascending=[False, True])\n        for index, row, col in points_reverse[['loc1', 'loc2']].itertuples():\n            points = points_reverse[(points_reverse['loc1'] < row) & \n                                    (points_reverse['loc2'] > col) & \n                                    (points_reverse['loc1'] > row - self.mg1) & \n                                    (points_reverse['loc2'] < col + self.mg2)]\n            \n            row_i_old, gap = row, self.mg2\n            for index_ij, row_i, col_j, grading in points[['loc1', 'loc2', 'grading']].itertuples():\n                if col_j - col > gap and row_i < row_i_old:\n                    break\n                score = grading + (row - row_i + col_j - col) * self.gap_penalty\n                score2 = score + self.points.at[index, 'score2']\n                if score > 0 and self.points.at[index_ij, 'score2'] < score2:\n                    self.points.at[index_ij, 'score2'] = score2\n                    self.points.at[index, 'usedtimes2'] += 1\n                    self.points.at[index_ij, 'usedtimes2'] += 1\n                    self.points.at[index_ij, 'path2'] = self.points.at[index, 'path2'] + [index_ij]\n                    gap = min(col_j - col, gap)\n                    row_i_old = row_i\n\n    def max_path(self, points):\n        \"\"\"Find the maximum path for the given points.\"\"\"\n        if len(points) == 0:\n            self.over_length = 0\n            return False\n        \n        # Initialize path score and index\n        self.score, self.path_index = 
points.loc[points.index[0], ['score', 'path']]\n        self.path = points[points.index.isin(self.path_index)]\n        self.over_length = len(self.path_index)\n        \n        # Check if the block overlaps with other blocks\n        if self.over_length >= self.over_gap and len(self.path) / self.over_length > self.coverage_ratio:\n            points.drop(index=self.path.index, inplace=True)\n            [loc1_min, loc2_min], [loc1_max, loc2_max] = self.path[['loc1', 'loc2']].agg(['min', 'max']).to_numpy()\n\n            # Calculate p-value\n            gap_init = self.points_init[(loc1_min <= self.points_init['loc1']) & \n                                        (self.points_init['loc1'] <= loc1_max) & \n                                        (loc2_min <= self.points_init['loc2']) & \n                                        (self.points_init['loc2'] <= loc2_max)].copy()\n            \n            self.p_value = self.p_value_estimated(gap_init, loc1_max - loc1_min + 1, loc2_max - loc2_min + 1)\n            self.path = self.path.sort_values(by=['loc1'], ascending=[True])[['loc1', 'loc2']]\n            return True\n        else:\n            points.drop(index=points.index[0], inplace=True)\n        return False\n\n    def p_value_estimated(self, gap, L1, L2):\n        \"\"\"Estimate p-value based on the given gap and lengths.\"\"\"\n        N1 = gap['times'].sum()\n        N = len(gap)\n        self.points_init.loc[gap.index, 'times'] += 1\n        m = len(self.path)\n        a = (1 - self.score / m / self.grading[0]) * (N1 - m + 1) / N * (L1 - m + 1) * (L2 - m + 1) / L1 / L2\n        return round(a, 4)\n"
  },
  {
    "path": "wgdi/dotplot.py",
    "content": "import re\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass dotplot():\n    def __init__(self, options):\n        self.multiple = 1\n        self.score = 100\n        self.evalue = 1e-5\n        self.repeat_number = 20\n        self.markersize = 0.5\n        self.figsize = 'default'\n        self.position = 'order'\n        self.ancestor_top = None\n        self.ancestor_left = None\n        self.blast_reverse = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n        if self.ancestor_top == 'none' or self.ancestor_top == '':\n            self.ancestor_top = None\n        if self.ancestor_left == 'none' or self.ancestor_left == '':\n            self.ancestor_left = None\n        base.str_to_bool(self.blast_reverse)\n\n    def pair_positon(self, blast, gff1, gff2, rednum, repeat_number):\n        blast['color'] = ''\n        blast['loc1'] = blast[0].map(gff1['loc'])\n        blast['loc2'] = blast[1].map(gff2['loc'])\n        bluenum = 5+rednum\n        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()\n                 for name, group in blast.groupby([0])]\n        reddata = np.array([k[:rednum] for k in index], dtype=object)\n        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)\n        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)\n        if len(reddata):\n            redindex = np.concatenate(reddata)\n        else:\n            redindex = []\n        if len(bluedata):\n            blueindex = np.concatenate(bluedata)\n        else:\n            blueindex = []\n        if len(graydata):\n            grayindex = np.concatenate(graydata)\n        else:\n            grayindex = []\n        blast.loc[redindex, 'color'] = 'red'\n        blast.loc[blueindex, 'color'] = 'blue'\n        blast.loc[grayindex, 'color'] = 'gray'\n        return 
blast[blast['color'].str.contains(r'\\w')]\n\n    def run(self):\n        axis = [0, 1, 1, 0]\n        left, right, top, bottom = 0.07, 0.97, 0.93, 0.03\n        lens1 = base.newlens(self.lens1, self.position)\n        lens2 = base.newlens(self.lens2, self.position)\n        step1 = 1 / float(lens1.sum())\n        step2 = 1 / float(lens2.sum())\n        if self.ancestor_left != None:\n            axis[0] = -0.02\n            lens_ancestor_left = pd.read_csv(\n                self.ancestor_left, sep=\"\\t\", header=None)\n            lens_ancestor_left[0] = lens_ancestor_left[0].astype(str)\n            lens_ancestor_left[3] = lens_ancestor_left[3].astype(str)\n            lens_ancestor_left[4] = lens_ancestor_left[4].astype(int)\n            lens_ancestor_left[4] = lens_ancestor_left[4] / lens_ancestor_left[4].max()\n            lens_ancestor_left = lens_ancestor_left[lens_ancestor_left[0].isin(\n                lens1.index)]\n        if self.ancestor_top != None:\n            axis[3] = -0.02\n            lens_ancestor_top = pd.read_csv(\n                self.ancestor_top, sep=\"\\t\", header=None)\n            lens_ancestor_top[0] = lens_ancestor_top[0].astype(str)\n            lens_ancestor_top[3] = lens_ancestor_top[3].astype(str)\n            lens_ancestor_top[4] = lens_ancestor_top[4].astype(int)\n            lens_ancestor_top[4] = lens_ancestor_top[4] / lens_ancestor_top[4].max()\n            lens_ancestor_top = lens_ancestor_top[lens_ancestor_top[0].isin(\n                lens2.index)]\n        if re.search(r'\\d', self.figsize):\n            self.figsize = [float(k) for k in self.figsize.split(',')]\n        else:\n            self.figsize = np.array(\n                [1, float(lens1.sum())/float(lens2.sum())])*10\n        plt.rcParams['ytick.major.pad'] = 0\n        fig, ax = plt.subplots(figsize=self.figsize)\n        ax.xaxis.set_ticks_position('top')\n        base.dotplot_frame(fig, ax, lens1, lens2, step1, step2,\n                           
self.genome1_name, self.genome2_name, [axis[0], axis[3]])\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n        gff1 = base.gene_location(gff1, lens1, step1, self.position)\n        gff2 = base.gene_location(gff2, lens2, step2, self.position)\n        if self.ancestor_top != None:\n            self.aree_left = self.ancestor_posion(ax, gff2, lens_ancestor_top, 'top')\n        if self.ancestor_left != None:\n            self.aree_top = self.ancestor_posion(ax, gff1, lens_ancestor_left, 'left')\n        print('read gffs')\n        blast = base.newblast(self.blast, int(self.score),\n                              float(self.evalue), gff1, gff2, self.blast_reverse)\n        if len(blast) == 0:\n            print('Stopped! \\n\\nThe gene IDs in the blast file do not match those in gff1 and gff2.')\n            exit(0)\n        print('read blast')\n        df = self.pair_positon(blast, gff1, gff2,\n                               int(self.multiple), int(self.repeat_number))\n        print('deal blast')\n        ax.scatter(df['loc2'], df['loc1'], s=float(self.markersize), c=df['color'],\n                   alpha=0.5, edgecolors=None, linewidths=0, marker='o')\n        ax.axis(axis)\n        plt.subplots_adjust(left=left, right=right, top=top, bottom=bottom)\n        plt.savefig(self.savefig, dpi=300)\n        plt.show()\n\n    def ancestor_posion(self, ax, gff, lens, mark):\n        data = []\n        for index, row in lens.iterrows():\n            loc1 = gff[(gff['chr'] == row[0]) & (\n                gff['order'] == int(row[1]))].index\n            loc2 = gff[(gff['chr'] == row[0]) & (\n                gff['order'] == int(row[2])-1)].index\n            loc1, loc2 = gff.loc[[loc1[0], loc2[0]], 'loc']\n            if mark == 'top':\n                width = abs(loc1-loc2)\n                loc = [min(loc1, loc2), 0]\n                height = -0.02\n                base.Rectangle(ax, loc, height, width, 
row[3], row[4])\n            if mark == 'left':\n                height = abs(loc1-loc2)\n                loc = [-0.02, min(loc1, loc2), ]\n                width = 0.02\n                base.Rectangle(ax, loc, height, width, row[3], row[4])\n            data.append([loc, height, width, row[3], row[4]])\n        return data\n"
  },
  {
    "path": "wgdi/example/__init__.py",
    "content": ""
  },
  {
    "path": "wgdi/example/align.conf",
    "content": "[alignment]\nblockinfo = block information file (.csv)\nblockinfo_reverse = false\nclassid =  class1\ngff1 =  gff1 file\ngff2 =  gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name =  Genome1 name\ngenome2_name =  Genome2 name\nmarkersize = 0.5\nks_area = -1,3\nposition = order\ncolors = red,blue,green\nfigsize = 10,10\nsavefile = savefile(.csv)\nsavefig= save image(.png, .pdf, .svg)"
  },
  {
    "path": "wgdi/example/alignmenttrees.conf",
    "content": "[alignmenttrees]\nalignment = alignment file (.csv)\ngff = gff file (reference genome, If alignment has no reference species, delete it)\nlens = lens file (If alignment has no reference species, delete it)\ndir = output folder\nsequence_file = sequence file (.fa)\ncds_file = cds file (.fa)\ncodon_positon = 1,2,3  (1,2 mean codon1&2; 1,2,3 mean no codon removed)\ntrees_file =  trees (.nwk)\nalign_software = (mafft,muscle)\ntree_software =  (iqtree,fasttree)\nthreads = 1 (Number,AUTO)\nmodel = MFP\ntrimming =  (trimal,divvier)\nminimum = 4\ndelete_detail = true\n"
  },
  {
    "path": "wgdi/example/ancestral_karyotype.conf",
    "content": "[ancestral_karyotype]\ngff = gff file (cat the relevant 'gff' files into a file)\npep_file = pep file (cat the relevant 'pep.fa' files into a file)\nancestor = ancestor file  (this file requires you to provide)\nmark = aak \nancestor_gff =  result file\nancestor_lens =  result file\nancestor_pep =  result file\nancestor_file =  result file"
  },
  {
    "path": "wgdi/example/ancestral_karyotype_repertoire.conf",
    "content": "[ancestral_karyotype_repertoire]\nblockinfo =  block information (*.csv)\n# blockinfo: processed *.csv\nblockinfo_reverse =  False\ngff1 =  gff1 file (ancestor's gff)\ngff2 =  gff2 file (the other species's gff)\ngap = 5\nmark = aak1s\nancestor = ancestor file \n#current ancestor file\nancestor_new =  result file\nancestor_pep =  ancestor pep file \n#cat all pep files together\nancestor_pep_new =  result file\nancestor_gff =  result file\nancestor_lens =  result file\n"
  },
  {
    "path": "wgdi/example/blockinfo.conf",
    "content": "[blockinfo]\nblast = blast file\ngff1 =  gff1 file\ngff2 =  gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ncollinearity = collinearity file\nscore = 100\nevalue = 1e-5\nrepeat_number = 20\nposition = order\nks = ks file\nks_col = ks_NG86\nsavefile = block information (*.csv)\n"
  },
  {
    "path": "wgdi/example/blockks.conf",
    "content": "[blockks]\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name =  Genome1 name\ngenome2_name =  Genome2 name\nblockinfo = block information (*.csv)\npvalue = 0.2\ntandem = true\ntandem_length = 200\nmarkersize = 1\narea = 0,2\nblock_length =  minimum length\nfigsize = 8,8\nsavefig = save image(.png, .pdf, .svg)\n"
  },
  {
    "path": "wgdi/example/circos.conf",
    "content": "[circos]\ngff =  gff file\nlens =  lens file\nradius = 0.2\nangle_gap = 0.05\nring_width = 0.015\ncolors  = 1:c,2:m,3:blue,4:gold,5:red,6:lawngreen,7:darkgreen,8:k,9:darkred,10:gray\nalignment = alignment file \nchr_label = chr\nancestor = ancestor alignment file \nancestor_location = ancestor file \nfigsize = 10,10\nlabel_size = 9\nposition = order\nlegend_square = 0.04, 0.04\ncolumn_names = 1,2,3,4,5\nsavefig = result(.png, .pdf, .svg)\n"
  },
  {
    "path": "wgdi/example/collinearity.conf",
    "content": "[collinearity]\ngff1 = gff1 file\ngff2 = gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\nblast = blast file\nblast_reverse = false\ncomparison = genomes\nmultiple  = 1\nprocess = 8\nevalue = 1e-5\nscore = 100\ngrading = 50,30,25\nmg = 25,25\npvalue = 1\nrepeat_number = 20\npositon = order\nsavefile = collinearity file\n"
  },
  {
    "path": "wgdi/example/conf.ini",
    "content": "[ini]\nmafft_path = /home/sunpc/micromamba/envs/wgdi/bin/mafft\npal2nal_path = /home/sunpc/micromamba/envs/wgdi/bin/pal2nal.pl\nyn00_path = /home/sunpc/micromamba/envs/wgdi/bin/yn00\nmuscle_path = /home/sunpc/micromamba/envs/wgdi/bin/muscle\niqtree_path =  /home/sunpc/micromamba/envs/wgdi/bin/iqtree\ntrimal_path = /home/sunpc/micromamba/envs/wgdi/bin/trimal\nfasttree_path = /home/sunpc/micromamba/envs/wgdi/bin/fasttree\ndivvier_path = /home/sunpc/micromamba/envs/wgdi/bin/divvier\n"
  },
  {
    "path": "wgdi/example/corr.conf",
    "content": "[correspondence]\nblockinfo =  blockinfo file(.csv) \nlens1 = lens1 file\nlens2 = lens2 file\ntandem = true\ntandem_length = 200\npvalue = 0.2\nblock_length = 5\ntandem_ratio = 0.5\nmultiple  = 1\nhomo = -1,1\nsavefile = savefile(.csv)\n"
  },
  {
    "path": "wgdi/example/dotplot.conf",
    "content": "[dotplot]\nblast = blast file\ngff1 =  gff1 file\ngff2 =  gff2 file\nlens1 = lens1 file\nlens2 = lens2 file\ngenome1_name =  Genome1 name\ngenome2_name =  Genome2 name\nmultiple  = 1\nscore = 100\nevalue = 1e-5\nrepeat_number = 10\nposition = order\nblast_reverse = false\nancestor_left = ancestor file or none\nancestor_top = ancestor file or none\nmarkersize = 0.5\nfigsize = 10,10\nsavefig = savefile(.png, .pdf, .svg)\n"
  },
  {
    "path": "wgdi/example/fusion_positions_database.conf",
    "content": "[fusion_positions_database]\npep = pep file\ngff = gff file\nfusion_positions = fusion_positions file\n# Number of gene sets on each side of the breakpoint\nancestor_gff =  result file\nancestor_lens =  result file\nancestor_pep =  result file\nancestor_file =  result file\n"
  },
  {
    "path": "wgdi/example/fusions_detection.conf",
    "content": "[fusions_detection]\nblockinfo = block information (*.csv)\nancestor = ancestor file\n#The number of genes spanned by a synteny block on both sides of a breakpoint.\nmin_genes_per_side = 5\ndensity = 0.3\nfiltered_blockinfo = result blockinfo (.csv)\n"
  },
  {
    "path": "wgdi/example/karyotype.conf",
    "content": "[karyotype]\nancestor = ancestor chromosome file\nwidth = 0.5\nfigsize = 10,6.18\nsavefig = save image(.png, .pdf, .svg)"
  },
  {
    "path": "wgdi/example/karyotype_mapping.conf",
    "content": "[karyotype_mapping]\nblast = blast file\nblast_reverse = false\ngff1 = gff1 file\ngff2 = gff2 file \nscore = 100\nevalue = 1e-5\nrepeat_number = 5\nancestor_left = ancestor location file (Only one of ('left', 'top') can be reserved)\nancestor_top = ancestor location file\nthe_other_lens = the other lens file\nblockinfo = block information (*.csv)\nblockinfo_reverse = false\nlimit_length = 5\nthe_other_ancestor_file =  result file "
  },
  {
    "path": "wgdi/example/ks.conf",
    "content": "[ks]\ncds_file = \tcds file \n#cat all cds files together\npep_file = \tpep file\n#cat all pep files together\nalign_software = muscle\npairs_file = gene pairs file\nks_file = ks result"
  },
  {
    "path": "wgdi/example/ks_fit_result.csv",
    "content": ",color,linewidth,linestyle,,,,,,\ncsa_csa,red,2,-,2.532090116,1.510453744,0.229652282,1.638111687,2.048906176,0.345639862\nvvi_vvi,blue,2,-,3.00367275,1.288717936,0.177816426,,,\nvvi_oin_gamma,orange,2,-,1.910418336,1.328469514,0.262257112,,,\nvvi_oin,orange,2,--,4.948194212,0.882608858,0.10426873,,,\nvvi_csa,green,2,--,2.470770292464022,1.4131842495219498,0.21391959288821544,,,\n"
  },
  {
    "path": "wgdi/example/ksfigure.conf",
    "content": "[ksfigure]\nksfit = ksfit result(*.csv)\nlabelfontsize = 15\nlegendfontsize = 15\nxlabel = none            \nylabel = none            \ntitle = none\narea = 0,2\nfigsize = 10,6.18\nshadow = true (true/false)\nsavefig =  save image(.png, .pdf, .svg)\n"
  },
  {
    "path": "wgdi/example/kspeaks.conf",
    "content": "[kspeaks]\nblockinfo = block information (*.csv)\npvalue = 0.2\ntandem = true\nblock_length = int number\nks_area = 0,10\nmultiple  = 1\nhomo = 0,1\nfontsize = 9\narea = 0,3\nfigsize = 10,6.18\nsavefig = saving image(.png,.pdf)\nsavefile = ks medain savefile\n"
  },
  {
    "path": "wgdi/example/peaksfit.conf",
    "content": "[peaksfit]\nblockinfo = block information (*.csv)\nmode = median\nbins_number = 200\nks_area = 0,10\nfontsize = 9\narea = 0,3\nfigsize = 10,6.18\nshadow = true \nsavefig = saving image(.png,.pdf,.svg)"
  },
  {
    "path": "wgdi/example/pindex.conf",
    "content": "[pindex]\nalignment = alignment file (.csv)\ngff = gff file\nlens =lens file\ngap = 50\nretention = 0.05\ndiff = 0.05\nremove_delta = (true/false)\nsavefile = result file(.csv)\n"
  },
  {
    "path": "wgdi/example/polyploidy_classification.conf",
    "content": "[polyploidy classification]\nblockinfo = block information (*.csv)\nancestor_left = ancestor file\nancestor_top = ancestor file\nclassid = class1,class2\nsame_protochromosome =  False\nsame_subgenome =  False\nsavefile = result file(.csv)"
  },
  {
    "path": "wgdi/example/retain.conf",
    "content": "[retain]\nalignment = alignment file\ngff = gff file\nlens = lens file\ncolors = red,blue,green\nrefgenome = shorthand\nfigsize = 10,12\nstep = 50\nylabel = y label\nsavefile = retain file (result)\nsavefig = result(.png, .pdf, .svg)\n"
  },
  {
    "path": "wgdi/example/shared_fusion.conf",
    "content": "[shared_fusion]\nblockinfo = block information (*.csv)\n# The new lens file is the output filtered by lens file.\nlens1 = lens file, new lens file\nlens2 =  lens file,  new lens file\nancestor_left = ancestor file\nancestor_top = ancestor file\nclassid = class1,class2\nlimit_length = 5\nfiltered_blockinfo = result blockinfo (.csv)"
  },
  {
    "path": "wgdi/fusion_positions_database.py",
    "content": "import pandas as pd\nimport os\nfrom Bio import SeqIO\n\nclass fusion_positions_database:\n    def __init__(self, options):\n        for k, v in options:\n            setattr(self, k, v)\n            print(f'{k} = {v}')\n\n    def run(self):\n        # Load and remove duplicates from data\n        gff = pd.read_csv(self.gff, sep=\"\\t\", header=None, dtype={0: str, 5: int}).drop_duplicates()\n        pep = SeqIO.to_dict(SeqIO.parse(self.pep, \"fasta\"))\n        df = pd.read_csv(self.fusion_positions, sep=\"\\t\", header=None, dtype={0: str, 1: int, 2:int, 3:str}).drop_duplicates()\n        \n        # Load ancestral sequence file if it exists\n        seqs = SeqIO.to_dict(SeqIO.parse(self.ancestor_pep, \"fasta\")) if os.path.exists(self.ancestor_pep) else {}\n\n        sf_gff, sf_lens = [], []\n\n        # Process fusion positions\n        for _, row in df.iterrows():\n            newchr = row[3]\n            newgff = gff[(gff[0] == row[0]) & \n                         (gff[5] >= row[1] - row[2]) & \n                         (gff[5] < row[1] + row[2])].copy()\n            newgff['id'] = [f\"{newchr}s{str(row[0]).zfill(2)}g{str(i).zfill(3)}\" for i in range(1, len(newgff) + 1)]\n\n            sf_position = row[1] - newgff.iloc[0, 5]\n            sf_lens.append([newchr, sf_position, len(newgff)])\n            \n            # For each gene in the filtered GFF region\n            for _, gff_row in newgff.iterrows():\n                if gff_row[1] in pep and gff_row['id'] not in seqs:\n                    gene = pep[gff_row[1]][:]\n                    gene.id, gene.description = gff_row['id'], ''\n                    seqs[gff_row['id']] = gene\n                    # Collect data for the final GFF output\n                    sf_gff.append([gff_row['id'], newchr, sf_position, gff_row[2], gff_row[3], gff_row[4], gff_row[1]])\n\n        # Write sequences to FASTA file\n        SeqIO.write(seqs.values(), self.ancestor_pep, 'fasta')\n\n        # Save filtered 
GFF data\n        if sf_gff:\n            sf_gff = pd.DataFrame(sf_gff)\n            sf_gff.rename(columns={3: 'start', 4: 'end', 5: 'strand'}, inplace=True)\n            sf_gff['order'] = sf_gff[0].str[-3:].astype(int)\n            sf_gff[[1, 0, 'start', 'end', 'strand', 'order', 6]].to_csv(self.ancestor_gff, sep=\"\\t\", mode='a', index=False, header=None)\n            sf_lens = pd.DataFrame(sf_lens).drop_duplicates()\n            sf_lens.to_csv(self.ancestor_lens, sep=\"\\t\", mode='a', index=False, header=None)\n\n            # Generate ancestral sequence data\n            ancestor = []\n            for _, row in sf_lens.iterrows():\n                ancestor.append([row[0], 1, row[1], 'red', 1])\n                ancestor.append([row[0], row[1] + 1, row[2], 'blue', 1])\n            pd.DataFrame(ancestor).to_csv(self.ancestor_file, sep=\"\\t\", mode='a', index=False, header=None)\n\n        # Remove duplicates from the output files\n        for file in [self.ancestor_gff, self.ancestor_lens, self.ancestor_file]:\n            df = pd.read_csv(file, header=None).drop_duplicates().to_csv(file, index=False, header=None)\n"
  },
  {
    "path": "wgdi/fusions_detection.py",
    "content": "import pandas as pd\nfrom tabulate import tabulate\n\nclass fusions_detection:\n    def __init__(self, options):\n        self.min_genes_per_side = 5\n        self.density = 0.3\n        for k, v in options:\n            setattr(self, k, v)\n            print(f\"{k} = {v}\")\n        self.min_genes_per_side = int(self.min_genes_per_side)\n        self.density = float(self.density)\n\n    def run(self):\n        # Load the ancestor file and process the positions\n        ancestor = pd.read_csv(self.ancestor, sep='\\t', header=None)\n        position = ancestor.groupby(0)[2].unique().apply(pd.Series)\n        bkinfo = pd.read_csv(self.blockinfo)\n        newbkinfo = bkinfo.head(0)\n        \n        # Iterate over each row in the position dataframe\n        for index, row in position.iterrows():\n            # Filter the bkinfo dataframe based on chr2 and density\n            filtered_group = bkinfo[(bkinfo['chr2'] == index) & (bkinfo['density2'] >= self.density)].copy()\n            # Split the block2 column and stack the resulting series\n            df = filtered_group['block2'].str.split('_', expand=True).stack().astype(int)\n            # Count the number of genes greater and less than the current position\n            filtered_group['greater'] = (df > row[0]).groupby(level=0).sum()\n            filtered_group['less'] = (df < row[0]).groupby(level=0).sum()\n            # Filter the group based on the minimum number of genes per side\n            filtered_group = filtered_group[(filtered_group['greater'] >= self.min_genes_per_side) & (filtered_group['less'] >= self.min_genes_per_side)]\n            # Concatenate the filtered group with the newbkinfo dataframe\n            newbkinfo = pd.concat([newbkinfo, filtered_group])\n        if len(newbkinfo) ==0:\n            print(\"\\nNo shared fusion breakpoints detected\")\n            exit(0)\n\n        # Get and print the shared fusion positions\n        newbkinfo.to_csv(self.filtered_blockinfo, 
header=True, index=False)\n        non_overlap_counts = newbkinfo.groupby('chr2').apply(self.count_non_overlapping)\n        data = [(chr2, count) for chr2, count in non_overlap_counts.items()]\n        print(\"\\nThe following are the shared fusion breakpoints and counts:\")\n        print(tabulate(data, headers=[\"Fusion Breakpoint\", \"Count\"], tablefmt=\"github\"))\n\n    def count_non_overlapping(self, group):\n        if len(group) == 1:\n            return 1\n        grouped = group.groupby('chr1')\n        total_count = 0\n        for chr1, chr_group in grouped:\n            chr_group = chr_group.sort_values(by='start1').reset_index(drop=True)\n            count = 0\n            current_end = -1 \n            for _, row in chr_group.iterrows():\n                start1, end1 = row['start1'], row['end1']\n                if start1 > current_end:\n                    count += 1\n                    current_end = end1 \n            total_count += count\n        return total_count"
  },
  {
    "path": "wgdi/karyotype.py",
    "content": "import matplotlib.pyplot as plt\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass karyotype():\n    def __init__(self, options):\n        self.width = 0.5\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(str(k), ' = ', v)\n        if hasattr(self, 'figsize'):\n            self.figsize = [float(k) for k in self.figsize.split(',')]\n        else:\n            self.figsize = 10, 6.18\n        if hasattr(self, 'width'):\n            self.width = float(self.width)\n        else:\n            self.width = 0.5\n\n    def run(self):\n        fig, ax = plt.subplots(figsize=self.figsize)\n        ancestor_lens = pd.read_csv(\n            self.ancestor, sep=\"\\t\", header=None)\n        ancestor_lens[0] = ancestor_lens[0].astype(str)\n        ancestor_lens[3] = ancestor_lens[3].astype(str)\n        ancestor_lens[4] = ancestor_lens[4].astype(int)\n        ancestor_lens[4] = ancestor_lens[4] / ancestor_lens[4].max()\n        chrs = ancestor_lens[0].drop_duplicates().to_list()\n        ax.bar(chrs, 10, color='white', alpha=0)\n        for index, row in ancestor_lens.iterrows():\n            base.Rectangle(ax, [chrs.index(row[0])-self.width*0.5,\n                                row[1]], row[2]-row[1], self.width, row[3], row[4])\n        ax.tick_params(labelsize=15)\n        ax.spines['top'].set_visible(False)\n        ax.spines['right'].set_visible(False)\n        ax.spines['left'].set_visible(False)\n        ax.spines['bottom'].set_visible(False)\n        ax.set_xticks([])\n        ax.set_yticks([])\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n"
  },
  {
    "path": "wgdi/karyotype_mapping.py",
    "content": "import numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\n\n\nclass karyotype_mapping:\n    def __init__(self, options):\n        # Initialize default attributes\n        self.blast_reverse = False\n        self.blockinfo_reverse = False\n        self.position = 'order'\n        self.block_length = 5\n        self.limit_length = 5\n        self.repeat_number = 20\n        self.score = 100\n        self.evalue = 1e-5\n\n        # Update attributes with provided keyword arguments and print them\n        for k, v in options:\n            setattr(self, k, v)\n            print(f\"{k} = {v}\")\n        \n        self.blast_reverse = base.str_to_bool(self.blast_reverse)\n        self.blockinfo_reverse = base.str_to_bool(self.blockinfo_reverse)\n        self.limit_length = int(self.limit_length)\n\n    def karyotype_left(self, pairs, ancestor, gff1, gff2):\n        # Loop through each row in ancestor to set color and classification in gff1\n        for _, row in ancestor.iterrows():\n            loc_min, loc_max = sorted([row[1], row[2]])\n            index1 = gff1[(gff1['chr'] == row[0]) &\n                          (gff1['order'] >= loc_min) &\n                          (gff1['order'] <= loc_max)].index\n            gff1.loc[index1, ['color', 'classification']] = row[3], row[4]\n\n        # Merge pairs with gff1 and update gff2 with color and classification\n        data = pd.merge(pairs, gff1, left_on=0, right_index=True, how='left')\n        data.drop_duplicates(subset=[1], inplace=True)\n        data.set_index(1, inplace=True)\n        gff2.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]\n        return gff2\n\n    def karyotype_top(self, pairs, ancestor, gff1, gff2):\n        # Loop through each row in ancestor to set color and classification in gff2\n        for _, row in ancestor.iterrows():\n            loc_min, loc_max = sorted([row[1], row[2]])\n            index1 = gff2[(gff2['chr'] == row[0]) &\n     
                     (gff2['order'] >= loc_min) &\n                          (gff2['order'] <= loc_max)].index\n            gff2.loc[index1, ['color', 'classification']] = row[3], row[4]\n\n        # Merge pairs with gff2 and update gff1 with color and classification\n        data = pd.merge(pairs, gff2, left_on=1, right_index=True, how='left')\n        data.drop_duplicates(subset=[0], inplace=True)\n        data.set_index(0, inplace=True)\n        gff1.loc[data.index, ['color', 'classification']] = data[['color', 'classification']]\n        return gff1\n\n    def karyotype_map(self, gff, lens):\n        # Filter gff based on lens index and non-null color\n        gff = gff[gff['chr'].isin(lens.index) & gff['color'].notnull()]\n        ancestor = []\n        # Group by chromosome and process each group to create ancestor records\n        for chr, group in gff.groupby('chr'):\n            color, class_id, arr = '', 1, []\n            for _, row in group.iterrows():\n                if color ==  row['color'] and class_id == row['classification']:\n                    arr.append(row['order'])\n                else:\n                    if len(arr) >= self.limit_length:\n                        ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])\n                    color, class_id = row['color'], row['classification']\n                    arr = []\n                    if len(ancestor) >= 1 and color == ancestor[-1][3] and class_id == ancestor[-1][4] and chr == ancestor[-1][0]:\n                        arr.append(ancestor[-1][1])\n                        arr += np.random.randint(ancestor[-1][1], ancestor[-1][2], size=ancestor[-1][5]-1).tolist()\n                        ancestor.pop()\n                    arr.append(row['order'])\n            if len(arr) >= self.limit_length:\n                ancestor.append([chr, min(arr), max(arr), color, class_id, len(arr)])\n\n        ancestor = pd.DataFrame(ancestor)\n        # Adjust min and max positions for 
each chromosome group\n        for chr, group in ancestor.groupby(0):\n            ancestor.loc[group.index[0], 1] = 1\n            ancestor.loc[group.index[-1], 2] = lens[chr]\n        ancestor[4] = ancestor[4].astype(int)\n        return ancestor[[0, 1, 2, 3, 4, 5]]\n\n    def colinear_gene_pairs(self, bkinfo, gff1, gff2):\n        gff1 = gff1.reset_index()\n        gff2 = gff2.reset_index()\n        \n        gff1_indexed = gff1.set_index(['chr', 'order'])\n        gff2_indexed = gff2.set_index(['chr', 'order'])\n        \n        data = []\n        for _, row in bkinfo.iterrows():\n            b1 = list(map(int, row['block1'].split('_')))\n            b2 = list(map(int, row['block2'].split('_')))\n\n            for order1, order2 in zip(b1, b2):\n                a = gff1_indexed.loc[(row['chr1'], order1), 1]\n                b = gff2_indexed.loc[(row['chr2'], order2), 1]\n                data.append([a, b])\n        return pd.DataFrame(data)\n    \n    def new_ancestor(self, ancestor, gff1, gff2, blast):\n        # Iterate through ancestor rows to adjust positions based on neighboring rows\n        for i in range(1, len(ancestor)):\n            if ancestor.iloc[i, 0] == ancestor.iloc[i-1, 0]:\n                area = ancestor.iloc[i, 1] - ancestor.iloc[i-1, 2]\n                if area <= 5:\n                    ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1\n                else:\n                    index1 = gff1[(gff1['chr'] == ancestor.iloc[i, 0]) &\n                                (gff1['order'] >= ancestor.iloc[i-1, 2]+1) &\n                                (gff1['order'] <= ancestor.iloc[i, 1]-1)].index\n                    index2 = gff2[gff2['color'] == ancestor.iloc[i-1, 3]].index\n                    index3 = gff2[gff2['color'] == ancestor.iloc[i, 3]].index\n\n                    newblast1 = blast[(blast[0].isin(index1)) & (blast[1].isin(index2))]\n                    newblast2 = blast[(blast[0].isin(index1)) & (blast[1].isin(index3))]\n\n               
     if len(newblast1) >= len(newblast2):\n                        ancestor.iloc[i-1, 2] = ancestor.iloc[i, 1] - 1\n                    else:\n                        ancestor.iloc[i, 1] = ancestor.iloc[i-1, 2] + 1\n        for chr, group in ancestor.groupby(0):\n            if len(group) == 1:\n                continue\n            newgff1 = gff1[gff1['chr'] == chr]\n            for i in range(1, len(group)):\n                if group.iloc[i, 5] > 200:\n                    continue\n\n                index_left = newgff1[(newgff1['order'] >= group.iloc[i, 1]) &\n                                (newgff1['order'] <= group.iloc[i, 2])].index\n                blast_left = blast[blast[0].isin(index_left)]\n\n                index_prev = gff2[gff2['color'] == group.iloc[i-1, 3]].index\n                blast_prev = blast_left[blast_left[1].isin(index_prev)]\n\n                index_curr = gff2[gff2['color'] == group.iloc[i, 3]].index\n                blast_curr = blast_left[blast_left[1].isin(index_curr)]\n\n                if len(blast_curr) <= len(blast_prev):\n                    ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]-1,3]\n\n                if i < len(group)-1:\n                    index_next = gff2[gff2['color'] == group.iloc[i+1, 3]].index\n                    blast_next = blast_left[blast_left[1].isin(index_next)]\n                    if len(blast_next) > max(len(blast_prev),len(blast_curr)):\n                        ancestor.loc[group.index[i],3] = ancestor.loc[group.index[i]+1,3]\n        \n        ancestor['group'] = (ancestor[0].shift(1) != ancestor[0]) | (ancestor[3].shift(1) != ancestor[3]) | (ancestor[4].shift(1) != ancestor[4])\n        ancestor['group'] = ancestor['group'].cumsum()\n        result = ancestor.groupby('group').agg({\n            0: 'first',\n            1: 'min',\n            2: 'max',\n            3: 'first',\n            4: 'first',\n        }).reset_index(drop=True)\n\n        return result\n\n    def run(self):\n   
     # Read and process block information\n        bkinfo = pd.read_csv(self.blockinfo, index_col='id')\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        if self.blockinfo_reverse == True:\n            bkinfo[['chr1', 'chr2']] =  bkinfo[['chr2', 'chr1']]\n            bkinfo[['block1', 'block2']] =  bkinfo[['block2', 'block1']]\n        bkinfo = bkinfo[bkinfo['length'] > int(self.block_length)]\n\n        # Read GFF and lens data\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n        lens = base.newlens(self.the_other_lens, self.position)\n        blast = base.newblast(self.blast, int(self.score), float(self.evalue), gff1, gff2, self.blast_reverse)\n        # blast.drop_duplicates(subset=[0], keep='first', inplace=True)\n\n        # Find colinear gene pairs\n        pairs = self.colinear_gene_pairs(bkinfo, gff1, gff2)\n\n        # Depending on available attributes, call either karyotype_top or karyotype_left\n        if hasattr(self, 'ancestor_top'):\n            ancestor = base.read_classification(self.ancestor_top)\n            data = self.karyotype_top(pairs, ancestor, gff1, gff2)\n        elif hasattr(self, 'ancestor_left'):\n            ancestor = base.read_classification(self.ancestor_left)\n            data = self.karyotype_left(pairs, ancestor, gff1, gff2)\n            gff1, gff2 = gff2, gff1\n            blast.iloc[:, :2] = blast.iloc[:, [1, 0]].to_numpy()\n        else:\n            print('Missing ancestor file.')\n            exit(0)\n\n        # Map the data and create the final ancestor file\n        the_other_ancestor_file = self.karyotype_map(data, lens)\n        the_other_ancestor_file = self.new_ancestor(the_other_ancestor_file, gff1, gff2, blast)\n        the_other_ancestor_file.to_csv(self.the_other_ancestor_file, sep='\\t', header=False, index=False)"
  },
  {
    "path": "wgdi/ks.py",
    "content": "import os\nimport sys\nimport numpy as np\nimport pandas as pd\nfrom Bio import SeqIO\nimport subprocess\nfrom Bio.Phylo.PAML import yn00\nimport wgdi.base as base\n\n\nclass ks:\n    def __init__(self, options):\n        base_conf = base.config()\n        self.pair_pep_file = 'pair.pep'\n        self.pair_cds_file = 'pair.cds'\n        self.prot_align_file = 'prot.aln'\n        self.mrtrans = 'pair.mrtrans'\n        self.pair_yn = 'pair.yn'\n\n        for k, v in base_conf:\n            setattr(self, str(k), v)\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f'{str(k)} = {v}')\n\n    def auto_file(self):\n        pairs = []\n        with open(self.pairs_file) as f:\n            p = ' '.join(f.readlines()[:30])\n\n        # Detect file format and process accordingly\n        if 'path length' in p or 'MAXIMUM GAP' in p:\n            collinearity = base.read_colinearscan(self.pairs_file)\n            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]\n        elif 'MATCH_SIZE' in p or '## Alignment' in p:\n            collinearity = base.read_mcscanx(self.pairs_file)\n            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]\n        elif '# Alignment' in p:\n            collinearity = base.read_collinearity(self.pairs_file)\n            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]\n        elif '###' in p:\n            collinearity = base.read_jcvi(self.pairs_file)\n            pairs = [[v[0], v[2]] for k in collinearity for v in k[1]]\n        elif ',' in p:\n            collinearity = pd.read_csv(self.pairs_file, header=None)\n            pairs = collinearity.values.tolist()\n        else:\n            collinearity = pd.read_csv(self.pairs_file, header=None, sep='\\t')\n            pairs = collinearity.values.tolist()\n\n        df = pd.DataFrame(pairs).drop_duplicates()\n        df[0] = df[0].astype(str)\n        df[1] = df[1].astype(str)\n        df.index = df[0] + ',' + 
df[1]\n        return df\n\n    def run(self):\n        # Load sequence data\n        cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, \"fasta\"))\n        pep = SeqIO.to_dict(SeqIO.parse(self.pep_file, \"fasta\"))\n        df_pairs = self.auto_file()\n\n        # Check if ks file exists and load it, otherwise create a new one\n        if os.path.exists(self.ks_file):\n            ks = pd.read_csv(self.ks_file, sep='\\t').drop_duplicates()\n            kscopy = ks.copy()\n            names = ks.columns.tolist()\n            names[0], names[1] = names[1], names[0]\n            kscopy.columns = names\n            ks = pd.concat([ks, kscopy])\n            ks['id'] = ks['id1'] + ',' + ks['id2']\n            df_pairs.drop(np.intersect1d(df_pairs.index, ks['id'].to_numpy()), inplace=True)\n            ks_file = open(self.ks_file, 'a+')\n        else:\n            ks_file = open(self.ks_file, 'w')\n            ks_file.write('\\t'.join(['id1', 'id2', 'ka_NG86', 'ks_NG86', 'ka_YN00', 'ks_YN00']) + '\\n')\n\n        # Filter valid pairs based on sequence data\n        df_pairs = df_pairs[\n            (df_pairs[0].isin(cds.keys())) & (df_pairs[1].isin(cds.keys())) &\n            (df_pairs[0].isin(pep.keys())) & (df_pairs[1].isin(pep.keys()))\n        ]\n\n        pairs = df_pairs[[0, 1]].to_numpy()\n\n        if len(pairs) > 0 and pairs[0][0][:3] == pairs[0][1][:3]:\n            allpairs = []\n            pair_hash = {}\n            for k in pairs:\n                if k[0] + ',' + k[1] in pair_hash or k[1] + ',' + k[0] in pair_hash:\n                    continue\n                else:\n                    pair_hash[k[0] + ',' + k[1]] = 1\n                    pair_hash[k[1] + ',' + k[0]] = 1\n                    allpairs.append(k)\n            pairs = allpairs\n\n        for k in pairs:\n            cds_gene1, cds_gene2 = cds[k[0]], cds[k[1]]\n            cds_gene1.id, cds_gene2.id = 'gene1', 'gene2'\n            pep_gene1, pep_gene2 = pep[k[0]], pep[k[1]]\n            
pep_gene1.id, pep_gene2.id = 'gene1', 'gene2'\n\n            # Write sequences to files\n            SeqIO.write([cds[k[0]], cds[k[1]]], self.pair_cds_file, \"fasta\")\n            SeqIO.write([pep[k[0]], pep[k[1]]], self.pair_pep_file, \"fasta\")\n\n            # Compute Ka/Ks values; pair_kaks returns an empty list on failure\n            kaks = self.pair_kaks(['gene1', 'gene2'])\n            if not kaks:\n                continue\n\n            ks_file.write('\\t'.join([str(i) for i in list(k) + list(kaks)]) + '\\n')\n\n        ks_file.close()\n\n        # Clean up temporary files\n        for file in [\n            self.pair_pep_file, self.pair_cds_file, self.mrtrans, self.pair_yn,\n            self.prot_align_file, '2YN.dN', '2YN.dS', '2YN.t', 'rst', 'rst1', 'yn00.ctl', 'rub'\n        ]:\n            try:\n                os.remove(file)\n            except OSError:\n                pass\n\n    def pair_kaks(self, k):\n        self.align()\n        pal = self.pal2nal()\n        if not pal:\n            return []\n\n        kaks = self.run_yn00()\n        if kaks is None:\n            return []\n\n        kaks_new = [\n            kaks[k[0]][k[1]]['NG86']['dN'], kaks[k[0]][k[1]]['NG86']['dS'],\n            kaks[k[0]][k[1]]['YN00']['dN'], kaks[k[0]][k[1]]['YN00']['dS']\n        ]\n        return kaks_new\n\n    def align(self):\n        if self.align_software == 'mafft':\n            try:\n                command = [self.mafft_path, '--quiet', self.pair_pep_file, '>', self.prot_align_file]\n                subprocess.run(\" \".join(command), shell=True, check=True)\n            except subprocess.CalledProcessError as e:\n                print(f\"Error while running MAFFT: {e}\")\n\n        elif self.align_software == 'muscle':\n            try:\n                command = [self.muscle_path, '-align', self.pair_pep_file, '-output', self.prot_align_file, '-quiet']\n                subprocess.run(\" \".join(command), shell=True, check=True)\n            except subprocess.CalledProcessError as e:\n         
       print(f\"Error while running Muscle: {e}\")\n\n    def pal2nal(self):\n        args = ['perl', self.pal2nal_path, self.prot_align_file, self.pair_cds_file, '-output paml -nogap', '>' + self.mrtrans]\n        command = ' '.join(args)\n        try:\n            os.system(command)\n        except Exception:\n            return False\n        return True\n\n    def run_yn00(self):\n        yn = yn00.Yn00()\n        yn.alignment = self.mrtrans\n        yn.out_file = self.pair_yn\n        yn.set_options(icode=0, commonf3x4=0, weighting=0, verbose=1)\n\n        try:\n            run_result = yn.run(command=self.yn00_path)\n        except Exception:\n            run_result = None\n        return run_result\n"
  },
  {
    "path": "wgdi/ks_peaks.py",
    "content": "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.stats import gaussian_kde\n\nimport wgdi.base as base\n\nclass kspeaks:\n    def __init__(self, options):\n        # Default values\n        self.tandem_length = 200\n        self.figsize = 10, 6.18\n        self.fontsize = 9\n        self.block_length = 3\n        self.area = 0, 3\n        self.tandem =  True\n\n        # Set options passed in\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f'{str(k)} = {v}')\n\n        # Convert string values to lists of floats\n        self.homo = [float(k) for k in self.homo.split(',')]\n        self.ks_area = [float(k) for k in self.ks_area.split(',')]\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n        self.area = [float(k) for k in self.area.split(',')]\n        self.pvalue = float(self.pvalue)\n        self.block_length = int(self.block_length)\n        self.tandem = base.str_to_bool(self.tandem)\n\n    def remove_tandem(self, bkinfo):\n        \"\"\"\n        Remove tandem duplications based on start and end position differences.\n        \"\"\"\n        group = bkinfo[bkinfo['chr1'] == bkinfo['chr2']].copy()\n        group.loc[:, 'start'] = group.loc[:, 'start1'] - group.loc[:, 'start2']\n        group.loc[:, 'end'] = group.loc[:, 'end1'] - group.loc[:, 'end2']\n        \n        # Drop rows where start or end difference is within tandem length\n        index = group[(group['start'].abs() <= self.tandem_length) | \n                      (group['end'].abs() <= self.tandem_length)].index\n        bkinfo = bkinfo.drop(index)\n        return bkinfo\n\n    def ks_kde(self, df):\n        \"\"\"\n        Perform kernel density estimation (KDE) on Ks data.\n        \"\"\"\n        # Clean up 'ks' column by removing leading underscores\n        df.loc[df['ks'].str.startswith('_'), 'ks'] = df.loc[df['ks'].str.startswith('_'), 'ks'].str[1:]\n        \n        ks = 
df['ks'].str.split('_')\n        arr = []\n        ks_ave = []\n        \n        # Collect individual Ks values and calculate average Ks per row\n        for v in ks.values:\n            v = [float(k) for k in v if float(k) >= 0]\n            if len(v) == 0:\n                continue\n            arr.extend(v)\n            ks_ave.append(sum(v) / len(v))  # Mean of each row's Ks values\n        \n        # KDE for three distributions: median, average, total\n        kdemedian = gaussian_kde(df['ks_median'].values)\n        kdemedian.set_bandwidth(bw_method=kdemedian.factor / 3.)\n        \n        kdeaverage = gaussian_kde(ks_ave)\n        kdeaverage.set_bandwidth(bw_method=kdeaverage.factor / 3.)\n        \n        kdetotal = gaussian_kde(arr)\n        kdetotal.set_bandwidth(bw_method=kdetotal.factor / 3.)\n\n        return [kdemedian, kdeaverage, kdetotal]\n\n    def run(self):\n        \"\"\"\n        Main method to process the data, perform KDE, and generate the plot.\n        \"\"\"\n        plt.rcParams['ytick.major.pad'] = 0\n        fig, ax = plt.subplots(figsize=self.figsize)\n\n        # Read the block info file\n        bkinfo = pd.read_csv(self.blockinfo)\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        bkinfo['length'] = bkinfo['length'].astype(int)\n\n        # Filter based on block length and p-value\n        bkinfo = bkinfo[(bkinfo['length'] > self.block_length) &\n                        (bkinfo['pvalue'] < self.pvalue)]\n\n        # Remove tandem duplications if needed\n        if self.tandem == False:\n            bkinfo = self.remove_tandem(bkinfo)\n\n        # Further filtering based on the homology-score (homo) range and the Ks range\n        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] >= self.homo[0]]\n        bkinfo = bkinfo[bkinfo[f'homo{self.multiple}'] <= self.homo[1]]\n        bkinfo = bkinfo[bkinfo['ks_median'] >= self.ks_area[0]]\n        bkinfo = bkinfo[bkinfo['ks_median'] <= 
self.ks_area[1]]\n\n        # Perform KDE on the Ks data\n        kdemedian, kdeaverage, kdetotal = self.ks_kde(bkinfo)\n\n        # Define the range for the x-axis (Ks values)\n        dist_space = np.linspace(self.area[0], self.area[1], 500)\n\n        # Plot the KDE results\n        ax.plot(dist_space, kdemedian(dist_space), color='red', label='block median')\n        ax.plot(dist_space, kdeaverage(dist_space), color='black', label='block average')\n        ax.plot(dist_space, kdetotal(dist_space), color='blue', label='all pairs')\n\n        # Set plot labels, grid, and limits\n        ax.grid()\n        ax.set_xlabel(r'${K_{s}}$', fontsize=20)\n        ax.set_ylabel('Frequency', fontsize=20)\n        ax.tick_params(labelsize=18)\n        ax.set_xlim(self.area)\n        ax.legend(fontsize=20)\n\n        # Adjust layout for better display\n        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)\n\n        # Save the figure\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n\n        # Save the filtered data to CSV\n        bkinfo.to_csv(self.savefile, index=False)"
  },
  {
    "path": "wgdi/ksfigure.py",
    "content": "import re\nimport sys\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\nfrom scipy import stats\n\n\nclass ksfigure():\n    def __init__(self, options):\n        self.figsize = 10, 6.18\n        self.legendfontsize = 30\n        self.labelfontsize = 9\n        self.area = 0, 3\n        self.shadow = True\n        self.mode = 'median'\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(str(k), ' = ', v)\n        if self.xlabel == 'none' or self.xlabel == '':\n            self.xlabel = r'Synonymous nucleotide substitution (${K_{s}}$)'\n        if self.ylabel == 'none' or self.ylabel == '':\n            self.ylabel = 'kernel density of syntenic blocks'\n        if self.title == 'none' or self.title == '':\n            self.title = ''\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n        self.area = [float(k) for k in self.area.split(',')]\n        self.shadow = base.str_to_bool(self.shadow)\n\n    def Gaussian_distribution(self, t, k):\n        y = np.zeros(len(t))\n        for i in range(0, int((len(k) - 1) / 3)+1):\n            if np.isnan(k[3 * i + 2]):\n                continue\n            k[3 * i + 2] = float(k[3 * i + 2])/np.sqrt(2)\n            k[3 * i + 0] = float(k[3 * i + 0]) * \\\n                np.sqrt(2*np.pi)*float(k[3 * i + 2])\n            y1 = stats.norm.pdf(\n                t, float(k[3 * i + 1]), float(k[3 * i + 2])) * float(k[3 * i + 0])\n            y = y+y1\n        return y\n\n    def run(self):\n        plt.rcParams['ytick.major.pad'] = 0\n        fig, ax = plt.subplots(figsize=self.figsize)\n        ksfit = pd.read_csv(self.ksfit, index_col=0)\n        t = np.arange(self.area[0], self.area[1], 0.0005)\n        col = [k for k in ksfit.columns if re.match('Unnamed:', k)]\n        for index, row in ksfit.iterrows():\n            ax.plot(t, self.Gaussian_distribution(\n                t, row[col].values), 
linestyle=row['linestyle'], color=row['color'],alpha=0.8, label=index, linewidth=row['linewidth'])\n            if self.shadow == True:\n                ax.fill_between(t, 0, self.Gaussian_distribution(t, row[col].values),  color=row['color'], alpha=0.15, interpolate=True, edgecolor=None, label=index,)\n        align = dict(family='Arial', verticalalignment=\"center\",\n                     horizontalalignment=\"center\")\n        ax.set_xlabel(self.xlabel, fontsize=self.labelfontsize,\n                      labelpad=20, **align)\n        ax.set_ylabel(self.ylabel, fontsize=self.labelfontsize,\n                      labelpad=20, **align)\n        ax.set_title(self.title, weight='bold',\n                     fontsize=self.labelfontsize, **align)\n        plt.tick_params(labelsize=10)\n        handles,labels = ax.get_legend_handles_labels()\n        df = pd.DataFrame({  'handles': handles, 'labels': labels})\n        df.drop_duplicates(subset='labels', keep='first', inplace=True)\n        handles, labels = df['handles'].tolist(), df['labels'].tolist()\n        if self.shadow == True:\n            plt.legend(handles=handles,labels=labels,loc='upper right', prop={\n                   'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})\n        else:\n            plt.legend(handles=handles,labels=labels,loc='upper right', prop={\n                   'family': 'Arial', 'style': 'italic', 'size': self.legendfontsize})\n        plt.gca().spines['top'].set_visible(False)\n        plt.gca().spines['right'].set_visible(False)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n        sys.exit(0)\n"
  },
  {
    "path": "wgdi/peaksfit.py",
    "content": "import re\nimport sys\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.optimize import curve_fit\nfrom scipy.stats import gaussian_kde, linregress\n\nimport wgdi.base as base\n\n\nclass peaksfit():\n    def __init__(self, options):\n        self.figsize = 10, 6.18\n        self.fontsize = 9\n        self.area = 0, 3\n        self.mode = 'median'\n        self.histogram_only = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(str(k), ' = ', v)\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n        self.area = [float(k) for k in self.area.split(',')]\n        self.bins_number = int(self.bins_number)\n        self.peaks = 1\n        self.histogram_only = base.str_to_bool(self.histogram_only)\n\n    def ks_values(self, df):\n        df.loc[df['ks'].str.startswith('_'),'ks']= df.loc[df['ks'].str.startswith('_'),'ks'].str[1:]\n        ks = df['ks'].str.split('_')\n        ks_total = []\n        ks_average = []\n        for v in ks.values:\n            ks_total.extend([float(k) for k in v])\n        ks_average = df['ks_average'].values\n        ks_median = df['ks_median'].values\n        return [ks_median, ks_average, ks_total]\n\n    def gaussian_fuc(self, x, *params):\n        y = np.zeros_like(x)\n        for i in range(0, len(params), 3):\n            amp = float(params[i])\n            ctr = float(params[i+1])\n            wid = float(params[i+2])\n            y = y + amp * np.exp(-((x - ctr)/wid)**2)\n        return y\n\n    def kde_fit(self, data, x):\n        kde = gaussian_kde(data)\n        kde.set_bandwidth(bw_method=kde.factor/3.)\n        p = kde(x)\n        guess = [1,1, 1]*self.peaks\n        popt, pcov = curve_fit(self.gaussian_fuc, x, p, guess, maxfev = 80000)\n        popt = [abs(k) for k in popt]\n        data = []\n        y = self.gaussian_fuc(x, *popt)\n        for i in range(0, len(popt), 3):\n            array = [popt[i], 
popt[i+1], popt[i+2]]\n            data.append(self.gaussian_fuc(x, *array))\n        slope, intercept, r_value, p_value, std_err = linregress(p, y)\n        print(\"\\nR-square: \"+str(r_value**2))\n        print(\"The gaussian fitting curve parameters are :\")\n        print('  |  '.join([str(k) for k in popt]))\n        return y, data\n\n    def run(self):\n        plt.rcParams['ytick.major.pad'] = 0\n        fig, ax = plt.subplots(figsize=self.figsize)\n        bkinfo = pd.read_csv(self.blockinfo)\n        ks_median, ks_average, ks_total = self.ks_values(bkinfo)\n        data = eval('ks_'+self.mode)\n        data = [k for k in data if self.area[0] <= k <= self.area[1]]\n        x = np.linspace(self.area[0], self.area[1], self.bins_number)\n        n, bins, patches = ax.hist(data, int(\n            self.bins_number), density=1, facecolor='blue', alpha=0.3, label='Histogram')\n        if self.histogram_only == True:\n            pass\n        else:\n            y, fit = self.kde_fit(data, x)\n            ax.plot(x, y, color='black', linestyle='-', label='Gaussian fitting')\n        ax.grid()\n        align = dict(family='Arial', verticalalignment=\"center\",\n                     horizontalalignment=\"center\")\n        ax.set_xlabel(r'${K_{s}}$', fontsize=20)\n        ax.set_ylabel('Frequency', fontsize=20)\n        ax.tick_params(labelsize=18)\n        ax.legend(fontsize=20)\n        ax.set_xlim(self.area)\n        plt.subplots_adjust(left=0.09, right=0.96, top=0.93, bottom=0.12)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n        sys.exit(0)\n"
  },
  {
    "path": "wgdi/pindex.py",
    "content": "import os\nimport sys\n\nimport numpy as np\nimport pandas as pd\nimport wgdi.base as base\n\n\nclass pindex():\n    def __init__(self, options):\n        self.remove_delta = True\n        self.position = 'order'\n        self.retention = 0.05\n        self.diff = 0.05\n        self.gap = 50\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(k, ' = ', v)\n        self.gap = int(self.gap)\n        self.retention = float(self.retention)\n        self.diff = float(self.diff)\n\n    def Pindex(self, sub1, sub2):\n        r1 = self.retain(sub1)\n        r2 = self.retain(sub2)\n        r = []\n        for i in range(len(r2)):\n            if(r1[i] < self.retention or r2[i] < self.retention):\n                r.append(0)\n                continue\n            d = (r1[i]-r2[i])/(r1[i]+r2[i])*0.5\n            if d > self.diff:\n                r.append(1)\n            elif -d > self.diff:\n                r.append(-1)\n            else:\n                r.append(0)\n        a, b, c = len([i for i in r if i == 1]), len(\n            [i for i in r if i == -1]), len([i for i in r if i == 0])\n        return [a, -b, c, len(r)]\n\n    def retain(self, arr):\n        a = []\n        for i in range(0, len(arr), 2*self.gap):\n            start, end = i-self.gap, i+self.gap\n            genenum, retainnum = 0, 0\n            for j in range(start, end):\n                if((j >= int(len(arr))) or (j < 0)):\n                    continue\n                else:\n                    retainnum += arr[j]\n                    genenum += 1\n            a.append(float(retainnum/genenum))\n        return a\n\n    def run(self):\n        alignment = pd.read_csv(self.alignment, header=None, index_col=0)\n        alignment.replace(r'\\w+', 1, regex=True, inplace=True)\n        alignment.replace('.', 0, inplace=True)\n        alignment.fillna(0, inplace=True)\n        gff = base.newgff(self.gff)\n        lens = base.newlens(self.lens, 
self.position)\n        gff = gff[gff['chr'].isin(lens.index)]\n        alignment = alignment.join(gff[['chr', self.position]], how='left')\n        alignment.dropna(axis=0, how='any', inplace=True)\n        p = self.cal_pindex(alignment)\n        print('Polyploidy-index: ', p)\n        sys.exit(0)\n\n    def cal_pindex(self, alignment):\n        data, df = [], []\n        columns = alignment.columns[:-2].tolist()\n        for i in range(len(columns)-1):\n            for j in range(i+1, len(columns)):\n                b = []\n                for chr, group in alignment.groupby('chr'):\n                    sub1 = group.loc[:, columns[i]].tolist()\n                    sub2 = group.loc[:, columns[j]].tolist()\n                    p = self.Pindex(sub1, sub2)\n                    b.append(p)\n                    df.append([i, j, chr]+p)\n                sub_diver = sum([abs(k[0]+k[1]) for k in b])\n                if self.remove_delta == True:\n                    sub_total = sum([abs(k[1])+abs(k[0]) for k in b])\n                    if sub_total == 0:\n                        c = 0\n                    else:\n                        c = sub_diver/sub_total\n                else:\n                    sub_total = sum([abs(k[1])+abs(k[0])+abs(k[2]) for k in b])\n                    c = sub_diver/sub_total\n                data.append(c)\n        df = pd.DataFrame(df, columns=[\n                          'sub1', 'sub2', 'chr', 'sub1_high', 'sub2_high', 'No_diff', 'Total'])\n        df['sub2_high'] = df['sub2_high'].abs()\n        self.infomation(df)\n        print('\\nPolyploidy-index between subgenomes are ', data)\n        return sum(data)/len(data)\n\n    def turn_percentage(self, x):\n        return '(%.2f%%)' % (x * 100)\n\n    def infomation(self, df):\n        data = []\n        for names, group in df.groupby(['sub1', 'sub2']):\n            newgroup = pd.concat([group.head(1), group],\n                                 axis=0, ignore_index=True)\n            cols = 
['sub1_high', 'sub2_high', 'No_diff', 'Total']\n            newgroup.loc[0, cols] = group.loc[:, cols].sum()\n            group1 = newgroup.copy()\n            group1[cols] = group1[cols].astype(str)\n            newgroup['sub1_high'] = (\n                newgroup['sub1_high'] / newgroup['Total']).apply(self.turn_percentage)\n            newgroup['sub2_high'] = (\n                newgroup['sub2_high'] / newgroup['Total']).apply(self.turn_percentage)\n            newgroup['No_diff'] = (\n                newgroup['No_diff'] / newgroup['Total']).apply(self.turn_percentage)\n            newgroup['Total'] = (\n                newgroup['Total'] / group['Total'].sum()).apply(self.turn_percentage)\n            newgroup[cols] = group1[cols]+newgroup[cols]\n            group_list = []\n            a = newgroup[['chr']+cols].columns.to_numpy()\n            a[0] = 'Chromosome'\n            a[1], a[2] = 'Sub_'+str(names[0]+1), 'Sub_'+str(names[1]+1)\n            group_list.append(a)\n            b = newgroup[['chr']+cols].to_numpy()\n            b[0][0] = 'Total'\n            for k in b:\n                group_list.append(k)\n            group_list = np.array(group_list).T\n            for k in group_list:\n                data.append(k)\n        data = pd.DataFrame(data)\n        data.to_csv(self.savefile, header=None, index=None)\n"
  },
  {
    "path": "wgdi/polyploidy_classification.py",
    "content": "import pandas as pd\nimport wgdi.base as base\n\n\nclass polyploidy_classification:\n    def __init__(self, options):\n        self.same_protochromosome = False\n        self.same_subgenome = False\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n        \n        self.same_protochromosome = base.str_to_bool(self.same_protochromosome)\n        self.same_subgenome = base.str_to_bool(self.same_subgenome)\n        \n        # Initialize classid with a default value if not provided\n        self.classid = [str(k) for k in getattr(self, 'classid', 'class1,class2').split(',')]\n\n    def run(self):\n        # Read input files\n        ancestor_left = base.read_classification(self.ancestor_left)\n        ancestor_top = base.read_classification(self.ancestor_top)\n        bkinfo = pd.read_csv(self.blockinfo)\n\n        # Ensure chr1 and chr2 are treated as strings\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n\n        # Filter rows where chr1 and chr2 match ancestor values\n        bkinfo = bkinfo[bkinfo['chr1'].isin(ancestor_left[0].values) & bkinfo['chr2'].isin(ancestor_top[0].values)]\n\n        # Initialize additional columns\n        bkinfo[self.classid[0]] = 0\n        bkinfo[self.classid[1]] = 0\n        bkinfo[self.classid[0] + '_color'] = ''\n        bkinfo[self.classid[1] + '_color'] = ''\n        bkinfo['diff'] = 0.0\n\n        # Processing the first classification (ancestor_left vs chr1)\n        for name, group in bkinfo.groupby('chr1'):\n            d1 = ancestor_left[ancestor_left[0] == name]\n            for index1, row1 in group.iterrows():\n                a, b = sorted([row1['start1'], row1['end1']])\n                a, b = int(a), int(b)\n                for index2, row2 in d1.iterrows():\n                    c, d = sorted([row2[1], row2[2]])\n                    h = len([k for k in range(a, b) if k in range(c, d)]) / (b - 
a)\n                    if h > bkinfo.loc[index1, 'diff']:\n                        bkinfo.loc[index1, 'diff'] = float(h)\n                        bkinfo.loc[index1, self.classid[0]] = row2[4]\n                        bkinfo.loc[index1, self.classid[0] + '_color'] = row2[3]\n\n        # Reset 'diff' and process the second classification (ancestor_top vs chr2)\n        bkinfo['diff'] = 0.0\n        for name, group in bkinfo.groupby('chr2'):\n            d2 = ancestor_top[ancestor_top[0] == name]\n            for index1, row1 in group.iterrows():\n                a, b = sorted([row1['start2'], row1['end2']])\n                a, b = int(a), int(b)\n                for index2, row2 in d2.iterrows():\n                    c, d = sorted([row2[1], row2[2]])\n                    h = len([k for k in range(a, b) if k in range(c, d)]) / (b - a)\n                    if h > bkinfo.loc[index1, 'diff']:\n                        bkinfo.loc[index1, 'diff'] = float(h)\n                        bkinfo.loc[index1, self.classid[1]] = row2[4]\n                        bkinfo.loc[index1, self.classid[1] + '_color'] = row2[3]\n\n        # Optionally keep only blocks whose protochromosome colors and/or subgenome classes match on both axes\n        if self.same_protochromosome == True:\n            bkinfo = bkinfo[bkinfo[self.classid[1] + '_color'] == bkinfo[self.classid[0] + '_color']]\n        if self.same_subgenome == True:\n            bkinfo = bkinfo[bkinfo[self.classid[1]] == bkinfo[self.classid[0]]]  \n\n        # Save the result to a CSV file\n        bkinfo.to_csv(self.savefile, index=False)\n"
  },
  {
    "path": "wgdi/retain.py",
    "content": "import matplotlib.pyplot as plt\nimport pandas as pd\nimport wgdi.base as base\n\nclass retain:\n    def __init__(self, options):\n        self.position = 'order'\n        \n        # Initialize the options by setting attributes dynamically\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{str(k)} = {v}\")\n\n        # Handle the ylim parameter, which defines the y-axis limits\n        self.ylim = [float(k) for k in self.ylim.split(',')] if hasattr(self, 'ylim') else [0, 1]\n        \n        # Handle the colors and figsize parameters\n        self.colors = [str(k) for k in self.colors.split(',')]\n        self.figsize = [float(k) for k in self.figsize.split(',')]\n\n    def run(self):\n        # Load GFF and lens data\n        gff = base.newgff(self.gff)\n        lens = base.newlens(self.lens, self.position)\n        \n        # Filter GFF data based on lens chromosome index\n        gff = gff[gff['chr'].isin(lens.index)]\n        \n        # Load alignment data and join with GFF\n        alignment = pd.read_csv(self.alignment, header=None, index_col=0)\n        alignment = alignment.join(gff[['chr', self.position]], how='left')\n        \n        # Perform alignment processing\n        self.retain = self.align_chr(alignment)\n        \n        # Save the processed data to a file\n        self.retain[self.retain.columns[:-2]].to_csv(self.savefile, sep='\\t', header=None)\n        \n        # Create a figure for plotting\n        fig, axs = plt.subplots(len(lens), 1, sharex=True, sharey=True, figsize=tuple(self.figsize))\n        fig.add_subplot(111, frameon=False)\n        \n        align = dict(family='DejaVu Sans', verticalalignment=\"center\", horizontalalignment=\"center\")\n\n        \n        # Hide all the spines and ticks on the plot\n        for spine in plt.gca().spines.values():\n            spine.set_visible(False)\n        plt.tick_params(top=False, bottom=False, left=False, right=False, 
labelleft=False, labelbottom=False)\n        \n        # Group the retain data by chromosome and plot each chromosome's data\n        groups = self.retain.groupby('chr')\n        for i, chr_name in enumerate(lens.index):\n            group = groups.get_group(chr_name)\n\n            if len(lens) == 1:\n                for j, col in enumerate(self.retain.columns[:-2]):\n                    axs.plot(group['order'].values, group[col].values,\n                                linestyle='-', color=self.colors[j], linewidth=1)\n                axs.spines['right'].set_visible(False)\n                axs.spines['top'].set_visible(False)\n                axs.set_ylim(self.ylim)\n                axs.tick_params(labelsize=12)                \n            else:\n                # Plot each column's data for the current chromosome\n                for j, col in enumerate(self.retain.columns[:-2]):\n                    axs[i].plot(group['order'].values, group[col].values,\n                                linestyle='-', color=self.colors[j], linewidth=1)\n            \n                # Hide the right and top spines for each subplot\n                axs[i].spines['right'].set_visible(False)\n                axs[i].spines['top'].set_visible(False)\n                axs[i].set_ylim(self.ylim)\n                axs[i].tick_params(labelsize=12)\n\n        for i, chr_name in enumerate(lens.index):\n            if len(lens) == 1:\n                x, y = axs.get_xlim()[1] * 0.90, axs.get_ylim()[1] * 0.8\n                axs.text(x, y, f\"{self.refgenome} {chr_name}\", fontsize=14, **align)\n            else:\n                # Add a label for the reference genome and chromosome\n                x, y = axs[i].get_xlim()[1] * 0.90, axs[i].get_ylim()[1] * 0.8\n                axs[i].text(x, y, f\"{self.refgenome} {chr_name}\", fontsize=14, **align)\n        \n        # Adjust layout and save the figure as an image\n        plt.ylabel(f\"{self.ylabel}\\n\\n\\n\\n\", fontsize=18, **align)\n     
   plt.subplots_adjust(left=0.1, right=0.95, top=0.95, bottom=0.05)\n        plt.savefig(self.savefig, dpi=500)\n        plt.show()\n\n    def align_chr(self, alignment):\n        \"\"\"\n        Perform the alignment processing for each chromosome by updating the values.\n        \"\"\"\n        for i in alignment.columns[:-2]:\n            # Update values: set '1' for valid values, '0' for invalid, and fill NaN with 0\n            alignment.loc[alignment[i].str.contains(r'\\w', na=False), i] = 1\n            alignment.loc[alignment[i] == '.', i] = 0\n            alignment.loc[alignment[i] == ' ', i] = 0\n            alignment[i] = alignment[i].astype('float64').fillna(0)\n            \n            # Apply the moving average function to each group by chromosome\n            for chr_name, group in alignment.groupby(['chr']):\n                a = self.moving_average(group[i].values.tolist())\n                alignment.loc[group.index, i] = a\n        return alignment\n\n    def moving_average(self, arr):\n        \"\"\"\n        Calculate a moving average over a specified window size.\n        This function smooths the input array using a sliding window.\n        \"\"\"\n        a = []\n        for i in range(len(arr)):\n            # Define the window range\n            start, end = max(0, i - int(self.step)), min(len(arr), i + int(self.step))\n            ave = sum(arr[start:end]) / (end - start)\n            a.append(ave)\n        return a\n"
  },
  {
    "path": "wgdi/run.py",
    "content": "import argparse\nimport os\nimport shutil\nimport sys\n\nimport wgdi\nimport wgdi.base as base\nfrom wgdi.align_dotplot import align_dotplot\nfrom wgdi.block_correspondence import block_correspondence\nfrom wgdi.block_info import block_info\nfrom wgdi.block_ks import block_ks\nfrom wgdi.circos import circos\nfrom wgdi.dotplot import dotplot\nfrom wgdi.karyotype import karyotype\nfrom wgdi.karyotype_mapping import karyotype_mapping\nfrom wgdi.ks import ks\nfrom wgdi.ks_peaks import kspeaks\nfrom wgdi.ksfigure import ksfigure\nfrom wgdi.peaksfit import peaksfit\nfrom wgdi.pindex import pindex\nfrom wgdi.polyploidy_classification import polyploidy_classification\nfrom wgdi.retain import retain\nfrom wgdi.run_colliearity import mycollinearity\nfrom wgdi.trees import trees\nfrom wgdi.ancestral_karyotype import ancestral_karyotype\nfrom wgdi.ancestral_karyotype_repertoire import ancestral_karyotype_repertoire\nfrom wgdi.shared_fusion import shared_fusion\nfrom wgdi.fusion_positions_database import fusion_positions_database\nfrom wgdi.fusions_detection import fusions_detection\n\n\n# Argument parser setup\nparser = argparse.ArgumentParser(\n    prog='wgdi', usage='%(prog)s [options]', epilog=\"\",\n    formatter_class=argparse.RawDescriptionHelpFormatter\n)\n\nparser.description = '''\\\nWGDI(Whole-Genome Duplication Integrated): A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes.\n\n    https://wgdi.readthedocs.io/en/latest/\n    -------------------------------------- \n'''\n\nparser.add_argument(\"-v\", \"--version\", action='version', version='0.75')\nparser.add_argument(\"-d\", dest=\"dotplot\", help=\"Show homologous gene dotplot\")\nparser.add_argument(\"-icl\", dest=\"improvedcollinearity\", help=\"Improved version of ColinearScan \")\nparser.add_argument(\"-ks\", dest=\"calks\", help=\"Calculate Ka/Ks for homologous gene pairs by YN00\")\nparser.add_argument(\"-bk\", dest=\"blockks\", help=\"Show 
 Ks of blocks in a dotplot\")\nparser.add_argument(\"-bi\", dest=\"blockinfo\", help=\"Infer whole-genome duplication from collinearity and Ks\")\nparser.add_argument(\"-c\", dest=\"correspondence\", help=\"Extract event-related genomic alignment\")\nparser.add_argument(\"-kp\", dest=\"kspeaks\", help=\"A simple way to get Ks peaks\")\nparser.add_argument(\"-kf\", dest=\"ksfigure\", help=\"A simple way to draw a Ks distribution map\")\nparser.add_argument(\"-pf\", dest=\"peaksfit\", help=\"Gaussian fitting of the Ks distribution\")\nparser.add_argument(\"-pc\", dest=\"polyploidy_classification\", help=\"Distinguish subgenomes of a polyploid\")\nparser.add_argument(\"-a\", dest=\"alignment\", help=\"Show event-related genomic alignment in a dotplot\")\nparser.add_argument(\"-k\", dest=\"karyotype\", help=\"Show genome evolution from reconstructed ancestors\")\nparser.add_argument(\"-ak\", dest=\"ancestral_karyotype\", help=\"Generate ancestral karyotypes from chromosomes that retain the same structure across genomes\")\nparser.add_argument(\"-akr\", dest=\"ancestral_karyotype_repertoire\", help=\"Incorporate genes from collinearity blocks into the ancestral karyotype repertoire\")\nparser.add_argument(\"-km\", dest=\"karyotype_mapping\", help=\"Map a known karyotype result to this species\")\nparser.add_argument(\"-fpd\", dest=\"fusion_positions_database\", help=\"Extract the fusion positions dataset\")\nparser.add_argument(\"-fd\", dest=\"fusions_detection\", help=\"Determine whether these fusion events occur in other genomes\")\nparser.add_argument(\"-sf\", dest=\"shared_fusion\", help=\"Quickly find shared fusions between species\")\nparser.add_argument(\"-at\", dest=\"alignmenttrees\", help=\"Construct phylogenetic trees from collinear genes\")\nparser.add_argument(\"-p\", dest=\"pindex\", help=\"Polyploidy index characterizing the degree of divergence between subgenomes of a polyploid\")\nparser.add_argument(\"-r\", dest=\"retain\", help=\"Show subgenomes in gene
retention or genome fractionation\")\nparser.add_argument(\"-ci\", dest=\"circos\", help=\"A simple way to run circos\")\nparser.add_argument(\"-conf\", dest=\"configure\", help=\"Display and modify the environment variable\")\n\nargs = parser.parse_args()\n\n# Function to run subprograms based on options\ndef run_subprogram(program, conf, name):\n    options = base.load_conf(conf, name)\n    r = program(options)\n    r.run()\n\n# Function to configure environment\ndef run_configure():\n    base.rewrite(args.configure, 'ini')\n\n# Main function to decide which module to run based on input arguments\ndef module_to_run(argument, conf):\n    switcher = {\n        'dotplot': (dotplot, conf, 'dotplot'),\n        'correspondence': (block_correspondence, conf, 'correspondence'),\n        'alignment': (align_dotplot, conf, 'alignment'),\n        'retain': (retain, conf, 'retain'),\n        'blockks': (block_ks, conf, 'blockks'),\n        'blockinfo': (block_info, conf, 'blockinfo'),\n        'calks': (ks, conf, 'ks'),\n        'circos': (circos, conf, 'circos'),\n        'kspeaks': (kspeaks, conf, 'kspeaks'),\n        'peaksfit': (peaksfit, conf, 'peaksfit'),\n        'ksfigure': (ksfigure, conf, 'ksfigure'),\n        'pindex': (pindex, conf, 'pindex'),\n        'alignmenttrees': (trees, conf, 'alignmenttrees'),\n        'improvedcollinearity': (mycollinearity, conf, 'collinearity'),\n        'configure': run_configure,\n        'polyploidy_classification': (polyploidy_classification, conf, 'polyploidy classification'),\n        'karyotype': (karyotype, conf, 'karyotype'),\n        'ancestral_karyotype': (ancestral_karyotype, conf, 'ancestral_karyotype'),\n        'karyotype_mapping': (karyotype_mapping, conf, 'karyotype_mapping'),\n        'ancestral_karyotype_repertoire': (ancestral_karyotype_repertoire, conf, 'ancestral_karyotype_repertoire'),\n        'shared_fusion': (shared_fusion, conf, 'shared_fusion'),\n        'fusion_positions_database': 
(fusion_positions_database, conf, 'fusion_positions_database'),\n        'fusions_detection': (fusions_detection, conf, 'fusions_detection'),\n    }\n    \n    if argument == 'configure':\n        run_configure()\n    else:\n        program, conf, name = switcher.get(argument)\n        if program:\n            run_subprogram(program, conf, name)\n\n\n# Main entry point\ndef main():\n    path = wgdi.__path__[0]\n    options = {\n        'dotplot': 'dotplot.conf',\n        'correspondence': 'corr.conf',\n        'alignment': 'align.conf',\n        'retain': 'retain.conf',\n        'blockks': 'blockks.conf',\n        'blockinfo': 'blockinfo.conf',\n        'calks': 'ks.conf',\n        'circos': 'circos.conf',\n        'kspeaks': 'kspeaks.conf',\n        'ksfigure': 'ksfigure.conf',\n        'pindex': 'pindex.conf',\n        'alignmenttrees': 'alignmenttrees.conf',\n        'peaksfit': 'peaksfit.conf',\n        'configure': 'conf.ini',\n        'improvedcollinearity': 'collinearity.conf',\n        'polyploidy_classification': 'polyploidy_classification.conf',\n        'karyotype': 'karyotype.conf',\n        'ancestral_karyotype': 'ancestral_karyotype.conf',\n        'ancestral_karyotype_repertoire': 'ancestral_karyotype_repertoire.conf',\n        'karyotype_mapping': 'karyotype_mapping.conf',\n        'shared_fusion': 'shared_fusion.conf',\n        'fusion_positions_database': 'fusion_positions_database.conf',\n        'fusions_detection': 'fusions_detection.conf',\n    }\n\n    for arg in vars(args):\n        value = getattr(args, arg)\n        if value is not None:\n            if value in ['?', 'help', 'example']:\n                with open(os.path.join(path, 'example', options[arg])) as f:\n                    print(f.read())\n                \n                if arg == 'ksfigure' and not os.path.exists('ks_fit_result.csv'):\n                    shutil.copy2(os.path.join(wgdi.__path__[0], 'example/ks_fit_result.csv'), os.getcwd())\n            elif not 
 os.path.exists(value):\n                print(f'{value} does not exist')\n                sys.exit(1)\n            else:\n                module_to_run(arg, value)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "wgdi/run_colliearity.py",
    "content": "import gc\nimport re\nimport sys\nfrom multiprocessing import Pool\n\nimport numpy as np\nimport pandas as pd\n\nimport wgdi.base as base\nimport wgdi.collinearity as improvedcollinearity\n\n\nclass mycollinearity():\n    def __init__(self, options):\n        # Initialize parameters with default values\n        self.repeat_number = 10\n        self.multiple = 1\n        self.score = 100\n        self.evalue = 1e-5\n        self.blast_reverse = False\n        self.over_gap  = 5\n        self.comparison = 'genomes'\n        self.options = options\n\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{str(k)} = {v}\")\n        self.position = 'order'\n        # Parse grading values\n        if hasattr(self, 'grading'):\n            self.grading = [int(k) for k in self.grading.split(',')]\n        else:\n            self.grading = [50, 40, 25]\n        # Ensure process is an integer\n        if hasattr(self, 'process'):\n            self.process = int(self.process)\n        else:\n            self.process = 4\n        self.over_gap  = int(self.over_gap )\n        base.str_to_bool(self.blast_reverse)\n\n    def deal_blast_for_chromosomes(self, blast, rednum, repeat_number):\n        bluenum = rednum\n        blast = blast.sort_values(by=[0, 11], ascending=[True, False])\n        def assign_grading(group):\n            group['cumcount'] = group.groupby(1).cumcount()\n            group = group[group['cumcount'] <= repeat_number]\n            group['grading'] = pd.cut(\n                group['cumcount'],\n                bins=[-1, 0, bluenum, repeat_number],\n                labels=self.grading,\n                right=True\n            )\n            return group\n        newblast = blast.groupby(['chr1', 'chr2']).apply(assign_grading).reset_index(drop=True)\n        newblast['grading'] = newblast['grading'].astype(int)\n        return newblast[newblast['grading'] > 0]\n    \n    def deal_blast_for_genomes(self, blast, 
rednum, repeat_number):\n        # Initialize the grading column\n        blast['grading'] = 0\n        \n        # Define the blue number as the sum of rednum and the predefined constant\n        bluenum = 4 + rednum\n        \n        # Get the indices for each group by sorting the 11th column in descending order\n        index = [group.sort_values(by=[11], ascending=[False])[:repeat_number].index.tolist()\n                for name, group in blast.groupby([0])]\n        \n        # Split the indices into red, blue, and gray groups\n        reddata = np.array([k[:rednum] for k in index], dtype=object)\n        bluedata = np.array([k[rednum:bluenum] for k in index], dtype=object)\n        graydata = np.array([k[bluenum:repeat_number] for k in index], dtype=object)\n        \n        # Concatenate the results into flat lists\n        redindex = np.concatenate(reddata) if reddata.size else []\n        blueindex = np.concatenate(bluedata) if bluedata.size else []\n        grayindex = np.concatenate(graydata) if graydata.size else []\n\n        # Update the grading column based on the group indices\n        blast.loc[redindex, 'grading'] = self.grading[0]\n        blast.loc[blueindex, 'grading'] = self.grading[1]\n        blast.loc[grayindex, 'grading'] = self.grading[2]\n\n        # Return only the rows with non-zero grading\n        return blast[blast['grading'] > 0]\n\n    def run(self):\n        # Read and process lens files\n        lens1 = base.newlens(self.lens1, 'order')\n        lens2 = base.newlens(self.lens2, 'order')\n        # Read and process gff files\n        gff1 = base.newgff(self.gff1)\n        gff2 = base.newgff(self.gff2)\n        # Filter gff data based on lens indices\n        gff1 = gff1[gff1['chr'].isin(lens1.index)]\n        gff2 = gff2[gff2['chr'].isin(lens2.index)]\n        # Process blast data\n\n        blast = base.newblast(self.blast, int(self.score), float(self.evalue),gff1, gff2, self.blast_reverse)\n\n        # Map positions and 
chromosome information\n        blast['loc1'] = blast[0].map(gff1[self.position])\n        blast['loc2'] = blast[1].map(gff2[self.position])\n        blast['chr1'] = blast[0].map(gff1['chr'])\n        blast['chr2'] = blast[1].map(gff2['chr'])\n        # Apply blast filtering and grading\n        if self.comparison.lower() == 'genomes':\n            blast = self.deal_blast_for_genomes(blast, int(self.multiple), int(self.repeat_number))\n        if self.comparison.lower() == 'chromosomes':\n            blast = self.deal_blast_for_chromosomes(blast, int(self.multiple), int(self.repeat_number))\n        print(f\"The filtered homologous gene pairs are {len(blast)}.\\n\")\n        if len(blast) < 1:\n            print(\"Stopped!\\n\\nIt may be that the id1 and id2 in the BLAST file do not match with (gff1, lens1) and (gff2, lens2).\")\n            sys.exit(1)\n        # Group blast data by 'chr1' and 'chr2'\n        total = []\n        for (chr1, chr2), group in blast.groupby(['chr1', 'chr2']):\n            total.append([chr1, chr2, group])\n        del blast, group\n        gc.collect()\n        # Determine chunk size for multiprocessing\n        n = int(np.ceil(len(total) / float(self.process)))\n        result, data = '', []\n        try:\n            # Initialize multiprocessing Pool\n            pool = Pool(self.process)\n            for i in range(0, len(total), n):\n                # Apply single_pool function asynchronously\n                data.append(pool.apply_async(\n                    self.single_pool, args=(total[i:i + n], gff1, gff2, lens1, lens2)\n                ))\n            pool.close()\n            pool.join()\n        except:\n            pool.terminate()\n        for k in data:\n            # Collect results from async tasks\n            text = k.get()\n            if text:\n                result += text\n        # Write final output to file\n        result = re.split('\\n', result)\n        fout = open(self.savefile, 'w')\n        num = 1\n     
   for line in result:\n            if re.match(r\"# Alignment\", line):\n                # Replace alignment number; split only on the first ':' so the rest of the header is kept intact\n                s = f'# Alignment {num}:'\n                fout.write(s + line.split(':', 1)[1] + '\\n')\n                num += 1\n                continue\n            if len(line) > 0:\n                fout.write(line + '\\n')\n        fout.close()\n        sys.exit(0)\n\n    def single_pool(self, group, gff1, gff2, lens1, lens2):\n        text = ''\n        for bk in group:\n            chr1, chr2 = str(bk[0]), str(bk[1])\n            print(f'Running {chr1} vs {chr2}')\n            # Extract and sort points\n            points = bk[2][['loc1', 'loc2', 'grading']].sort_values(\n                by=['loc1', 'loc2'], ascending=[True, True]\n            )\n            # Initialize collinearity analysis\n            collinearity = improvedcollinearity.collinearity(\n                self.options, points)\n            data = collinearity.run()\n            if not data:\n                continue\n            # Extract gene information\n            gf1 = gff1[gff1['chr'] == chr1].reset_index().set_index('order')[[1, 'strand']]\n            gf2 = gff2[gff2['chr'] == chr2].reset_index().set_index('order')[[1, 'strand']]\n            n = 1\n            for block, evalue, score in data:\n                if len(block) < self.over_gap:\n                    continue\n                # Map gene names and strands\n                block['name1'] = block['loc1'].map(gf1[1])\n                block['name2'] = block['loc2'].map(gf2[1])\n                block['strand1'] = block['loc1'].map(gf1['strand'])\n                block['strand2'] = block['loc2'].map(gf2['strand'])\n                block['strand'] = np.where(\n                    block['strand1'] == block['strand2'], '1', '-1'\n                )\n                # Prepare text output\n                block['text'] = block.apply(\n                    lambda x: f\"{x['name1']} {x['loc1']} {x['name2']} {x['loc2']}
{x['strand']}\\n\",\n                    axis=1\n                )\n                # Determine alignment mark\n                a, b = block['loc2'].head(2).values\n                mark = 'plus' if a < b else 'minus'\n                # Append alignment information\n                text += f'# Alignment {n}: score={score} pvalue={evalue} N={len(block)} {chr1}&{chr2} {mark}\\n'\n                text += ''.join(block['text'].values)\n                n += 1\n        return text"
  },
  {
    "path": "wgdi/shared_fusion.py",
    "content": "import pandas as pd\nimport wgdi.base as base\n\nclass shared_fusion:\n    def __init__(self, options):\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(f\"{k} = {v}\")\n        \n        # Handle classid and limit_length options\n        self.classid = [str(k) for k in self.classid.split(',')] if hasattr(self, 'classid') else ['class1', 'class2']\n        self.limit_length = int(self.limit_length) if hasattr(self, 'limit_length') else 20\n        \n        # Clean and split lens files\n        self.lens1 = self.lens1.replace(' ', '').split(',')\n        self.lens2 = self.lens2.replace(' ', '').split(',')\n\n    def run(self):\n        # Read classification files and block information\n        ancestor_left = base.read_classification(self.ancestor_left)\n        ancestor_top = base.read_classification(self.ancestor_top)\n        bkinfo = pd.read_csv(self.blockinfo)\n\n        # Preprocess blockinfo columns\n        bkinfo['chr1'] = bkinfo['chr1'].astype(str)\n        bkinfo['chr2'] = bkinfo['chr2'].astype(str)\n        bkinfo['start1'] = bkinfo['start1'].astype(int)\n        bkinfo['end1'] = bkinfo['end1'].astype(int)\n        bkinfo['start2'] = bkinfo['start2'].astype(int)\n        bkinfo['end2'] = bkinfo['end2'].astype(int)\n        \n        # Filter based on ancestor chromosomes\n        bkinfo = bkinfo[(bkinfo['chr1'].isin(ancestor_left[0].values)) & \n                        (bkinfo['chr2'].isin(ancestor_top[0].values))]\n\n        # Read lens files\n        lens1 = pd.read_csv(self.lens1[0], sep='\\t', header=None)\n        lens2 = pd.read_csv(self.lens2[0], sep='\\t', header=None)\n        lens1[0] = lens1[0].astype(str)\n        lens2[0] = lens2[0].astype(str)\n\n        # Perform block fusion analysis\n        blockinfoout = self.block_fusions(bkinfo, ancestor_left, ancestor_top)\n\n        # Apply filters based on breakpoints and length\n        blockinfoout = 
blockinfoout[(blockinfoout['breakpoints1'] == 1) & \n                                     (blockinfoout['breakpoints2'] == 1)]\n        blockinfoout = blockinfoout[(blockinfoout['break_length1'] >= self.limit_length) & \n                                     (blockinfoout['break_length2'] >= self.limit_length)]\n\n        # Save the filtered block info\n        blockinfoout.to_csv(self.filtered_blockinfo, index=False)\n\n        # Filter lens data based on the blockinfoout\n        lens1 = lens1[lens1[0].isin(blockinfoout['chr1'].values)]\n        lens2 = lens2[lens2[0].isin(blockinfoout['chr2'].values)]\n\n        # Save filtered lens data\n        lens1.to_csv(self.lens1[1], sep='\\t', index=False, header=False)\n        lens2.to_csv(self.lens2[1], sep='\\t', index=False, header=False)\n\n    def block_fusions(self, bkinfo, ancestor_left, ancestor_top):\n        # Initialize new columns in the bkinfo dataframe\n        bkinfo['breakpoints1'] = 0\n        bkinfo['breakpoints2'] = 0\n        bkinfo['break_length1'] = 0\n        bkinfo['break_length2'] = 0\n\n        for index, row in bkinfo.iterrows():\n            # Process species 1 (chr1)\n            a, b = sorted([row['start1'], row['end1']])\n            d1 = ancestor_left[(ancestor_left[0] == row['chr1']) & \n                               (ancestor_left[2] >= a) & (ancestor_left[1] <= b)]\n            if len(d1) > 1:\n                bkinfo.loc[index, 'breakpoints1'] = 1\n                breaklength_max = 0\n                for _, row2 in d1.iterrows():\n                    length_in = len([k for k in range(a, b) if k in range(row2[1], row2[2])])\n                    length_out = (b - a) - length_in\n                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)\n                bkinfo.loc[index, 'break_length1'] = breaklength_max\n\n            # Process species 2 (chr2)\n            c, d = sorted([row['start2'], row['end2']])\n            d2 = ancestor_top[(ancestor_top[0] == 
row['chr2']) & \n                              (ancestor_top[2] >= c) & (ancestor_top[1] <= d)]\n            if len(d2) > 1:\n                bkinfo.loc[index, 'breakpoints2'] = 1\n                breaklength_max = 0\n                for _, row2 in d2.iterrows():\n                    length_in = len([k for k in range(c, d) if k in range(row2[1], row2[2])])\n                    length_out = (d - c) - length_in\n                    breaklength_max = max(breaklength_max, min(length_in, length_out) + 1)\n                bkinfo.loc[index, 'break_length2'] = breaklength_max\n\n        return bkinfo\n"
  },
  {
    "path": "wgdi/trees.py",
    "content": "import os\nimport shutil\nfrom io import StringIO\n\nimport numpy as np\nimport pandas as pd\nfrom Bio import AlignIO, Seq, SeqIO, SeqRecord\nimport subprocess\n\nimport wgdi.base as base\n\n\nclass trees():\n    def __init__(self, options):\n        base_conf = base.config()\n        self.position = 'order'\n        self.alignfile = ''\n        self.align_trimming = ''\n        self.trimming = 'trimal'\n        self.threads = '1'\n        self.minimum = 4\n        self.tree_software = 'iqtree'\n        self.delete_detail = True\n        for k, v in base_conf:\n            setattr(self, str(k), v)\n        for k, v in options:\n            setattr(self, str(k), v)\n            print(str(k), ' = ', v)\n        if hasattr(self, 'codon_position'):\n            self.codon_position = [\n                int(k)-1 for k in self.codon_position.split(',')]\n        else:\n            self.codon_position = [0, 1, 2]\n        self.delete_detail = base.str_to_bool(self.delete_detail)\n\n    def grouping(self, alignment):\n        data = []\n        indexs = []\n        if not os.path.exists(self.dir):\n            os.makedirs(self.dir)\n        sequence = SeqIO.to_dict(SeqIO.parse(self.sequence_file, \"fasta\"))\n        if hasattr(self, 'cds_file'):\n            seq_cds = SeqIO.to_dict(SeqIO.parse(self.cds_file, \"fasta\"))\n        for index, row in alignment.iterrows():\n            file = base.gen_md5_id(str(row.values))\n            self.sequencefile = os.path.join(self.dir, file+'.fasta')\n            self.alignfile = os.path.join(self.dir, file+'.aln')\n            self.align_trimming = self.alignfile+'.trimming'\n            self.treefile = os.path.join(self.dir, file+'.aln.treefile')\n            if os.path.isfile(self.treefile) and os.path.isfile(self.alignfile):\n                data.append(self.treefile)\n                indexs.append(index)\n                continue\n            ids = []\n            ids_cds = []\n            for i in 
range(len(row)):\n                if type(row[i]) == float and np.isnan(row[i]):\n                    continue\n                gene_sequence = sequence[row[i]]\n                gene_sequence.id = str(int(i)+1)\n                gene_sequence.description = ''\n                ids.append(gene_sequence)\n            SeqIO.write(ids, self.sequencefile, \"fasta\")\n            self.align()\n            if hasattr(self, 'cds_file'):\n                self.seqcdsfile = os.path.join(self.dir, file+'.cds.fasta')\n                for i in range(len(row)):\n                    if type(row[i]) == float and np.isnan(row[i]):\n                        continue\n                    gene_cds = seq_cds[row[i]]\n                    gene_cds.id = str(int(i)+1)\n                    ids_cds.append(gene_cds)\n                SeqIO.write(ids_cds, self.seqcdsfile, \"fasta\")\n                self.pal2nal()\n                self.codon()\n            if self.trimming.upper() == 'TRIMAL':\n                self.trimal()\n            if self.trimming.upper() == 'DIVVIER':\n                self.divvier()\n            self.buildtrees()\n            if os.path.isfile(self.treefile):\n                data.append(self.treefile)\n        return data\n\n    def codon(self):\n        if self.codon_position == [0, 1, 2]:\n            shutil.move(self.alignfile+'.mrtrans', self.alignfile)\n            return True\n        records = list(SeqIO.parse(self.alignfile+'.mrtrans', 'fasta'))\n        if len(records) == 0:\n            return False\n        newrecords = []\n        def final_list(test_list, x, y): return [\n            test_list[i+j] for i in range(0, len(test_list), x) for j in y]\n        for k in records:\n            if len(k.seq) % 3 > 0:\n                return False\n            seq = final_list(k.seq, 3, self.codon_position)\n            k.seq = ''.join(seq)\n            newrecords.append(SeqRecord.SeqRecord(\n                Seq.Seq(k.seq), id=k.id, description=''))\n        
SeqIO.write(newrecords, self.alignfile, 'fasta')\n        return True\n\n    def pal2nal(self):\n        args = ['perl', self.pal2nal_path, self.alignfile,\n                self.seqcdsfile, '-output fasta', '>'+self.alignfile+'.mrtrans']\n        command = ' '.join(args)\n        try:\n            os.system(command)\n        except:\n            return False\n        return True\n\n    def align(self):\n        if self.align_software == 'mafft':\n            try:\n                command = [self.mafft_path,'--quiet', self.sequencefile, '>', self.alignfile]\n                subprocess.run(\" \".join(command), shell=True, check=True)\n            except subprocess.CalledProcessError as e:\n                print(f\"Error while running MAFFT: {e}\")\n\n        if self.align_software == 'muscle':\n            try:\n                command = [self.muscle_path,'-align', self.sequencefile, '-output', self.alignfile, '-quiet']\n                subprocess.run(\" \".join(command), shell=True, check=True)\n            except subprocess.CalledProcessError as e:\n                print(f\"Error while running Muscle: {e}\")\n\n    def trimal(self):\n        args = [self.trimal_path, '-in', self.alignfile,\n                '-out', self.align_trimming, '-automated1']\n        command = ' '.join(args)\n        try:\n            os.system(command)\n        except:\n            return False\n        return True\n\n    def divvier(self):\n        args = [self.divvier_path, '-mincol', '4', '-divvygap', self.alignfile]\n        command = ' '.join(args)\n        try:\n            os.system(command)\n            os.rename(self.alignfile+'.divvy.fas', self.align_trimming)\n        except:\n            return False\n        return True\n\n    def buildtrees(self):\n        try:\n            if self.tree_software.upper() == 'IQTREE':\n                args = [self.iqtree_path, '-s', self.align_trimming,\n                        '-m', self.model, '-T', self.threads, '--quiet']\n                
command = ' '.join(args)\n                os.system(command)\n                os.rename(self.align_trimming+'.treefile', self.treefile)\n            elif self.tree_software.upper() == 'FASTTREE':\n                args = [self.fasttree_path,\n                        self.align_trimming, '>', self.treefile]\n                command = ' '.join(args)\n                os.system(command)\n        except:\n            return False\n        if self.delete_detail == True:\n            for file in (self.sequencefile, self.align_trimming+'.bionj', self.align_trimming+'.iqtree', self.align_trimming+'.ckp.gz',\n                         self.align_trimming+'.log', self.align_trimming+'.mldist', self.align_trimming+'.model.gz'):\n                try:\n                    os.remove(file)\n                except OSError:\n                    pass\n        return True\n\n    def run(self):\n        alignment = pd.read_csv(self.alignment, header=None)\n        alignment.replace('.', np.nan, inplace=True)\n        alignment.dropna(thresh=int(self.minimum), inplace=True)\n        if hasattr(self, 'gff') and hasattr(self, 'lens'):\n            gff = base.newgff(self.gff)\n            lens = base.newlens(self.lens, self.position)\n            alignment = pd.merge(\n                alignment, gff[['chr', self.position]], left_on=0, right_on=gff.index, how='left')\n            alignment.dropna(subset=['chr', 'order'], inplace=True)\n            alignment['order'] = alignment['order'].astype(int)\n            alignment = alignment[alignment['chr'].isin(lens.index)]\n            alignment.drop(alignment.columns[-2:], axis=1, inplace=True)\n        data = self.grouping(alignment)\n        fout = open(self.trees_file, 'w')\n        fout.close()\n        for i in range(0, len(data), 100):\n            trees = ' '.join([str(k) for k in data[i:i+100]])\n            args = ['cat', trees, '>>', self.trees_file]\n            command = ' '.join([str(k) for k in args])\n            
os.system(command)\n        df = pd.read_csv(self.trees_file, header=None, sep='\\t')\n        df[0].to_csv(self.trees_file, index=None, sep='\\t', header=False)\n        print(\"done\")"
  },
  {
    "path": "wgdi.egg-info/PKG-INFO",
    "content": "Metadata-Version: 2.1\nName: wgdi\nVersion: 0.75\nSummary: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes\nHome-page: https://github.com/SunPengChuan/wgdi\nAuthor: Pengchuan Sun\nAuthor-email: sunpengchuan@gmail.com\nLicense: BSD License\nClassifier: Intended Audience :: Science/Research\nClassifier: Programming Language :: Python :: 3\nClassifier: License :: OSI Approved :: BSD License\nClassifier: Operating System :: OS Independent\nDescription-Content-Type: text/markdown\nLicense-File: LICENSE\nRequires-Dist: pandas>=1.1.0\nRequires-Dist: numpy\nRequires-Dist: biopython\nRequires-Dist: matplotlib\nRequires-Dist: scipy\nRequires-Dist: tabulate\n\n# WGDI\n\n![Latest PyPI version](https://img.shields.io/pypi/v/wgdi.svg) [![Downloads](https://pepy.tech/badge/wgdi/month)](https://pepy.tech/project/wgdi) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/wgdi/README.html)\n\n| | |\n| --- | --- |\n| Author  | Pengchuan Sun ([sunpengchuan](https://github.com/sunpengchuan)) |\n| Email   | <sunpengchuan@gmail.com> |\n| License | [BSD](http://creativecommons.org/licenses/BSD/) |\n\n## Description\n\n**WGDI (Whole-Genome Duplication Integrated analysis)** is a Python-based command-line tool designed to simplify the analysis of whole-genome duplications (WGD) and cross-species genome alignments. It offers three main workflows that enhance the detection and study of WGD events:\n\n## Key Features\n\n### 1. Polyploid Inference\n- Identifies and confirms polyploid events with high accuracy.\n\n### 2. Genomic Homology Inference\n- Traces the evolutionary history of duplicated regions across species, with a focus on distinguishing subgenomes. \n\n### 3. Ancestral Karyotyping\n- Reconstructs protochromosomes and traces common chromosomal rearrangements to understand chromosome evolution. 
\n\n\n## Installation\n\nPython package and command line interface (IDLE) for the analysis of whole genome duplications (WGDI). WGDI can be deployed in Windows, Linux, and Mac OS operating systems and can be installed via pip and conda.\n\n#### Bioconda\n\n```\nconda install -c bioconda  wgdi\n```\n\n#### Pypi\n\n```\npip3 install wgdi\n```\n\nDocumentation for installation along with a user tutorial, a default parameter file, and test data are provided. Please consult the docs at <http://wgdi.readthedocs.io/en/latest/>.\n\n## Tips\n\nHere are some videos with simple examples of WGDI.\n\n###### [Simple usage of WGDI (Part 1)](https://www.bilibili.com/video/BV1qK4y1U7eK) or https://youtu.be/k-S6FVcBIQw\n\n###### [Simple usage of WGDI (Part 2)](https://www.bilibili.com/video/BV195411P7L1) or https://youtu.be/QiZYFYGclyE\n\nQQ chat group: 966612552\n\n## Citing WGDI\n\nIf you use wgdi in your work, please cite:\n\n> Sun P., Jiao B., Yang Y., Shan L., Li T., Li X., Xi Z., Wang X., and Liu J. (2022). WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant. 
doi: https://doi.org/10.1016/j.molp.2022.10.018.\n\n## News\n\n## 0.75\n* Fixed some issues (-fpd).\n* Introduced a threads parameter for the iqtree command within alignmenttrees (-at).\n\n## 0.74\n* Improved the fusion positions dataset (-fpd).\n* Fixed some issues (-pc).\n\n## 0.7.1\n* Added extraction of the fusion positions dataset (-fpd).\n* Added detection of whether these fusion events occur in other genomes (-fd).\n* Improved the karyotype_mapping (-km) effect.\n* Fixed the problem caused by the Python version; it is now compatible with version 3.12.\n\n\n## 0.6.5\n* Fixed some issues (-sf).\n* Added new tips to avoid some errors.\n\n## 0.6.4\n* Fixed the problem caused by the Python version; it is now compatible with version 3.11.3.\n\n## 0.6.3\n* Fixed some issues (-ks, -sf).\n\n## 0.6.2\n* Added finding of shared fusions between species (-sf).\n\n## 0.6.1\n\n* Fixed issue with alignment (-a). Only version 0.6.0 has this bug.\n\n## 0.6.0\n\n* Fixed issue with improved collinearity (-icl).\n* Added a parameter 'tandem_ratio' to blockinfo (-bi).\n\n## 0.5.9\n\n* Updated the improved collinearity (-icl). Faster than before, but slower than MCScanX and JCVI.\n* Fixed issue with ancestral karyotype repertoire (-akr).\n\n## 0.5.8\n\n* Fixed issue with gene names (-ks).\n\n## 0.5.7\n* Fixed issue with chromosome order (-ak).\n* Fixed issue with gene names (-ks). The issue is not fully fixed in this version; please install the latest version.\n\n## 0.5.5 and 0.5.6\n* Added ancestral karyotype (-ak).\n* Added ancestral karyotype repertoire (-akr).\n\n## 0.5.4\n* Improved the karyotype_mapping (-km) effect.\n* Minor changes (-at).\n\n## 0.5.3\n* Fixed legend issue (-kf).\n* Fixed Ks calculation issue (-ks).\n* Improved the karyotype_mapping (-km) effect.\n* Improved the alignmenttrees (-at) effect.\n\n## 0.5.2\n* Fixed some bugs.\n\n## 0.5.1\n* Fixed the error of the command (-conf).\n* Improved the karyotype_mapping (-km) effect.\n* Added the available data set of alignmenttree (-at). 
Low copy data set (for example, single-copy_groups.tsv of sonicparanoid2 software).\n\n## 0.4.9\n* The latest version adds karyotype_mapping (-km) and karyotype (-k) display.\n* The latest version changes how the p-value is extracted from collinearity (-icl), making this parameter more sensitive. Therefore, it is recommended to set it to 0.2 instead of 0.05.\n* The latest version has also changed the drawing display of ksfigure (-kf) to improve its appearance.\n"
  },
  {
    "path": "wgdi.egg-info/SOURCES.txt",
    "content": "LICENSE\nREADME.md\nsetup.py\nwgdi/__init__.py\nwgdi/align_dotplot.py\nwgdi/ancestral_karyotype.py\nwgdi/ancestral_karyotype_repertoire.py\nwgdi/base.py\nwgdi/block_correspondence.py\nwgdi/block_info.py\nwgdi/block_ks.py\nwgdi/circos.py\nwgdi/collinearity.py\nwgdi/dotplot.py\nwgdi/fusion_positions_database.py\nwgdi/fusions_detection.py\nwgdi/karyotype.py\nwgdi/karyotype_mapping.py\nwgdi/ks.py\nwgdi/ks_peaks.py\nwgdi/ksfigure.py\nwgdi/peaksfit.py\nwgdi/pindex.py\nwgdi/polyploidy_classification.py\nwgdi/retain.py\nwgdi/run.py\nwgdi/run_colliearity.py\nwgdi/shared_fusion.py\nwgdi/trees.py\nwgdi.egg-info/PKG-INFO\nwgdi.egg-info/SOURCES.txt\nwgdi.egg-info/dependency_links.txt\nwgdi.egg-info/entry_points.txt\nwgdi.egg-info/requires.txt\nwgdi.egg-info/top_level.txt\nwgdi.egg-info/zip-safe\nwgdi/example/__init__.py\nwgdi/example/align.conf\nwgdi/example/alignmenttrees.conf\nwgdi/example/ancestral_karyotype.conf\nwgdi/example/ancestral_karyotype_repertoire.conf\nwgdi/example/blockinfo.conf\nwgdi/example/blockks.conf\nwgdi/example/circos.conf\nwgdi/example/collinearity.conf\nwgdi/example/conf.ini\nwgdi/example/corr.conf\nwgdi/example/dotplot.conf\nwgdi/example/fusion_positions_database.conf\nwgdi/example/fusions_detection.conf\nwgdi/example/karyotype.conf\nwgdi/example/karyotype_mapping.conf\nwgdi/example/ks.conf\nwgdi/example/ks_fit_result.csv\nwgdi/example/ksfigure.conf\nwgdi/example/kspeaks.conf\nwgdi/example/peaksfit.conf\nwgdi/example/pindex.conf\nwgdi/example/polyploidy_classification.conf\nwgdi/example/retain.conf\nwgdi/example/shared_fusion.conf"
  },
  {
    "path": "wgdi.egg-info/dependency_links.txt",
    "content": "\n"
  },
  {
    "path": "wgdi.egg-info/entry_points.txt",
    "content": "[console_scripts]\nwgdi = wgdi.run:main\n"
  },
  {
    "path": "wgdi.egg-info/requires.txt",
    "content": "pandas>=1.1.0\nnumpy\nbiopython\nmatplotlib\nscipy\ntabulate\n"
  },
  {
    "path": "wgdi.egg-info/top_level.txt",
    "content": "wgdi\n"
  },
  {
    "path": "wgdi.egg-info/zip-safe",
    "content": "\n"
  }
]