[
  {
    "path": ".github/workflows/python-package.yml",
    "content": "# This workflow will install Python dependencies, run tests and lint with a variety of Python versions\n# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions\n\nname: Python package\n\non:\n  push:\n    branches: [ master ]\n  pull_request:\n    branches: [ master ]\n\njobs:\n  build:\n\n    runs-on: ubuntu-22.04\n    strategy:\n      matrix:\n        python-version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\", \"3.13\"]\n\n    steps:\n    - uses: actions/checkout@v2\n    - name: Set up Python ${{ matrix.python-version }}\n      uses: actions/setup-python@v2\n      with:\n        python-version: ${{ matrix.python-version }}\n    - name: Install dependencies\n      run: |\n        python -m pip install --upgrade pip poetry==1.8.5\n        poetry install\n    - name: Lint\n      run: |\n        make lint\n    - name: Test\n      run: |\n        make test\n"
  },
  {
    "path": ".gitignore",
    "content": ".coverage\n.vscode/\n.mypy_cache/\n.python-version\n__pycache__/\ndist\nhtmlcov/\nncbitax2lin.egg-info/\nbuild\n"
  },
  {
    "path": "CHANGELOG.md",
    "content": "## Change Log\n\n### v3.0.0 (2025/09/23)\n\n- Fixed https://github.com/zyxue/ncbitax2lin/issues/31\n- Upgraded dependencies and changed the supported Python versions to 3.9 through 3.13.\n\n### v2.3.0 (2022/03/20)\n\n- Supports Python-3.9\n\n### v2.2.0 (2022/03/20)\n\n- Fixed bug related to sharing global variables among multiple processes. (#14, #15)\n\n### v2.0.2 (2020/05/02)\n\n- Made pylint and mypy pass.\n\n### v2.0.1 (2020/05/02)\n\n- Adopted [poetry](https://python-poetry.org/) for package management.\n- Modernized the code (Python-3.7, typing, and some tests).\n\n### v1.1 (2017/03/17)\n\n- Stopped hosting the converted lineages.csv.gz in the repo.\n- Converted lineages will be versioned and hosted elsewhere.\n\n### v1.0 (2016/04/24)\n\n- Organized the code into a release.\n"
  },
  {
    "path": "LICENSE.txt",
    "content": "MIT License\n\nCopyright (c) 2017 zyxue\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "Makefile",
    "content": "SRC_DIR=ncbitax2lin\nTESTS_DIR=tests\n\n# https://www.gnu.org/software/make/manual/html_node/Force-Targets.html\nFORCE:\n\nformat: FORCE\n\tpoetry run autoflake --recursive --in-place --remove-all-unused-imports $(SRC_DIR) $(TESTS_DIR) \\\n\t&& poetry run black $(SRC_DIR) $(TESTS_DIR) \\\n\t&& poetry run isort $(SRC_DIR) $(TESTS_DIR)\n\nblack: FORCE\n\tpoetry run black --check $(SRC_DIR) $(TESTS_DIR)\n\nisort: FORCE\n\tpoetry run isort --check $(SRC_DIR) $(TESTS_DIR)\n\nmypy: FORCE\n\tpoetry run mypy $(SRC_DIR) $(TESTS_DIR)\n\npylint: FORCE\n\tpoetry run pylint $(SRC_DIR) $(TESTS_DIR)\n\n# PYTHONHASHSEED prefixes the command so that it reaches the test process.\ntest: FORCE\n\tPYTHONHASHSEED=1 poetry run coverage run --source=$(SRC_DIR) --module pytest --durations=10 --failed-first $(1) \\\n\t&& poetry run coverage report --show-missing \\\n\t&& poetry run coverage html\n\nlint: black isort mypy pylint\n\nall: lint test\n"
  },
  {
    "path": "README.md",
    "content": "# NCBItax2lin\n\n[![Downloads](https://pepy.tech/badge/ncbitax2lin/week)](https://pepy.tech/project/ncbitax2lin)\n\nConvert NCBI taxonomy dump into lineages. An example for [human\n(tax_id=9606)](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606)\nis like\n\n| tax_id | superkingdom | phylum   | class    | order    | family    | genus | species      | family1 | forma | genus1 | infraclass | infraorder  | kingdom | no rank            | no rank1     | no rank10            | no rank11 | no rank12 | no rank13 | no rank14 | no rank15     | no rank16 | no rank17 | no rank18 | no rank19 | no rank2  | no rank20 | no rank21 | no rank22 | no rank3  | no rank4      | no rank5   | no rank6      | no rank7   | no rank8     | no rank9      | parvorder  | species group | species subgroup | species1 | subclass | subfamily | subgenus | subkingdom | suborder    | subphylum | subspecies | subtribe | superclass | superfamily | superorder       | superorder1 | superphylum | tribe | varietas |\n|--------|--------------|----------|----------|----------|-----------|-------|--------------|---------|-------|--------|------------|-------------|---------|--------------------|--------------|----------------------|-----------|-----------|-----------|-----------|---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|---------------|------------|---------------|------------|--------------|---------------|------------|---------------|------------------|----------|----------|-----------|----------|------------|-------------|-----------|------------|----------|------------|-------------|------------------|-------------|-------------|-------|----------|\n| 9606   | Eukaryota    | Chordata | Mammalia | Primates | Hominidae | Homo  | Homo sapiens |         |       |        |            | Simiiformes | Metazoa | cellular organisms | Opisthokonta | Dipnotetrapodomorpha | Tetrapoda | Amniota   | Theria    | 
Eutheria  | Boreoeutheria |           |           |           |           | Eumetazoa |           |           |           | Bilateria | Deuterostomia | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Catarrhini |               |                  |          |          | Homininae |          |            | Haplorrhini | Craniata  |            |          |            | Hominoidea  | Euarchontoglires |             |             |       |          |\n\n### Install\n\nncbitax2lin supports Python 3.9 to 3.13.\n\n```\npip install -U ncbitax2lin\n```\n\nIt is also available in Conda on the Bioconda channel:\n\n```\nconda install bioconda::ncbitax2lin\n```\n\n### Generate lineages\n\nFirst, download the taxonomy dump from NCBI:\n\n```bash\nwget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz\nmkdir -p taxdump && tar zxf taxdump.tar.gz -C ./taxdump\n```\n\nThen, run ncbitax2lin:\n\n```bash\nncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp\n```\n\nBy default, the generated lineages will be saved to\n`ncbi_lineages_[date_of_utcnow].csv.gz`. The output path can be changed with the\n`--output` option.\n\n\n## FAQ\n\n**Q**: I have a large number of sequences with their corresponding accession\nnumbers from NCBI, how do I get their lineages?\n\n**A**: First, you need to map accession numbers (GI is deprecated) to tax IDs\nbased on `nucl_*accession2taxid.gz` files from\nftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/. Secondly, you can trace a\nsequence's whole lineage based on its tax ID. The tax-id-to-lineage mapping is\nwhat NCBItax2lin can generate for you.\n\nIf you have any questions about this project, please feel free to create a new\n[issue](https://github.com/zyxue/ncbitax2lin/issues/new).\n\n## Note on `taxdump.tar.gz.md5`\n\nIt appears that NCBI periodically regenerates `taxdump.tar.gz` and\n`taxdump.tar.gz.md5` even when its content is still the same. 
I am not sure how\ntheir regeneration works, but `taxdump.tar.gz.md5` will differ simply because\nof a different timestamp.\n\n## Used in\n\n* Mahmoudabadi, G., & Phillips, R. (2018). A comprehensive and quantitative exploration of thousands of viral genomes. ELife, 7. https://doi.org/10.7554/eLife.31955\n* Dombrowski, N. et al. (2020) Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution, Nature Communications. Springer US, 11(1). doi: 10.1038/s41467-020-17408-w. https://www.nature.com/articles/s41467-020-17408-w\n* Schenberger Santos, A. R. et al. (2020) NAD+ biosynthesis in bacteria is controlled by global carbon/ nitrogen levels via PII signaling, Journal of Biological Chemistry, 295(18), pp. 6165–6176. doi: 10.1074/jbc.RA120.012793. https://www.sciencedirect.com/science/article/pii/S0021925817482433\n* Villada, J. C., Duran, M. F. and Lee, P. K. H. (2020) Interplay between Position-Dependent Codon Usage Bias and Hydrogen Bonding at the 5' End of ORFeomes, mSystems, 5(4), pp. 1–18. doi: 10.1128/msystems.00613-20. https://msystems.asm.org/content/5/4/e00613-20\n* Byadgi, O. et al. (2020) Transcriptome analysis of amyloodinium ocellatum tomonts revealed basic information on the major potential virulence factors, Genes, 11(11), pp. 1–12. doi: 10.3390/genes11111252. https://www.mdpi.com/2073-4425/11/11/1252\n* Cumbo, F., & Blankenberg, D. (2025). Characterization of microbial dark matter at scale with MetaSBT and taxonomy-aware Sequence Bloom Trees. bioRxiv. https://doi.org/10.1101/2025.08.25.672238\n\n## Development\n\n### Install dependencies\n\n```\npoetry install --sync\n```\n\n### Testing\n\n```\nmake format\nmake all\n```\n\n### Publish (only for administrator)\n\n```\npoetry version [minor/major etc.]\ngit tag vx.y.z\ngit push origin vx.y.z\npoetry publish --build -u __token__ --password pypi-<token-from-pypi>\n```\nUpdate [CHANGELOG.md](/CHANGELOG.md).\n"
  },
  {
    "path": "mypy.ini",
    "content": "[mypy]\npython_version = 3.9\ndisallow_untyped_defs = True\nignore_missing_imports = True\nshow_column_numbers = True"
  },
  {
    "path": "ncbitax2lin/__init__.py",
    "content": "\"\"\"__init__.py for this project\"\"\"\n\n__version__ = \"2.4.1\"\n"
  },
  {
    "path": "ncbitax2lin/data_io.py",
    "content": "\"\"\"utility functions related to IO\"\"\"\n\nimport pandas as pd\n\nfrom ncbitax2lin import utils\n\n\ndef strip(str_: str) -> str:\n    \"\"\"\n    :param str_: a string\n    \"\"\"\n    return str_.strip()\n\n\n@utils.timeit\ndef load_nodes(nodes_file: str) -> pd.DataFrame:\n    \"\"\"\n    load nodes.dmp and convert it into a pandas.DataFrame\n    \"\"\"\n    df_data = pd.read_csv(\n        nodes_file,\n        sep=\"|\",\n        header=None,\n        index_col=False,\n        names=[\n            \"tax_id\",\n            \"parent_tax_id\",\n            \"rank\",\n            \"embl_code\",\n            \"division_id\",\n            \"inherited_div_flag\",\n            \"genetic_code_id\",\n            \"inherited_GC__flag\",\n            \"mitochondrial_genetic_code_id\",\n            \"inherited_MGC_flag\",\n            \"GenBank_hidden_flag\",\n            \"hidden_subtree_root_flag\",\n            \"comments\",\n        ],\n    )\n\n    return df_data.assign(\n        rank=lambda df: df[\"rank\"].apply(strip),\n        embl_code=lambda df: df[\"embl_code\"].apply(strip),\n        comments=lambda df: df[\"comments\"].apply(strip),\n    )\n\n\n@utils.timeit\ndef load_names(names_file: str) -> pd.DataFrame:\n    \"\"\"\n    load names.dmp and convert it into a pandas.DataFrame\n    \"\"\"\n    df_data = pd.read_csv(\n        names_file,\n        sep=\"|\",\n        header=None,\n        index_col=False,\n        names=[\"tax_id\", \"name_txt\", \"unique_name\", \"name_class\"],\n    )\n\n    return (\n        df_data.assign(\n            name_txt=lambda df: df[\"name_txt\"].apply(strip),\n            unique_name=lambda df: df[\"unique_name\"].apply(strip),\n            name_class=lambda df: df[\"name_class\"].apply(strip),\n        )\n        .loc[lambda df: df[\"name_class\"] == \"scientific name\"]\n        .reset_index(drop=True)\n    )\n\n\ndef read_names_and_nodes(names_file: str, nodes_file: str) -> pd.DataFrame:\n    \"\"\"Reads in 
data from names and nodes files\"\"\"\n    # data downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/\n    # args = parse_args()\n    nodes_df = load_nodes(nodes_file)\n    names_df = load_names(names_file)\n\n    return (\n        nodes_df.merge(names_df, on=\"tax_id\")[\n            [\"tax_id\", \"parent_tax_id\", \"rank\", \"name_txt\"]\n        ]\n        .rename(columns={\"name_txt\": \"rank_name\"})\n        .reset_index(drop=True)\n    )\n\n\ndef write_lineages_to_disk(df_lineages: pd.DataFrame, output_path: str) -> None:\n    \"\"\"Gzip lineages and write them to disk\"\"\"\n    # superkingdom has been renamed to domain in\n    # https://ncbiinsights.ncbi.nlm.nih.gov/2024/06/04/changes-ncbi-taxonomy-classifications/\n    domain_col = \"domain\"\n\n    # For backwards compatibility with older taxdumps.\n    if \"superkingdom\" in df_lineages:\n        domain_col = \"superkingdom\"\n\n    cols = [\n        \"tax_id\",\n        domain_col,\n        \"phylum\",\n        \"class\",\n        \"order\",\n        \"family\",\n        \"genus\",\n        \"species\",\n    ]\n    other_cols = sorted([col for col in df_lineages.columns if col not in cols])\n    output_cols = cols + other_cols\n\n    df_lineages.to_csv(\n        output_path, index=False, compression=\"gzip\", columns=output_cols\n    )\n"
  },
  {
    "path": "ncbitax2lin/fmt.py",
    "content": "\"\"\"Utilities for preparing the lineages for output.\"\"\"\n\nimport concurrent.futures\nfrom typing import Container, Dict, List, Union\n\nimport pandas as pd\n\nfrom ncbitax2lin.struct import Lineage\n\n\ndef _calc_rank_key(rank: str, existing_ranks: Container[str]) -> str:\n    \"\"\"Calculates a key for the lineage representation in a dictionary.\n\n    Defaults to the rank itself, e.g. no rank, superkingdom, phylum, etc. but\n    when a rank appears multiple times (common for \"no rank\" rank) in a single\n    lineage it will be numbered, e.g. no rank1, no rank2, and so on.\n\n    Args:\n        rank: e.g. no rank, superkingdom, phylum, etc.\n        existing_ranks: rank keys already existing\n    \"\"\"\n    # e.g. there could be multiple 'no rank'\n    if rank not in existing_ranks:\n        return rank\n\n    count = 1\n    numbered_rank = f\"{rank}{count}\"\n    while numbered_rank in existing_ranks:\n        count += 1\n        numbered_rank = f\"{rank}{count}\"\n    return numbered_rank\n\n\ndef _convert_lineage_to_dict(lineage: Lineage) -> Dict[str, Union[int, str]]:\n    \"\"\"Converts the lineage in a list-of-tuples representation to a dictionary representation\n\n    [\n        (\"tax_id1\", \"rank1\", \"name_txt1\"),\n        (\"tax_id2\", \"rank2\", \"name_txt2\"),\n        ...\n    ]\n\n    becomes\n\n    {\n        \"rank1\": \"name_txt1\",\n        \"rank2\": \"name_txt2\",\n        \"tax_id\": \"tax_id2\",   # using the last rank as the tax_id of this lineage\n    }\n\n    A concrete example:\n\n        [\n            (131567, 'no rank', 'cellular organisms'),\n            (2, 'superkingdom', 'Bacteria')\n        ]\n\n    becomes\n\n        {\n            'no rank': 'cellular organisms',\n            'superkingdom': 'Bacteria',\n            'tax_id': 2,\n        }\n\n    \"\"\"\n    output: Dict[str, Union[int, str]] = {}\n    len_lineage = len(lineage)\n    for k, (tax_id, rank, rank_name) in enumerate(lineage):\n        # 
use the last rank of the lineage as the tax_id of the lineage\n        if k == len_lineage - 1:\n            output[\"tax_id\"] = tax_id\n\n        rank_key = _calc_rank_key(rank, output.keys())\n        output[rank_key] = rank_name\n    return output\n\n\ndef prepare_lineages_for_output(lineages: List[Lineage]) -> pd.DataFrame:\n    \"\"\"prepares lineages into a dataframe for writing to disk\"\"\"\n\n    with concurrent.futures.ProcessPoolExecutor() as executors:\n        out = executors.map(_convert_lineage_to_dict, lineages, chunksize=5000)\n\n    df_out = pd.DataFrame(out)\n\n    return df_out.sort_values(\"tax_id\")\n"
  },
  {
    "path": "ncbitax2lin/lineage.py",
    "content": "\"\"\"Utilities for finding lineages.\"\"\"\n\nimport logging\nimport math\nimport multiprocessing\nimport os\nimport pickle\nimport tempfile\nfrom typing import Dict, List\n\nfrom ncbitax2lin import utils\nfrom ncbitax2lin.struct import Lineage, TaxUnit\n\n_LOGGER = logging.getLogger(__name__)\n\n# tax_id of first line in names.dmp: no rank\nROOT_TAX_ID = 1\n\n\ndef _find_one_lineage(tax_id: int, tax_dict: Dict[int, TaxUnit]) -> Lineage:\n    \"\"\"Finds lineage for a single tax id\"\"\"\n    if tax_id % 50000 == 0:\n        # TODO: it's tricky why _LOGGER.info here won't make the log show up.\n        # Note, this function is run in a subprocess.\n        print(f\"working on tax_id: {tax_id}\")\n\n    lineage = []\n    while True:\n        record = tax_dict[tax_id]\n        lineage.append((record[\"tax_id\"], record[\"rank\"], record[\"rank_name\"]))\n        tax_id = record[\"parent_tax_id\"]\n\n        # every tax can be traced back to tax_id == 1, the root\n        if tax_id == ROOT_TAX_ID:\n            break\n\n    # reverse results in lineage of Kingdom => species, this is helpful for\n    # to_dict when there are multiple \"no rank\"s\n    lineage.reverse()\n    return Lineage(lineage)\n\n\ndef _find_lineages(\n    tax_ids: List[int], tax_dict: Dict[int, TaxUnit], output: str\n) -> None:\n    \"\"\"Finds lineages for a list of tax ids.\"\"\"\n\n    lineages = []\n    for tax_id in tax_ids:\n        lineage = _find_one_lineage(tax_id, tax_dict)\n        lineages.append(lineage)\n\n    with open(output, \"wb\") as opened:\n        pickle.dump(lineages, opened)\n\n\ndef _calc_num_procs(max_num: int = 6) -> int:\n    \"\"\"Calculates number of the processes to use.\"\"\"\n    return min(multiprocessing.cpu_count(), max_num)\n\n\ndef _calc_chunk_size(num_vals: int, num_chunks: int) -> int:\n    \"\"\"Calculates the chunk size.\"\"\"\n    return math.ceil(num_vals / num_chunks)\n\n\ndef find_all_lineages(\n    tax_ids: List[int], tax_dict: 
Dict[int, TaxUnit]\n) -> List[Lineage]:\n    \"\"\"Finds the lineages for all tax ids\n\n    Args:\n        tax_ids: all tax ids to find lineages for.\n        tax_dict: a dictionary of tax_id => tax_unit.\n    \"\"\"\n    nprocs = _calc_num_procs()\n    _LOGGER.info(\n        \"will use %d processes to find lineages for all %s tax ids\",\n        nprocs,\n        f\"{len(tax_ids):,d}\",\n    )\n\n    chunk_size = _calc_chunk_size(len(tax_ids), num_chunks=nprocs)\n    _LOGGER.info(\"chunk_size = %d\", chunk_size)\n\n    tax_id_chunks = utils.partition(tax_ids, size=chunk_size)\n    _LOGGER.info(\"chunked sizes: %s\", [len(_) for _ in tax_id_chunks])\n\n    procs, tmp_outputs, all_lineages = [], [], []\n\n    with tempfile.TemporaryDirectory(suffix=\"_ncbitax2lin\") as tmpdir:\n        for index, chunk in enumerate(tax_id_chunks):\n            tmp_output = os.path.join(tmpdir, f\"_lineages_{index}.pkl\")\n\n            tmp_outputs.append(tmp_output)\n            proc = multiprocessing.Process(\n                target=_find_lineages, args=(chunk, tax_dict, tmp_output)\n            )\n            procs.append(proc)\n\n        _LOGGER.info(\"Starting %d processes ...\", len(procs))\n        for proc in procs:\n            proc.start()\n\n        _LOGGER.info(\"Joining %d processes ...\", len(procs))\n        for proc in procs:\n            proc.join()\n\n        for tmp_output in tmp_outputs:\n            _LOGGER.info(\"adding lineages from %s ...\", tmp_output)\n            with open(tmp_output, \"rb\") as opened:\n                all_lineages.extend(pickle.load(opened))\n\n    assert len(all_lineages) == len(tax_ids), (\n        f\"There are {len(tax_ids)} tax_ids, but {len(all_lineages)} lineages are generated, \"\n        \"the two numbers should've been the same\"\n    )\n    return all_lineages\n"
  },
  {
    "path": "ncbitax2lin/ncbitax2lin.py",
    "content": "\"\"\"Converts NCBI taxonomy dump into lineages\"\"\"\n\nimport logging\nimport sys\nfrom typing import Dict, Optional\n\nimport fire\nimport pandas as pd\n\nfrom ncbitax2lin import data_io, fmt, lineage, utils\nfrom ncbitax2lin.struct import TaxUnit\n\nlogging.basicConfig(level=logging.DEBUG, format=\"%(asctime)s|%(levelname)s|%(message)s\")\n\n\n_LOGGER = logging.getLogger(__name__)\n\n\ndef _calc_taxonomy_dict(df_tax: pd.DataFrame) -> Dict[int, TaxUnit]:\n    \"\"\"Converts dataframe of df_tax into a dictionary with tax_id as the keys\"\"\"\n    return dict(zip(df_tax.tax_id.values, df_tax.to_dict(\"records\")))\n\n\ndef taxonomy_to_lineages(\n    nodes_file: str, names_file: str, output: Optional[str] = None\n) -> None:\n    \"\"\"Converts NCBI taxonomy dump into lineages.\n\n    NCBI taxonomy dump can be downloaded from\n    ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz\n\n    Args:\n        nodes_file: path/to/taxdump/nodes.dmp from NCBI taxonomy\n        names_file: path/to/taxdump/names.dmp from NCBI taxonomy\n        output: path to write the lineages to as a gzipped CSV file; defaults to\n            ncbi_lineages_[date_of_utcnow].csv.gz\n    \"\"\"\n    df_data = data_io.read_names_and_nodes(names_file, nodes_file)\n    _LOGGER.info(\"# of tax ids: %s\", f\"{df_data.shape[0]:,d}\")\n    _LOGGER.info(\"df.info:\\n%s\", f\"{utils.collect_df_info(df_data)}\")\n\n    _LOGGER.info(\"Generating a dictionary of taxonomy: tax_id => tax_unit ...\")\n    tax_dict = _calc_taxonomy_dict(df_data)\n\n    tax_dict_size_mb = sys.getsizeof(tax_dict) / 2**20\n    _LOGGER.info(\"size of taxonomy_dict: ~%s MB\", f\"{tax_dict_size_mb:.0f}\")\n\n    tax_ids = df_data.tax_id.to_numpy().tolist()\n\n    _LOGGER.info(\"Finding all lineages ...\")\n    all_lineages = lineage.find_all_lineages(tax_ids, tax_dict)\n\n    _LOGGER.info(\"Preparing all lineages into a dataframe to be written to disk ...\")\n    df_lineages = fmt.prepare_lineages_for_output(all_lineages)\n\n    if output is None:\n        output = 
f\"ncbi_lineages_{pd.Timestamp.utcnow().date()}.csv.gz\"\n\n    utils.maybe_backup_file(output)\n\n    _LOGGER.info(\"Writing lineages to %s ...\", output)\n    data_io.write_lineages_to_disk(df_lineages, output)\n\n\ndef main() -> None:\n    \"\"\"Main function, entry point\"\"\"\n    fire.Fire(taxonomy_to_lineages)\n"
  },
  {
    "path": "ncbitax2lin/struct.py",
    "content": "\"\"\"Data structures.\"\"\"\n\nfrom typing import List, NewType, Tuple\n\nfrom typing_extensions import TypedDict\n\n\nclass TaxUnit(TypedDict):\n    \"\"\"\n    Represents a basic unit in taxonomy e.g. (phylum, Proteobacteria), where\n    phylum is the rank, and Proteobacteria is the rank name\n    \"\"\"\n\n    tax_id: int\n    parent_tax_id: int  # tax_id of parent tax unit for this tax unit\n    rank: str\n    rank_name: str\n\n\n# A lineage is a list of (tax_id, rank, rank_name) tuples.\nLineage = NewType(\"Lineage\", List[Tuple[int, str, str]])\n"
  },
  {
    "path": "ncbitax2lin/utils.py",
    "content": "\"\"\"Utility functions\"\"\"\n\nimport datetime\nimport functools\nimport io\nimport logging\nimport os\nimport time\nfrom typing import Any, Callable, List, TypeVar\n\nimport pandas as pd\n\n_LOGGER = logging.getLogger(__name__)\n\n\ndef timeit(func: Callable[..., Any]) -> Callable[..., Any]:\n    \"\"\"Times a function, usually used as decorator\"\"\"\n\n    @functools.wraps(func)\n    def timed_func(*args: Any, **kwargs: Any) -> Any:\n        \"\"\"Returns the timed function\"\"\"\n        start_time = time.time()\n        result = func(*args, **kwargs)\n        elapsed_time = datetime.timedelta(seconds=time.time() - start_time)\n        _LOGGER.info(\"time spent on %s: %s\", func.__name__, elapsed_time)\n        return result\n\n    return timed_func\n\n\ndef maybe_backup_file(filepath: str) -> None:\n    \"\"\"\n    Back up a file, old_file will be renamed to #old_file.n#, where n is a\n    number incremented each time a backup takes place\n    \"\"\"\n    backup = None\n    if os.path.exists(filepath):\n        dirname = os.path.dirname(filepath)\n        basename = os.path.basename(filepath)\n        count = 1\n        backup = os.path.join(dirname, f\"#{basename}.{count}#\")\n        while os.path.exists(backup):\n            count += 1\n            backup = os.path.join(dirname, f\"#{basename}.{count}#\")\n        _LOGGER.info(\"Backing up %s to %s\", filepath, backup)\n        os.rename(filepath, backup)\n\n\nElemType = TypeVar(\"ElemType\")  # pylint: disable=invalid-name\n\n\ndef partition(vals: List[ElemType], size: int) -> List[List[ElemType]]:\n    \"\"\"Partition a list into a list of lists by size.\"\"\"\n    return [vals[i : i + size] for i in range(0, len(vals), size)]\n\n\ndef collect_df_info(df_data: pd.DataFrame) -> str:\n    \"\"\"Collects information of a dataframe\"\"\"\n    buf = io.StringIO()\n    df_data.info(buf=buf, verbose=True, memory_usage=\"deep\")\n    return buf.getvalue()\n"
  },
  {
    "path": "pylintrc",
    "content": "[MESSAGES CONTROL]\ndisable=fixme, duplicate-code\n"
  },
  {
    "path": "pyproject.toml",
    "content": "[tool.poetry]\nname = \"ncbitax2lin\"\nversion = \"3.0.0\"\ndescription = \"A tool that converts NCBI taxonomy dump into lineages\"\nauthors = [\"Zhuyi Xue <zhuyi.xue@alum.utoronto.ca>\"]\nreadme = \"README.md\"\nhomepage = \"https://github.com/zyxue/ncbitax2lin\"\nlicense = \"MIT\"\n\n[tool.poetry.dependencies]\nfire = \"^0.7.1\"\npandas = \"^2.3.2\"\npython = \"^3.9,<3.14\"\ntyping-extensions = \"^4.15.0\"\n\n[tool.poetry.dev-dependencies]\nautoflake = \"^1.3.1\"\nblack = \"^22.1.0\"\ncoverage = \"^7.5.4\"\nisort = \"^5.7.0\"\nmypy = \"^1.18.2\"\npylint = \"^3.3.8\"\npytest = \"^8.4.2\"\npytest-parallel = \"^0.1.0\"\ntox = \"^3.21.4\"\n\n[tool.poetry.scripts]\nncbitax2lin = \"ncbitax2lin.ncbitax2lin:main\"\n\n[build-system]\nrequires = [\"poetry>=0.12\"]\nbuild-backend = \"poetry.masonry.api\"\n\n# https://pycqa.github.io/isort/docs/configuration/black_compatibility/\n[tool.isort]\nprofile = \"black\"\nmulti_line_output = 3\nknown_first_party = [\"ncbitax2lin\"]"
  },
  {
    "path": "tests/__init__.py",
    "content": ""
  },
  {
    "path": "tests/test___init__.py",
    "content": "\"\"\"tests for __init__.py\"\"\"\n\n# pylint: disable=protected-access, missing-function-docstring\nfrom ncbitax2lin import __version__\n\n\ndef test_version() -> None:\n    assert __version__ == \"2.4.1\"\n"
  },
  {
    "path": "tests/test_data/names.head_20.dmp",
    "content": "1\t|\tall\t|\t\t|\tsynonym\t|\n1\t|\troot\t|\t\t|\tscientific name\t|\n2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n2\t|\tMonera\t|\tMonera <bacteria>\t|\tin-part\t|\n2\t|\tProcaryotae\t|\tProcaryotae <bacteria>\t|\tin-part\t|\n2\t|\tProkaryota\t|\tProkaryota <bacteria>\t|\tin-part\t|\n2\t|\tProkaryotae\t|\tProkaryotae <bacteria>\t|\tin-part\t|\n2\t|\tbacteria\t|\t\t|\tblast name\t|\n2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n2\t|\tprokaryote\t|\tprokaryote <bacteria>\t|\tin-part\t|\n2\t|\tprokaryotes\t|\tprokaryotes <bacteria>\t|\tin-part\t|\n6\t|\tAzorhizobium\t|\t\t|\tscientific name\t|\n6\t|\tAzorhizobium Dreyfus et al. 1988 emend. Lang et al. 2013\t|\t\t|\tauthority\t|\n7\t|\tATCC 43989\t|\tATCC 43989 <type strain>\t|\ttype material\t|\n7\t|\tAzorhizobium caulinodans\t|\t\t|\tscientific name\t|\n7\t|\tAzorhizobium caulinodans Dreyfus et al. 1988\t|\t\t|\tauthority\t|\n7\t|\tAzotirhizobium caulinodans\t|\t\t|\tequivalent name\t|\n7\t|\tCCUG 26647\t|\tCCUG 26647 <type strain>\t|\ttype material\t|\n7\t|\tDSM 5975\t|\tDSM 5975 <type strain>\t|\ttype material\t|\n7\t|\tIFO 14845\t|\tIFO 14845 <type strain>\t|\ttype material\t|\n"
  },
  {
    "path": "tests/test_data/nodes.head_20.dmp",
    "content": "1\t|\t1\t|\tno rank\t|\t\t|\t8\t|\t0\t|\t1\t|\t0\t|\t0\t|\t0\t|\t0\t|\t0\t|\t\t|\n2\t|\t131567\t|\tsuperkingdom\t|\t\t|\t0\t|\t0\t|\t11\t|\t0\t|\t0\t|\t0\t|\t0\t|\t0\t|\t\t|\n6\t|\t335928\t|\tgenus\t|\t\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t0\t|\t0\t|\t\t|\n7\t|\t6\t|\tspecies\t|\tAC\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n9\t|\t32199\t|\tspecies\t|\tBA\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n10\t|\t1706371\t|\tgenus\t|\t\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t0\t|\t0\t|\t\t|\n11\t|\t1707\t|\tspecies\t|\tCG\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n13\t|\t203488\t|\tgenus\t|\t\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t0\t|\t0\t|\t\t|\n14\t|\t13\t|\tspecies\t|\tDT\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n16\t|\t32011\t|\tgenus\t|\t\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t0\t|\t0\t|\t\t|\n17\t|\t16\t|\tspecies\t|\tMM\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n18\t|\t213421\t|\tgenus\t|\t\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t0\t|\t0\t|\t\t|\n19\t|\t18\t|\tspecies\t|\tPC\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n20\t|\t76892\t|\tgenus\t|\t\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t0\t|\t0\t|\t\t|\n21\t|\t20\t|\tspecies\t|\tPI\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n22\t|\t267890\t|\tgenus\t|\t\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t0\t|\t0\t|\t\t|\n23\t|\t22\t|\tspecies\t|\tSC\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n24\t|\t22\t|\tspecies\t|\tSP\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n25\t|\t22\t|\tspecies\t|\tSH\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n27\t|\t49928\t|\tspecies\t|\tHE\t|\t0\t|\t1\t|\t11\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n"
  },
  {
    "path": "tests/test_data_io.py",
    "content": "\"\"\"tests for data_io.py\"\"\"\n# pylint: disable=protected-access, missing-function-docstring\n\nfrom pathlib import Path\n\nimport pandas as pd\n\nfrom ncbitax2lin import data_io\n\n\ndef test_load_nodes() -> None:\n    # top 20 lines of nodes.dmp from NCBI\n    test_input = (Path(__file__).parent / \"./test_data/nodes.head_20.dmp\").as_posix()\n    actual = data_io.load_nodes(test_input)\n    assert isinstance(actual, pd.DataFrame)\n\n\ndef test_load_names() -> None:\n    # top 20 lines of names.dmp from NCBI\n    test_input = (Path(__file__).parent / \"./test_data/names.head_20.dmp\").as_posix()\n    actual = data_io.load_names(test_input)\n    assert isinstance(actual, pd.DataFrame)\n"
  },
  {
    "path": "tests/test_fmt.py",
    "content": "\"\"\"tests for fmt.py\"\"\"\n# pylint: disable=missing-function-docstring, protected-access\nfrom typing import Container\n\nimport pytest\n\nfrom ncbitax2lin import fmt\n\n\n@pytest.mark.parametrize(\n    \"test_input_rank, test_input_existing_ranks, expected\",\n    [\n        (\"no rank\", {}, \"no rank\"),\n        (\"no rank\", {\"some other rank\"}, \"no rank\"),\n        (\"no rank\", {\"no rank\"}, \"no rank1\"),\n        (\"rankx\", [\"rankx\"], \"rankx1\"),\n        (\"rankx\", [\"rankx\", \"rankx1\"], \"rankx2\"),\n    ],\n)\ndef test__calc_rank_key(\n    test_input_rank: str, test_input_existing_ranks: Container[str], expected: str\n) -> None:\n    actual = fmt._calc_rank_key(test_input_rank, test_input_existing_ranks)\n    assert actual == expected\n"
  },
  {
    "path": "tests/test_lineage.py",
    "content": "\"\"\"tests for lineage.py\"\"\"\n# pylint: disable=missing-function-docstring, protected-access\n\nfrom unittest.mock import MagicMock, patch\n\nimport pytest\n\nfrom ncbitax2lin import lineage\n\n\n@patch(\"multiprocessing.cpu_count\", return_value=999, autospec=True)\ndef test__calc_num_procs(mock_cpu_count: MagicMock) -> None:\n    actual = lineage._calc_num_procs()\n    expected = 6\n    assert actual == expected\n    mock_cpu_count.assert_called_once_with()\n\n\n@pytest.mark.parametrize(\n    \"num_vals, num_chunks, chunk_size\",\n    [\n        (10, 3, 4),\n        (11, 3, 4),\n        (12, 3, 4),\n        (13, 3, 5),\n        (14, 3, 5),\n        (15, 3, 5),\n        (16, 3, 6),\n    ],\n)\ndef test__calc_chunk_size_procs(\n    num_vals: int, num_chunks: int, chunk_size: int\n) -> None:\n    actual = lineage._calc_chunk_size(num_vals, num_chunks)\n    expected = chunk_size\n    assert actual == expected\n    assert isinstance(chunk_size, int)\n"
  },
  {
    "path": "tests/test_ncbitax2lin.py",
    "content": "\"\"\"tests for ncbitax2lin.py\"\"\"\n# pylint: disable=protected-access, missing-function-docstring\n\n\nimport pandas as pd\n\nfrom ncbitax2lin import ncbitax2lin\n\n\ndef test__calc_taxonomy_dict() -> None:\n    df_data = pd.DataFrame(\n        {\n            \"tax_id\": [1, 2, 6],\n            \"parent_tax_id\": [1, 131567, 335928],\n            \"rank\": [\"no rank\", \"superkingdom\", \"genus\"],\n            \"rank_name\": [\n                \"root\",\n                \"Bacteria\",\n                \"Azorhizobium\",\n            ],\n        }\n    )\n\n    actual = ncbitax2lin._calc_taxonomy_dict(df_data)\n    expected = {\n        1: {\"tax_id\": 1, \"parent_tax_id\": 1, \"rank\": \"no rank\", \"rank_name\": \"root\"},\n        2: {\n            \"tax_id\": 2,\n            \"parent_tax_id\": 131567,\n            \"rank\": \"superkingdom\",\n            \"rank_name\": \"Bacteria\",\n        },\n        6: {\n            \"tax_id\": 6,\n            \"parent_tax_id\": 335928,\n            \"rank\": \"genus\",\n            \"rank_name\": \"Azorhizobium\",\n        },\n    }\n\n    assert actual == expected\n"
  },
  {
    "path": "tests/test_utils.py",
    "content": "\"\"\"tests for utils.py\"\"\"\n\n# pylint: disable=protected-access, missing-function-docstring\n\nimport os\nfrom typing import List\nfrom unittest.mock import MagicMock, call, patch\n\nimport pytest\n\nfrom ncbitax2lin import utils\n\n\ndef test_maybe_backup_file_when_file_path_does_not_exist() -> None:\n    with patch(\"os.path.exists\", return_value=False) as mock_exists:\n        test_input = \"some_non_existing_file\"\n        utils.maybe_backup_file(test_input)\n        mock_exists.assert_called_once_with(test_input)\n\n\n@patch(\"os.rename\", spec=os.rename)\n@patch(\"os.path.exists\")\ndef test_maybe_backup_file_when_file_path_exists(\n    mock_exists: MagicMock, mock_rename: MagicMock\n) -> None:\n    mock_exists.side_effect = [True, False]\n    test_input = \"some_existing_file\"\n\n    utils.maybe_backup_file(test_input)\n    expected = \"#some_existing_file.1#\"\n\n    mock_exists.assert_has_calls([call(test_input), call(expected)])\n    mock_rename.assert_called_once_with(test_input, expected)\n\n\n@patch(\"os.rename\", spec=os.rename)\n@patch(\"os.path.exists\")\ndef test_maybe_backup_file_when_backfile_also_exists(\n    mock_exists: MagicMock, mock_rename: MagicMock\n) -> None:\n    mock_exists.side_effect = [True, True, False]\n    test_input = \"some_existing_file\"\n    intermediary_input = \"#some_existing_file.1#\"\n\n    utils.maybe_backup_file(test_input)\n    expected = \"#some_existing_file.2#\"\n\n    mock_exists.assert_has_calls(\n        [call(test_input), call(intermediary_input), call(expected)]\n    )\n    mock_rename.assert_called_once_with(test_input, expected)\n\n\n@pytest.mark.parametrize(\n    \"test_input, size, expected\",\n    [\n        ([1, 2, 3], 3, [[1, 2, 3]]),\n        ([1, 2, 3], 2, [[1, 2], [3]]),\n        ([1, 2, 3, 4], 2, [[1, 2], [3, 4]]),\n        ([1, 2, 3, 4, 5], 2, [[1, 2], [3, 4], [5]]),\n        ([1, 2, 3, 4, 5], 3, [[1, 2, 3], [4, 5]]),\n    ],\n)\ndef test__partition(\n    test_input: 
List[int],\n    size: int,\n    expected: List[List[int]],\n) -> None:\n    actual = utils.partition(test_input, size)\n    assert actual == expected\n"
  },
  {
    "path": "tox.ini",
    "content": "[tox]\nisolated_build = True\nenvlist = py39,py310,py311,py312,py313\n\n[testenv]\nallowlist_externals =\n    poetry\n    pytest\ncommands =\n    poetry install --verbose"
  }
]