Repository: snap-stanford/UCE
Branch: main
Commit: 8ead6e07af0c
Files: 17
Total size: 374.1 KB
Directory structure:
gitextract_vi_txbci/
├── LICENSE
├── README.md
├── data_proc/
│ ├── Create New Species Files.ipynb
│ ├── data_utils.py
│ ├── download_proc_czi_cxg.py
│ ├── gene_embeddings.py
│ ├── generate_reduced_chrom_files.py
│ └── preproc_many_dataset.py
├── eval_data.py
├── eval_single_anndata.py
├── evaluate.py
├── examples/
│ ├── Benchmark Embeddings with scIB.ipynb
│ └── Label Transfer Using Logistic Classifier.ipynb
├── model.py
├── model_files/
│ └── new_species_protein_embeddings.csv
├── requirements.txt
└── utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2023 Yanay Rosen, Yusuf Roohani, Jure Leskovec
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Universal Cell Embeddings
This repo includes a PyTorch [HuggingFace Accelerator](https://huggingface.co/docs/accelerate/package_reference/accelerator) implementation of the UCE model, to be used to embed individual anndata datasets.
## Installation
```
pip install -r requirements.txt
```
## Embedding a new dataset
To generate an embedding for a new single-cell RNA sequencing dataset in the AnnData format, use the `eval_single_anndata.py` script.
```
python eval_single_anndata.py --adata_path {path_to_anndata} --dir {output_dir} --species {species} --model_loc {model_loc} --batch_size {batch_size}
```
where
- `adata_path`: path to an h5ad file. The `.X` slot of the file should contain scRNA-seq counts. The `.var_names` slot should correspond to gene names, *not Ensembl IDs*.
- `dir`: the working directory where intermediate and final output files are saved; reusing it avoids repeated processing of the same dataset.
- `species`: the species of the dataset you are embedding.
- `model_loc`: the location of the model weights `.torch` file.
- `batch_size`: the per-GPU batch size. For the 33-layer model on an 80GB GPU, use 25. For the 4-layer model on the same GPU, you can use 100.
For a sample output on the 10k pbmc dataset, run
```
python eval_single_anndata.py
```
All necessary model files will be downloaded automatically.
**Note**: This script makes use of additional files, which are described in the code documentation. These are downloaded automatically unless already present in the working directory. The script defaults to the pretrained 4-layer model. For running the pretrained 33-layer model from the paper, please download using this [link](https://figshare.com/articles/dataset/Universal_Cell_Embedding_Model_Files/24320806?file=43423236) and set `--nlayers 33`.
## Output
Final evaluated AnnData: `dir/{dataset_name}.h5ad`. This AnnData will be
identical to the processed input AnnData, but with UCE embeddings added in the `.obsm["X_uce"]` slot.
Please see documentation for information on additional output files. All
outputs from `eval_single_anndata.py` are stored in the `dir` directory.
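For example, the embeddings can be loaded back with scanpy and used as a low-dimensional representation of the cells. The snippet below is a minimal sketch; the file name is a placeholder following the `dir/{dataset_name}.h5ad` pattern described above.
```
import scanpy as sc

adata = sc.read_h5ad("output_dir/10k_pbmcs_proc.h5ad")  # placeholder path
X_uce = adata.obsm["X_uce"]              # cells x UCE embedding dimensions
sc.pp.neighbors(adata, use_rep="X_uce")  # e.g. build a neighbor graph on UCE space
sc.tl.umap(adata)
```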
## Data
You can download processed datasets used in the paper [here](https://drive.google.com/drive/folders/1f63fh0ykgEhCrkd_EVvIootBw7LYDVI7?usp=drive_link).
**Note:** These datasets were embedded using the 33 layer model. Embeddings for the 33 layer model are not compatible with embeddings from the 4 layer model.
## Citing
If you find our paper and code useful, please consider citing the [preprint](https://www.biorxiv.org/content/10.1101/2023.11.28.568918v1):
```
@article{rosen2023universal,
title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
author={Rosen, Yanay and Roohani, Yusuf and Agrawal, Ayush and Samotorcan, Leon and Consortium, Tabula Sapiens and Quake, Stephen R and Leskovec, Jure},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
```
## Analyses
Please see the [reproduce repo](https://github.com/yhr91/uce_reproduce/tree/master) for analyses, figures, and datasets from the paper.
================================================
FILE: data_proc/Create New Species Files.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "0e4018ee",
"metadata": {},
"source": [
"# Embedding Novel Species\n",
"\n",
"This notebook will create the files you need to embed a novel species that wasn't included in the training data.\n",
"\n",
"To start, you will need to download the ESM2 protein embeddings and the reference proteome for the species.\n",
"\n",
"You can find precalculated ESM2 protein embeddings for many species [here](https://drive.google.com/drive/folders/1_Dz7HS5N3GoOAG6MdhsXWY1nwLoN13DJ?usp=drive_link)\n",
"\n",
"For reference proteomes, you can download them from [here](https://useast.ensembl.org/info/about/species.html).\n",
"\n",
"If there is no protein embedding for the species you are interested in, you can request to have it made via Github or email, or you can create it yourself following instructions [here](https://github.com/snap-stanford/SATURN/tree/main/protein_embeddings)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ab368d92",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pickle as pkl\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c9a306f3",
"metadata": {},
"outputs": [],
"source": [
"SPECIES_NAME = \"chicken\" # short hand name for this species, will be used in arguments and files\n",
"\n",
"# Path to the species proteome\n",
"SPECIES_PROTEIN_FASTA_PATH = \"../../../SATURN/protein_embeddings/data/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.pep.all.fa\"\n",
"\n",
"# Path to the ESM2 Embeddings\n",
"SPECIES_PROTEIN_EMBEDDINGS_PATH = \"../model_files/protein_embeddings/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.pep.all.gene_symbol_to_embedding_ESM2.pt\"\n",
"\n",
"# primary_assembly name, this needs to be matched to the FASTA file\n",
"ASSEMBLY_NAME = \"bGalGal1.mat.broiler.GRCg7b\"\n",
"# NCBI Taxonomy ID, please set this so that if someone else also embeds the same species,\n",
"# randomly generated chromosome tokens will be the same\n",
"TAXONOMY_ID = 9031"
]
},
{
"cell_type": "markdown",
"id": "e5d37e52",
"metadata": {},
"source": [
"You can view the FASTA format here, please confirm the primary_assembly name is correct."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2ecf1464",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">ENSGALP00010000002.1 pep primary_assembly:bGalGal1.mat.broiler.GRCg7b:MT:2824:3798:1 gene:ENSGALG00010000007.1 transcript:ENSGALT00010000007.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ND1 description:NADH dehydrogenase subunit 1 [Source:NCBI gene (formerly Entrezgene);Acc:63549479]\r\n",
"MTLPTLTNLLIMTLSYILPILIAVAFLTLVERKILSYMQARKGPNIVGPFGLLQPVADGV\r\n",
"KLFIKEPIRPSTSSPFLFIITPILALLLALTIWVPLPLPFPLADLNLGLLFLLAMSSLTV\r\n",
"YSLLWSGWASNSKYALIGALRAVAQTISYEVTLAIILLSTIMLSGNYTLSTLAITQEPIY\r\n",
"LIFSAWPLAMMWYISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFAMFFLAEYANIML\r\n",
"MNTLTTVLFLNPSFLNLPPELFPIALATKTLLLSSSFLWIRASYPRFRYDQLMHLLWKNF\r\n",
"LPLTLALCLWHTSMPISYAGLPPI\r\n",
">ENSGALP00010000003.1 pep primary_assembly:bGalGal1.mat.broiler.GRCg7b:MT:4015:5053:1 gene:ENSGALG00010000011.1 transcript:ENSGALT00010000011.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ND2 description:NADH dehydrogenase subunit 2 [Source:NCBI gene (formerly Entrezgene);Acc:63549482]\r\n",
"MNPHAKLICTVSLIMGTSITISSNHWILAWTGLEINTLAIIPLISKSHHPRAIEATIKYF\r\n",
"LTQSTASALILFSSMTNAWSTGQWDITQLNHPTSCLMLTMAIAIKLGLVPFHFWFPEVLQ\r\n"
]
}
],
"source": [
"!head {SPECIES_PROTEIN_FASTA_PATH}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "90540d0b",
"metadata": {},
"outputs": [],
"source": [
"species_to_paths = {\n",
" SPECIES_NAME: SPECIES_PROTEIN_FASTA_PATH,\n",
"}\n",
"\n",
"species_to_ids = {\n",
" SPECIES_NAME: ASSEMBLY_NAME,\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "623b99cf",
"metadata": {},
"outputs": [],
"source": [
"all_pos_def = []\n",
"\n",
"missing_genes = {}\n",
"for species in species_to_ids.keys():\n",
" missing_genes[species] = []\n",
" proteome_path = species_to_paths[species]\n",
" species_id = species_to_ids[species]\n",
"\n",
" with open(proteome_path) as f:\n",
" proteome_lines = f.readlines()\n",
"\n",
" gene_symbol_to_location = {}\n",
" gene_symbol_to_chrom = {}\n",
"\n",
" for line in proteome_lines:\n",
" if line.startswith(\">\"):\n",
" split_line = line.split()\n",
" gene_symbol = [token for token in split_line if token.startswith(\"gene_symbol\")]\n",
" if len(gene_symbol) > 0:\n",
" gene_symbol = gene_symbol[0].split(\":\")\n",
" \n",
" if len(gene_symbol) == 2:\n",
" gene_symbol = gene_symbol[1]\n",
" elif len(gene_symbol) > 2:\n",
" gene_symbol = \":\".join(gene_symbol[1:]) # fix for annoying zebrafish gene names with colons in them\n",
" else:\n",
" 1/0 # something weird happening, throw an error\n",
" \n",
" \n",
" chrom = None\n",
" \n",
" chrom_arr = [token for token in split_line if token.startswith(\"chromosome:\")]\n",
" if len(chrom_arr) > 0:\n",
" chrom = chrom_arr[0].replace(\"chromosome:\", \"\")\n",
" else:\n",
" chrom_arr = [token for token in split_line if token.startswith(\"primary_assembly:\")]\n",
" if len(chrom_arr) > 0:\n",
" chrom = chrom_arr[0].replace(\"primary_assembly:\", \"\")\n",
" else:\n",
" chrom_arr = [token for token in split_line if token.startswith(\"scaffold:\")] \n",
" if len(chrom_arr) > 0:\n",
" chrom = chrom_arr[0].replace(\"scaffold:\", \"\")\n",
" if chrom is not None:\n",
" gene_symbol_to_location[gene_symbol] = chrom.split(\":\")[2]\n",
" gene_symbol_to_chrom[gene_symbol] = chrom.split(\":\")[1]\n",
" else:\n",
" missing_genes[species].append(gene_symbol)\n",
" \n",
"\n",
" positional_df = pd.DataFrame()\n",
" positional_df[\"gene_symbol\"] = [gn.upper() for gn in list(gene_symbol_to_chrom.keys())]\n",
" positional_df[\"chromosome\"] = list(gene_symbol_to_chrom.values())\n",
" positional_df[\"start\"] = list(gene_symbol_to_location.values())\n",
" positional_df = positional_df.sort_values([\"chromosome\", \"start\"])\n",
" #positional_df = positional_df.set_index(\"gene_symbol\")\n",
" positional_df[\"species\"] = species\n",
" all_pos_def.append(positional_df)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b72887b3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>gene_symbol</th>\n",
" <th>chromosome</th>\n",
" <th>start</th>\n",
" <th>species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2327</th>\n",
" <td>GCC1</td>\n",
" <td>1</td>\n",
" <td>1006145</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2502</th>\n",
" <td>NCAM2</td>\n",
" <td>1</td>\n",
" <td>100828671</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3084</th>\n",
" <td>ENS-2</td>\n",
" <td>1</td>\n",
" <td>101147482</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2331</th>\n",
" <td>DENND6B</td>\n",
" <td>1</td>\n",
" <td>1012031</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3973</th>\n",
" <td>MRPL39</td>\n",
" <td>1</td>\n",
" <td>102578362</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4722</th>\n",
" <td>CA9</td>\n",
" <td>Z</td>\n",
" <td>9779343</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4738</th>\n",
" <td>ARHGEF39</td>\n",
" <td>Z</td>\n",
" <td>9835547</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3885</th>\n",
" <td>MRPL17</td>\n",
" <td>Z</td>\n",
" <td>9850679</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4172</th>\n",
" <td>CCBE1</td>\n",
" <td>Z</td>\n",
" <td>9852827</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3293</th>\n",
" <td>PMAIP1</td>\n",
" <td>Z</td>\n",
" <td>9998272</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>13271 rows × 4 columns</p>\n",
"</div>"
],
"text/plain": [
" gene_symbol chromosome start species\n",
"2327 GCC1 1 1006145 chicken\n",
"2502 NCAM2 1 100828671 chicken\n",
"3084 ENS-2 1 101147482 chicken\n",
"2331 DENND6B 1 1012031 chicken\n",
"3973 MRPL39 1 102578362 chicken\n",
"... ... ... ... ...\n",
"4722 CA9 Z 9779343 chicken\n",
"4738 ARHGEF39 Z 9835547 chicken\n",
"3885 MRPL17 Z 9850679 chicken\n",
"4172 CCBE1 Z 9852827 chicken\n",
"3293 PMAIP1 Z 9998272 chicken\n",
"\n",
"[13271 rows x 4 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"master_pos_def = pd.concat(all_pos_def)\n",
"master_pos_def"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6d9dac28",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"chicken 13271\n",
"Name: species, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"master_pos_def[\"species\"].value_counts() # double check how many genes are mapped"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4a3d45c2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"chicken: 0\n"
]
}
],
"source": [
"for k, v in missing_genes.items():\n",
" print(f\"{k}: {len(v)}\") # are any genes missing?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c59774b1",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*********\n",
"chicken\n"
]
},
{
"data": {
"text/plain": [
"1 1785\n",
"2 1169\n",
"3 1067\n",
"4 953\n",
"5 817\n",
"Z 629\n",
"6 458\n",
"8 450\n",
"7 442\n",
"9 382\n",
"10 366\n",
"14 359\n",
"11 327\n",
"15 326\n",
"13 306\n",
"20 298\n",
"12 293\n",
"19 278\n",
"18 274\n",
"17 260\n",
"26 237\n",
"28 237\n",
"27 235\n",
"21 226\n",
"23 214\n",
"25 176\n",
"34 155\n",
"24 149\n",
"22 142\n",
"16 54\n",
"30 52\n",
"38 49\n",
"31 14\n",
"MT 13\n",
"39 10\n",
"JAENSK010000484.1 7\n",
"35 6\n",
"JAENSK010000592.1 6\n",
"W 5\n",
"MU179278.1 5\n",
"MU179279.1 4\n",
"36 3\n",
"JAENSK010000483.1 3\n",
"JAENSK010000585.1 3\n",
"JAENSK010000593.1 2\n",
"MU179258.1 2\n",
"MU179272.1 2\n",
"MU179273.1 2\n",
"JAENSK010000584.1 2\n",
"JAENSK010000656.1 1\n",
"Name: chromosome, dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"*********\n"
]
}
],
"source": [
"# Count genes per chromosome\n",
"for species in species_to_ids.keys():\n",
" print(\"*********\")\n",
" print(species)\n",
" display(master_pos_def[master_pos_def[\"species\"] == species][\"chromosome\"].value_counts().head(50))\n",
" print(\"*********\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "541baded",
"metadata": {},
"outputs": [],
"source": [
"master_pos_def.to_csv(f\"{SPECIES_NAME}_to_chrom_pos.csv\", index=False) # Save the DF"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "eabd0e31",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"chicken_to_chrom_pos.csv\n"
]
}
],
"source": [
"# The chromosome file path will be:\n",
"print(f\"{SPECIES_NAME}_to_chrom_pos.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "fe1345b1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"66"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"N_UNIQ_CHROM = len(master_pos_def[master_pos_def[\"species\"] == species][\"chromosome\"].unique())\n",
"N_UNIQ_CHROM"
]
},
{
"cell_type": "markdown",
"id": "e37e277f",
"metadata": {},
"source": [
"# Generate token file"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d6904975",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import pickle\n",
"token_dim = 5120"
]
},
{
"cell_type": "markdown",
"id": "a2798848",
"metadata": {},
"source": [
"This will create the token file. Please note the offset value."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4355dabd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CHROM_TOKEN_OFFSET: 13275\n",
"Saved PE, offsets file\n"
]
}
],
"source": [
"species_to_offsets = {}\n",
"\n",
"all_pe = torch.load(\"../model_files/all_tokens.torch\")[0:4] # read in existing token file to make sure \n",
"# that special vocab tokens are the same for different seeds\n",
"\n",
"offset = len(all_pe) # special tokens at the top!\n",
"\n",
"PE = torch.load(SPECIES_PROTEIN_EMBEDDINGS_PATH)\n",
"\n",
"pe_stacked = torch.stack(list(PE.values()))\n",
"all_pe = torch.vstack((all_pe, pe_stacked))\n",
"species_to_offsets[species] = offset\n",
"\n",
"print(\"CHROM_TOKEN_OFFSET:\", all_pe.shape[0])\n",
"torch.manual_seed(TAXONOMY_ID)\n",
"CHROM_TENSORS = torch.normal(mean=0, std=1, size=(N_UNIQ_CHROM, 5120)) \n",
"# N_UNIQ_CHROM is the total number of chromosome choices, it is hardcoded for now (for species in the training data)\n",
"all_pe = torch.vstack(\n",
" (all_pe, CHROM_TENSORS)) # Add the chrom tensors to the end\n",
"all_pe.requires_grad = False\n",
"\n",
"\n",
"torch.save(all_pe, f\"{SPECIES_NAME}_pe_tokens.torch\")\n",
"\n",
"with open(f\"{SPECIES_NAME}_offsets.pkl\", \"wb+\") as f:\n",
" pickle.dump(species_to_offsets, f)\n",
"print(\"Saved PE, offsets file\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c26fe491",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([13341, 5120])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_pe.shape"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "21f937ea",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([13341, 5120])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_pe.shape"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "5faadace",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"chicken_offsets.pkl\n"
]
}
],
"source": [
"print(f\"{SPECIES_NAME}_offsets.pkl\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "6ceac20b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'../model_files/protein_embeddings/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.pep.all.gene_symbol_to_embedding_ESM2.pt'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SPECIES_PROTEIN_EMBEDDINGS_PATH"
]
},
{
"cell_type": "markdown",
"id": "e4697330",
"metadata": {},
"source": [
"# Example evaluation of new species"
]
},
{
"cell_type": "markdown",
"id": "2b72667d",
"metadata": {},
"source": [
"**Note: when you evaluate a new species, you need to change some arguments and modify some files:**\n",
"\n",
"You will need to modify the csv in `model_files/new_species_protein_embeddings.csv` to include the new protein embeddings file you downloaded.\n",
"\n",
"In the file add a row for the new species with the format:\n",
"`species name,full path to protein embedding file`\n",
"\n",
"Please also add this line to the dictionary created on line 247 in the file `data_proc/data_utils.py`.\n",
"\n",
"When you want to embed this new species, you will need to specify these newly created files as arguments.\n",
"- `CHROM_TOKEN_OFFSET`: This tells UCE when the rows corresponding to chromosome tokens starts.\n",
"- `spec_chrom_csv_path`: This is a new csv, created by this script, which maps genes to chromosomes and genomic positions\n",
"- `token_file`: This is a new token file that will work just for this species. The embeddings generated will still be universal though!\n",
"- `offset_pkl_path`: This is another file that maps genes to tokens\n",
"\n",
"\n",
"```\n",
"\n",
"accelerate launch eval_single_anndata.py chicken_heart.h5ad --species=chicken --CHROM_TOKEN_OFFSET=13275 --spec_chrom_csv_path=data_proc/chicken_to_chrom_pos.csv --token_file=data_proc/chicken_pe_tokens.torch --offset_pkl_path=data_proc/chicken_offsets.pkl --dir=... --multi_gpu=True\n",
"\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: data_proc/data_utils.py
================================================
import warnings
warnings.filterwarnings("ignore")
import scanpy as sc
import torch
from torch import nn, Tensor
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim
import numpy as np
import pickle
import os
import subprocess
import argparse
import logging
import time
from tqdm.auto import tqdm
import pandas as pd
import math
import anndata
from pathlib import Path
from torch.utils.data import dataset
from torch.utils.data import DataLoader, TensorDataset, dataset
from scipy.stats import binom
from typing import Dict, List, Optional, Tuple
from scanpy import AnnData
from data_proc.gene_embeddings import load_gene_embeddings_adata
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
class SincleCellDataset(data.Dataset):
def __init__(self,
expression: torch.tensor, # Subset to hv genes, count data! cells x genes
protein_embeddings: torch.tensor, # same order as expression, also subset genes x pe
labels: None, # optional, tensor of labels
covar_vals: None, # tensor of covar values or none
) -> None:
super(SincleCellDataset, self).__init__()
# Set expression
self.expression = expression
row_sums = self.expression.sum(1) # UMI Counts
log_norm_count_adj = torch.log1p(self.expression / (self.expression.sum(1)).unsqueeze(1) * torch.tensor(1000))
# Set log norm and count adjusted expression
max_vals, max_idx = torch.max(log_norm_count_adj, dim=0)
self.expression_mod = log_norm_count_adj / max_vals
        # Calculate dropout likelihoods of each gene
self.dropout_vec = (self.expression == 0).float().mean(0) # per gene dropout percentages
# Set data info
self.num_cells = self.expression.shape[0]
self.num_genes = self.expression.shape[1]
# Set optional label info, including categorical covariate index
self.covar_vals = covar_vals
self.labels = labels
# Set protein embeddings
self.protein_embeddings = protein_embeddings
self.item_mode = "expression"
if self.covar_vals is not None:
self.item_mode = "expression+covar"
def __getitem__(self, idx):
if self.item_mode == "expression":
if isinstance(idx, int):
if idx < self.num_cells:
return self.expression[idx, :]
else:
raise IndexError
else:
raise NotImplementedError
elif self.item_mode == "expression+covar":
if isinstance(idx, int):
if idx < self.num_cells:
return self.expression[idx, :], self.covar_vals[idx]
else:
raise IndexError
else:
raise NotImplementedError
def __len__(self) -> int:
return self.num_cells
def get_dim(self) -> Dict[str, int]:
return self.num_genes
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
def anndata_to_sc_dataset(adata:sc.AnnData,
species:str="human",
labels:list=[],
covar_col:str=None,
hv_genes=None,
embedding_model="ESM2",
) -> (SincleCellDataset, AnnData):
# Subset to just genes we have embeddings for
adata, protein_embeddings = load_gene_embeddings_adata(
adata=adata,
species=[species],
embedding_model=embedding_model
)
if hv_genes is not None:
sc.pp.highly_variable_genes(adata, flavor='seurat_v3', n_top_genes=hv_genes) # Expects Count Data
hv_index = adata.var["highly_variable"]
adata = adata[:, hv_index] # Subset to hv genes only
protein_embeddings = protein_embeddings[species][hv_index]
else:
protein_embeddings = protein_embeddings[species]
expression = data_to_torch_X(adata.X)
covar_vals = None
if len(labels) > 0:
assert covar_col is None or covar_col in labels, "Covar needs to be in labels" # make sure you keep track of covar column!
labels = adata.obs.loc[:, labels].values
if covar_col is not None:
# we have a categorical label to use as covariate
covar_vals = torch.tensor(pd.Categorical(adata.obs[covar_col]).codes)
return SincleCellDataset(
expression=expression,
protein_embeddings=protein_embeddings,
labels=labels,
covar_vals=covar_vals
), adata
def adata_path_to_prot_chrom_starts(adata, dataset_species, spec_pe_genes, gene_to_chrom_pos, offset):
"""
Given a :path: to an h5ad,
"""
pe_row_idxs = torch.tensor([spec_pe_genes.index(k.upper()) + offset for k in adata.var_names]).long()
print(len(np.unique(pe_row_idxs)))
spec_chrom = gene_to_chrom_pos[gene_to_chrom_pos["species"] == dataset_species].set_index("gene_symbol")
gene_chrom = spec_chrom.loc[[k.upper() for k in adata.var_names]]
    dataset_chroms = gene_chrom["spec_chrom"].cat.codes # now this is correctly indexed by species and chromosome
print("Max Code:", max(dataset_chroms))
dataset_pos = gene_chrom["start"].values
return pe_row_idxs, dataset_chroms, dataset_pos
def process_raw_anndata(row, h5_folder_path, npz_folder_path, scp, skip,
additional_filter, root):
path = row.path
if not os.path.isfile(root + "/" + path):
print( "**********************************")
print(f"***********{root + '/' + path} File Missing****")
print( "**********************************")
print(path, root)
        return None, None, None
name = path.replace(".h5ad", "")
proc_path = path.replace(".h5ad", "_proc.h5ad")
if skip:
if os.path.isfile(h5_folder_path + proc_path):
print(f"{name} already processed. Skipping")
return None, None, None
print(f"Proccessing {name}")
species = row.species
covar_col = row.covar_col
ad = sc.read(root + "/" + path)
labels = []
if "cell_type" in ad.obs.columns:
labels.append("cell_type")
    if pd.isna(covar_col):  # covar_col may be NaN (missing) or a column-name string
covar_col = None
else:
labels.append(covar_col)
if additional_filter:
sc.pp.filter_genes(ad, min_cells=10)
sc.pp.filter_cells(ad, min_genes=25)
dataset, adata = anndata_to_sc_dataset(ad, species=species, labels=labels, covar_col=covar_col, hv_genes=None)
adata = adata.copy()
if additional_filter:
sc.pp.filter_genes(ad, min_cells=10)
sc.pp.filter_cells(ad, min_genes=25)
num_cells = adata.X.shape[0]
num_genes = adata.X.shape[1]
adata_path = h5_folder_path + proc_path
adata.write(adata_path)
arr = data_to_torch_X(adata.X).numpy()
print(arr.max()) # this is a nice check to make sure it's counts
filename = npz_folder_path + f"{name}_counts.npz"
shape = arr.shape
print(name, shape)
fp = np.memmap(filename, dtype='int64', mode='w+', shape=shape)
fp[:] = arr[:]
fp.flush()
if scp != "":
subprocess.call(["scp", filename, f"{scp}:{filename}"])
subprocess.call(["scp", adata_path, f"{scp}:{adata_path}"])
return adata, num_cells, num_genes
def get_species_to_pe(EMBEDDING_DIR):
"""
Given an embedding directory, return all embeddings as a dictionary coded by species.
    Note: as currently written, the directory must contain embeddings for every species listed below.
"""
EMBEDDING_DIR = Path(EMBEDDING_DIR)
embeddings_paths = {
'human': EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt',
'mouse': EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt',
'frog': EMBEDDING_DIR / 'Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM2.pt',
'zebrafish': EMBEDDING_DIR / 'Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM2.pt',
"mouse_lemur": EMBEDDING_DIR / "Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM2.pt",
"pig": EMBEDDING_DIR / 'Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM2.pt',
"macaca_fascicularis": EMBEDDING_DIR / 'Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt',
"macaca_mulatta": EMBEDDING_DIR / 'Macaca_mulatta.Mmul_10.gene_symbol_to_embedding_ESM2.pt',
}
extra_species = pd.read_csv("./model_files/new_species_protein_embeddings.csv").set_index("species").to_dict()["path"]
embeddings_paths.update(extra_species) # adds new species
species_to_pe = {
species:torch.load(pe_dir) for species, pe_dir in embeddings_paths.items()
}
species_to_pe = {species:{k.upper(): v for k,v in pe.items()} for species, pe in species_to_pe.items()}
return species_to_pe
def get_spec_chrom_csv(path="/dfs/project/cross-species/yanay/code/all_to_chrom_pos.csv"):
"""
Get the species to chrom csv file
"""
gene_to_chrom_pos = pd.read_csv(path)
gene_to_chrom_pos["spec_chrom"] = pd.Categorical(gene_to_chrom_pos["species"] + "_" + gene_to_chrom_pos["chromosome"]) # add the spec_chrom list
return gene_to_chrom_pos
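# Illustrative usage sketch (an assumption, not part of the original module): how the helpers
# above are typically combined to map an AnnData's genes to protein-embedding token rows,
# chromosome codes and genomic start positions. All paths are placeholders, and the AnnData is
# assumed to already be subset to genes with protein embeddings (e.g. via load_gene_embeddings_adata).
if __name__ == "__main__":
    species = "human"
    species_to_pe = get_species_to_pe("./model_files/protein_embeddings")
    gene_to_chrom_pos = get_spec_chrom_csv("./model_files/species_chrom.csv")  # placeholder path
    adata = sc.read("example.h5ad")  # placeholder; counts with gene symbols in .var_names
    spec_pe_genes = list(species_to_pe[species].keys())
    pe_row_idxs, dataset_chroms, dataset_pos = adata_path_to_prot_chrom_starts(
        adata, species, spec_pe_genes, gene_to_chrom_pos,
        offset=4,  # normally looked up from the saved species->offset pickle; 4 = number of special tokens
    )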
================================================
FILE: data_proc/download_proc_czi_cxg.py
================================================
import os
os.environ["OMP_NUM_THREADS"] = "20" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "20" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "20" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "20" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "20"
import warnings
warnings.filterwarnings('ignore')
import cellxgene_census
from tqdm import tqdm
import scanpy as sc
from collections import defaultdict
from typing import Dict, List, Optional, Tuple
import torch
import torch.utils.data as data
import numpy as np
from numpy import array
import pickle as pkl
import glob
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
import sys
sys.path.append('../')
from gene_embeddings import load_gene_embeddings_adata
import pandas as pd
from scanpy import AnnData
from multiprocessing import Pool, Process, Manager
import multiprocessing.pool as mpp
# https://stackoverflow.com/questions/57354700/starmap-combined-with-tqdm
def istarmap(self, func, iterable, chunksize=1):
"""starmap-version of imap
"""
if self._state != mpp.RUN:
raise ValueError("Pool not running")
if chunksize < 1:
raise ValueError(
"Chunksize must be 1+, not {0:n}".format(
chunksize))
task_batches = mpp.Pool._get_tasks(func, iterable, chunksize)
result = mpp.IMapIterator(self._cache)
self._taskqueue.put(
(
self._guarded_task_generation(result._job,
mpp.starmapstar,
task_batches),
result._set_length
))
return (item for chunk in result for item in chunk)
mpp.Pool.istarmap = istarmap
VERSION = "2023-04-25"
N_TOP_GENES = 12000
print(cellxgene_census.get_census_version_description(VERSION))
census = cellxgene_census.open_soma(census_version=VERSION)
census_datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# for convenience, indexing on the soma_joinid which links this to other census data.
census_datasets = census_datasets.set_index("soma_joinid")
species_to_readable = {
"Homo sapiens":"human",
"Mus musculus":"mouse"
}
def process_row(row, num_genes, num_cells, paths, all_species, covar_cols, dataset_title, h5_root="/dfs/project/uce/cxg_data/anndatas/", npz_root="/dfs/project/uce/cxg_data/npzs/"):
dataset_id = row[1].dataset_id
#dataset_title = row[1].dataset_title.lower().replace(' ', '_').replace(",", "").replace("/", "")
save_path = h5_root + f"{dataset_title}.h5ad"
no_primary_path = save_path.replace(".h5ad", "_no_primary.h5ad")
proc_path = save_path.replace(".h5ad", "_proc.h5ad")
npz_path = npz_root + f"{dataset_title}_counts.npz"
# Download the anndata
if os.path.exists(no_primary_path):
print("No Primary, skipping")
return
if not os.path.exists(save_path) and not os.path.exists(no_primary_path):
cellxgene_census.download_source_h5ad(
dataset_id, to_path=save_path
)
if os.path.exists(proc_path) and os.path.exists(npz_path):
print("Already Proc")
try:
ad = sc.read(proc_path)
except:
print()
print()
print("Error reading on:", dataset_title)
print()
print()
return
# Get organism
if "organism" in ad.obs.columns:
unique_organisms = list(ad.obs.organism.unique().categories)
unique_organism_str = ", ".join(unique_organisms)
else:
unique_organism_str = "human"
species = species_to_readable.get(unique_organism_str, "human")
# don't need to do hv if already proc
if "sample" in ad.obs.columns:
covar_cols[dataset_title] = "sample"
elif "batch" in ad.obs.columns:
covar_cols[dataset_title] = "batch"
else:
covar_cols[dataset_title] = ""
num_genes[dataset_title] = ad.X.shape[1]
num_cells[dataset_title] = ad.X.shape[0]
paths[dataset_title] = f"{dataset_title}.h5ad"
all_species[dataset_title] = species
return # Skip everything else
# Read the raw AD
ad = sc.read(save_path)
# Change to counts
if not sc._utils.check_nonnegative_integers(ad.X):
# don't have counts yet, need raw
if ad.raw is None:
print("Skipped, no counts")
return
ad.X = ad.raw.X.toarray()
if not sc._utils.check_nonnegative_integers(ad.X):
print("Skipped, no counts")
return
# SUBSET TO primary data
if len(np.unique(ad.obs["is_primary_data"])) >= 1:
primary_data = ad.obs.is_primary_data.value_counts()
ad = ad[ad.obs.is_primary_data]
if ad.X.shape[0] == 0:
print("no primary data")
print(primary_data)
os.rename(save_path, no_primary_path)
return # No primary data
print("has primary data")
# Switch to gene symbols
ad.var["feature_id_orig"] = list(ad.var.index)
ad.var_names = list(ad.var.feature_name)
# Get organism
if "organism" in ad.obs.columns:
unique_organisms = list(ad.obs.organism.unique().categories)
unique_organism_str = ", ".join(unique_organisms)
else:
unique_organism_str = "human"
species = species_to_readable.get(unique_organism_str, "human")
# Filter to gene symbols with protein embeddings
ad, _ = load_gene_embeddings_adata(
adata=ad,
species=[species],
embedding_model="ESM2"
)
ad = ad.copy()
# Simple filtering by counts
sc.pp.filter_cells(ad, min_genes=200)
sc.pp.filter_genes(ad, min_cells=10)
#print(ad)
if "sample" in ad.obs.columns:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, batch_key="sample")
except:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, batch_key="sample", span=1)
except:
print(f"can't hv gene subset {dataset_title}")
covar_cols[dataset_title] = "sample"
elif "batch" in ad.obs.columns:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, batch_key="batch")
except:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, batch_key="batch", span=1)
except:
print(f"can't hv gene subset {dataset_title}")
covar_cols[dataset_title] = "batch"
else:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True)
except:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, span=1)
except:
print(f"can't hv gene subset {dataset_title}")
covar_cols[dataset_title] = ""
num_genes[dataset_title] = ad.X.shape[1]
num_cells[dataset_title] = ad.X.shape[0]
paths[dataset_title] = f"{dataset_title}.h5ad"
all_species[dataset_title] = species
print("writing proc")
ad.write(proc_path)
arr = data_to_torch_X(ad.X).numpy()
shape = arr.shape
fp = np.memmap(npz_path, dtype='int64', mode='w+', shape=shape)
fp[:] = arr[:]
fp.flush()
return
if __name__ == '__main__':
'''
manager = Manager()
num_genes = manager.dict()
num_cells = manager.dict()
paths = manager.dict()
all_species = manager.dict()
covar_cols = manager.dict()
'''
num_genes = {}
num_cells = {}
paths = {}
all_species = {}
covar_cols = {}
df = pd.DataFrame()
# Shuffle the dataset
census_datasets = census_datasets#.iloc[270:]
iterrows = list(census_datasets.iterrows())
#p = Pool(8)
#for row in tqdm(iterrows, total=len(census_datasets)):
# p.apply_async(process_row, args=(row, num_genes, num_cells, paths, all_species, covar_cols))
#p.close()
#p.join()
'''
with Pool(1) as p:
nrows = len(iterrows)
inputs = zip(iterrows, [num_genes]*nrows, [num_cells]*nrows, [paths]*nrows, [all_species]*nrows, [covar_cols]*nrows)
for _ in tqdm(p.istarmap(process_row, inputs),
total=nrows):
pass
'''
if os.path.exists("dataset_rows_mouse_fixed.pkl"):
dataset_rows = {}
for path in glob.glob("dataset_rows_mouse_fixed*.pkl"):
with open(path, "rb") as f:
dataset_rows_path = pkl.load(f)
dataset_rows.update(dataset_rows_path)
print(f"{len(dataset_rows)} already counted")
else:
dataset_rows = {}
pbar = tqdm(iterrows)
all_errors = []
total_number_of_cells = 0
duplicate_titles = ['Dissection: Body of hippocampus (HiB) - Rostral DG-CA4', 'Retina',
'Colon', 'Myeloid cells', 'Ileum', 'Airway']
duplicate_titles_2 = ['retina', 'airway', 'myeloid_cells', 'colon', 'ileum', 'immune_cells']
for row in pbar:
dataset_title = row[1].dataset_title
if dataset_title in duplicate_titles:
dataset_title = row[1].collection_name + row[1].dataset_title
dataset_title = dataset_title.lower().replace(' ', '_').replace(",", "").replace("/", "")
if dataset_title in duplicate_titles_2:
dataset_title = (row[1].collection_name + "_" + dataset_title).lower().replace(' ', '_').replace(",", "").replace("/", "")
print(f"{total_number_of_cells} cells done")
if dataset_title in dataset_rows:
paths[dataset_title] = dataset_rows[dataset_title][0]
all_species[dataset_title] = dataset_rows[dataset_title][1]
covar_cols[dataset_title] = dataset_rows[dataset_title][2]
num_cells[dataset_title] = dataset_rows[dataset_title][3]
num_genes[dataset_title] = dataset_rows[dataset_title][4]
#print("skipped read of proc")
total_number_of_cells += dataset_rows[dataset_title][3]
continue # Skip!
else:
pbar.set_description(f"{dataset_title} proc")
try:
process_row(row, num_genes, num_cells, paths, all_species, covar_cols, dataset_title=dataset_title)
except:
print(f"****{dataset_title} ERROR****")
all_errors.append(dataset_title)
pbar.set_description(f"{dataset_title} done")
if dataset_title in paths:
dataset_rows[dataset_title] = [paths[dataset_title], all_species[dataset_title], covar_cols[dataset_title], num_cells[dataset_title], num_genes[dataset_title], dataset_title]
total_number_of_cells += dataset_rows[dataset_title][3]
with open("dataset_rows_mouse_fixed.pkl", "wb") as f:
pkl.dump(dataset_rows, f)
print("wrote pkl")
# path,species,covar_col,num_cells,names
df["path"] = list(paths.values())
df["species"] = list(all_species.values())
df["covar_col"] = list(covar_cols.values())
df["num_cells"] = list(num_cells.values())
df["num_genes"] = list(num_genes.values())
df["names"] = list(paths.keys())
print(df.head(20))
print()
print("Errors:")
print(all_errors)
df.to_csv("cxg_datasets.csv", index=False)
================================================
FILE: data_proc/gene_embeddings.py
================================================
"""Helper functions for loading pretrained gene embeddings."""
from pathlib import Path
from typing import Dict, Tuple
import torch
from scanpy import AnnData
import numpy as np
import pandas as pd
EMBEDDING_DIR = Path('model_files/protein_embeddings')
MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH = {
'ESM2': {
'human': EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt',
'mouse': EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt',
'frog': EMBEDDING_DIR / 'Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM2.pt',
'zebrafish': EMBEDDING_DIR / 'Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM2.pt',
"mouse_lemur": EMBEDDING_DIR / "Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM2.pt",
"pig": EMBEDDING_DIR / 'Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM2.pt',
"macaca_fascicularis": EMBEDDING_DIR / 'Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt',
"macaca_mulatta": EMBEDDING_DIR / 'Macaca_mulatta.Mmul_10.gene_symbol_to_embedding_ESM2.pt',
}
}
extra_species = pd.read_csv("./model_files/new_species_protein_embeddings.csv").set_index("species").to_dict()["path"]
MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH["ESM2"].update(extra_species) # adds new species
def load_gene_embeddings_adata(adata: AnnData, species: list, embedding_model: str) -> Tuple[AnnData, Dict[str, torch.FloatTensor]]:
"""Loads gene embeddings for all the species/genes in the provided data.
    :param adata: An AnnData object containing gene expression data for cells.
:param species: Species corresponding to this adata
:param embedding_model: The gene embedding model whose embeddings will be loaded.
:return: A tuple containing:
- A subset of the data only containing the gene expression for genes with embeddings in all species.
- A dictionary mapping species name to the corresponding gene embedding matrix (num_genes, embedding_dim).
"""
# Get species names
species_names = species
species_names_set = set(species_names)
# Get embedding paths for the model
species_to_gene_embedding_path = MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH[embedding_model]
available_species = set(species_to_gene_embedding_path)
# Ensure embeddings are available for all species
if not (species_names_set <= available_species):
raise ValueError(f'The following species do not have gene embeddings: {species_names_set - available_species}')
# Load gene embeddings for desired species (and convert gene symbols to lower case)
species_to_gene_symbol_to_embedding = {
species: {
gene_symbol.lower(): gene_embedding
for gene_symbol, gene_embedding in torch.load(species_to_gene_embedding_path[species]).items()
}
for species in species_names
}
# Determine which genes to include based on gene expression and embedding availability
genes_with_embeddings = set.intersection(*[
set(gene_symbol_to_embedding)
for gene_symbol_to_embedding in species_to_gene_symbol_to_embedding.values()
])
genes_to_use = {gene for gene in adata.var_names if gene.lower() in genes_with_embeddings}
# Subset data to only use genes with embeddings
adata = adata[:, adata.var_names.isin(genes_to_use)]
# Set up dictionary mapping species to gene embedding matrix (num_genes, embedding_dim)
species_to_gene_embeddings = {
species_name: torch.stack([
species_to_gene_symbol_to_embedding[species_name][gene_symbol.lower()]
for gene_symbol in adata.var_names
])
for species_name in species_names
}
return adata, species_to_gene_embeddings
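# Illustrative usage sketch (an assumption, not part of the original module): subset an AnnData
# to the genes that have ESM2 protein embeddings and fetch the matching embedding matrix.
# The h5ad path is a placeholder and is assumed to hold counts with gene symbols in .var_names.
if __name__ == "__main__":
    import scanpy as sc

    adata = sc.read_h5ad("example_human.h5ad")  # placeholder input file
    adata, species_to_embeddings = load_gene_embeddings_adata(
        adata=adata,
        species=["human"],
        embedding_model="ESM2",
    )
    # species_to_embeddings["human"] is a (num_kept_genes, embedding_dim) tensor whose rows
    # are aligned with adata.var_names after subsetting.
    print(adata.shape, species_to_embeddings["human"].shape)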
================================================
FILE: data_proc/generate_reduced_chrom_files.py
================================================
import os
os.environ["OMP_NUM_THREADS"] = "4" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "4" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "4" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "4" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "4"
import warnings
warnings.filterwarnings("ignore")
import scanpy as sc
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pickle
import os
import argparse
import logging
import time
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import pandas as pd
#sc._settings.ScanpyConfig.n_jobs = 6
import math
from typing import Tuple
import torch
from torch import nn, Tensor
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset
from accelerate import Accelerator
import anndata
from data_utils import adata_path_to_prot_chrom_starts, get_spec_chrom_csv
from torch.utils.data import dataset
from torch.utils.data import DataLoader, TensorDataset
from scipy.stats import binom
def padding_tensor(sequences):
"""
:param sequences: list of tensors
    :return: padded tensor of shape (max_len, num, 1280) and a (num, max_len) mask (1 for valid positions, -inf for padding)
"""
num = len(sequences)
max_len = max([s.size(0) for s in sequences])
out_dims = (num, max_len, 1280)
out_tensor = sequences[0].data.new(*out_dims).fill_(0)
out_dims2 = (num, max_len)
mask = sequences[0].data.new(*out_dims2).fill_(float('-inf'))
for i, tensor in enumerate(sequences):
length = tensor.size(0)
out_tensor[i, :length] = tensor
mask[i, :length] = 1
return out_tensor.permute(1, 0, 2), mask
from pathlib import Path
# ESM1b
'''
EMBEDDING_DIR = Path('/dfs/project/cross-species/data/proteome/embeddings')
human_pe_dir = EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt'
mouse_pe_dir = EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt'
lemur_pe_dir = Path("/dfs/project/cross-species/yanay/data/proteome/embeddings/") / 'Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM1b.pt'
'''
# Upgrade to ESM2
EMBEDDING_DIR = Path('/dfs/project/cross-species/data/proteome/embeddings')
EMBEDDING_DIR = Path('/dfs/project/cross-species/yanay/data/proteome/embeddings')
embeddings_paths = {
'human': EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt',
'mouse': EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt',
'frog': EMBEDDING_DIR / 'Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM2.pt',
'zebrafish': EMBEDDING_DIR / 'Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM2.pt',
"mouse_lemur": EMBEDDING_DIR / "Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM2.pt",
"pig": EMBEDDING_DIR / 'Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM2.pt',
"macaca_fascicularis": EMBEDDING_DIR / 'Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt',
"macaca_mulatta": EMBEDDING_DIR / 'Macaca_mulatta.Mmul_10.gene_symbol_to_embedding_ESM2.pt',
}
species_to_pe = {
species:torch.load(pe_dir) for species, pe_dir in embeddings_paths.items()
}
species_to_pe = {species:{k.upper(): v for k,v in pe.items()} for species, pe in species_to_pe.items()}
#species_to_keys = {species:list(pe.keys()) for species, pe in species_to_pe.items()}
#species_to_keys = {species:dict(zip(keys, np.arange(len(keys)))) for species, keys in species_to_keys.items()}
#datasets_df = pd.read_csv("/dfs/project/cross-species/yanay/code/UCE/data_proc/full_train_datasets.csv")
datasets_df = pd.read_csv("tissue_datasets.csv")
datasets_df = pd.read_csv("perturb_datasets.csv")
datasets_df = pd.read_csv("../new_perturb_datasets.csv")
#pd.concat((#pd.read_csv("new_datasets.csv"),
#pd.read_csv("pbmcs_nohvg.csv"),
#pd.read_csv("lung_nohvg.csv"),
#pd.read_csv("new_tabula_datasets.csv"),
#pd.read_csv("updated_datasets.csv"),
# #pd.read_csv("sanger_heart_atlas_datasets.csv"),
# pd.read_csv("tissue_datasets.csv")
# ))
#datasets_df = pd.read_csv("cell_cycle_datasets.csv")
#datasets_df = pd.read_csv("spatial_datasets.csv")
#datasets_df = pd.read_csv("perturb_datasets.csv")
#datasets_df = pd.read_csv("ccle_datasets.csv")
#datasets_df = pd.read_csv("pancreas_datasets.csv")
sorted_dataset_names = sorted(datasets_df["names"])
with open("dataset_shapes.pkl", "rb") as f:
shapes_dict = pickle.load(f)
shapes_dict.update({
"madissoon_novel_lung":(190728, 8000),
'flores_cerebellum_human': (20232, 8000),
'osuch_gut_human': (272310, 8000),
'msk_ovarian_human': (929690, 8000),
'htan_vmuc_dis_epi_human': (65084, 8000),
'htan_vmuc_val_epi_human': (57564, 8000),
'htan_vmuc_non_epi_human': (9099, 8000),
'hao_pbmc_3p_human': (161764, 8000),
'hao_pbmc_5p_human': (49147, 8000),
'gao_tumors_human': (36111, 8000),
'swabrick_breast_human': (92427, 8000),
'wu_cryo_tumors_human': (105662, 8000),
'cell_line_het_human': (53513, 8000),
'bi_allen_metastasis_human': (27787, 8000),
'zheng68k_human': (68579, 8000),
'zheng68k_12k_human': (68579, 12000),
'mouse_embryo_ct': (153597, 12000),
"regev_gtex_heart": (36574, 8000),
"tabula_sapiens_heart": (11505, 8000),
"10k_pbmcs":(11990, 12000),
"epo_ido":(35834,12000),
'tabula_sapiens_kidney': (9641, 8000),
'tabula_microcebus_kidney': (14592, 8000),
'tabula_muris_kidney': (2781, 8000),
'tabula_muris_senis_kidney': (19610, 8000),
'immune_human': (33506, 8000)
})
for row in datasets_df.iterrows():
ngenes = row[1].num_genes
ncells = row[1].num_cells
name = row[1].names
if not np.isnan(ngenes):
shapes_dict[name] = (int(ncells), int(ngenes))
#with open("dataset_shapes.pkl", "wb") as f:
# pickle.dump(shapes_dict, f)
token_dim = 5120
mmap_dict = {}
root_dir = "/lfs/local/0/yanay/uce_h5s/"
root_dir_census = "/lfs/local/0/yanay/cxg_h5s/"
dataset_to_paths = {r[1]["names"]:root_dir + r[1]["path"].replace(".h5ad", "_proc.h5ad") for r in datasets_df.iterrows()}
for row in datasets_df.iterrows():
name = row[1].names
census = row[1].census
if census == "yes":
dataset_to_paths[name] = dataset_to_paths[name].replace(root_dir, root_dir_census)
datasets_to_species = {r[1]["names"]:r[1]["species"] for r in datasets_df.iterrows()}
#species_to_pe = {"mouse":mouse_pe, "human":human_pe, "mouse_lemur":lemur_pe}
#dataset_to_protein_embeddings_all = {k:species_to_pe[v] for k, v in datasets_to_species.items()}
dataset_to_protein_embeddings = {}
#dataset_to_protein_embeddings_all["madissoon_novel_lung"] = species_to_pe["human"]
datasets_to_species["madissoon_novel_lung"] = "human"
#dataset_to_paths["madissoon_novel_lung"] = "/lfs/local/0/yanay/uce_h5s/madissoon_novel_lung_proc.h5ad"
# New Chrom Based Code
gene_to_chrom_pos = get_spec_chrom_csv()
species_to_chrom_categories = {}
for species in np.unique(gene_to_chrom_pos["species"]):
species_to_chrom_categories[species] = pd.Categorical(gene_to_chrom_pos["chromosome"]).categories
dataset_to_chroms = {}
dataset_to_starts = {}
sorted_species_names = sorted(species_to_pe.keys())
print(sorted_species_names)
if os.path.exists(f"/dfs/project/uce/all_species_pe_tokens.torch"):
all_pe = torch.load(f"/dfs/project/uce/all_species_pe_tokens.torch")
with open("/dfs/project/uce/all_species_offsets.pkl", "rb") as f:
species_to_offsets = pickle.load(f)
print("Loaded PE", all_pe.shape)
else:
torch.manual_seed(8)
MASK_TENSOR = torch.zeros((1, token_dim)) # this is the padding token
CHROM_TENSOR_LEFT = torch.normal(mean=0, std=1, size=(1, token_dim))
CHROM_TENSOR_RIGHT = torch.normal(mean=0, std=1, size=(1, token_dim))
CLS_TENSOR = torch.normal(mean=0, std=1, size=(1, token_dim))
species_to_offsets = {}
all_pe = [MASK_TENSOR, CHROM_TENSOR_LEFT, CHROM_TENSOR_RIGHT, CLS_TENSOR]
offset = len(all_pe) # special tokens at the top!
for species in sorted_species_names:
pe_stacked = torch.stack(list(species_to_pe[species].values()))
all_pe.append(pe_stacked)
species_to_offsets[species] = offset
offset += pe_stacked.shape[0]
all_pe = torch.vstack(all_pe)
print(all_pe.shape)
torch.save(all_pe, f"/dfs/project/uce/all_species_pe_tokens.torch")
with open("/dfs/project/uce/all_species_offsets.pkl", "wb+") as f:
pickle.dump(species_to_offsets, f)
print("Saved PE")
# Load in already saved!
if os.path.exists(f"/lfs/local/0/yanay/reduced_datasets_to_pe_chrom_{token_dim}_new.torch"):
dataset_to_protein_embeddings = torch.load(f"/lfs/local/0/yanay/reduced_datasets_to_pe_chrom_{token_dim}_new.torch")
with open("/lfs/local/0/yanay/dataset_to_chroms_new.pkl", "rb") as f:
dataset_to_chroms = pickle.load(f)
with open("/lfs/local/0/yanay/dataset_to_starts_new.pkl", "rb") as f:
dataset_to_starts = pickle.load(f)
else:
dataset_to_protein_embeddings = {}
dataset_to_chroms = {}
dataset_to_starts = {}
# Add the new ones
print("creating reduced size protein embeddings file")
redo = True
for dataset, path in tqdm(list(dataset_to_paths.items())):
if dataset in dataset_to_protein_embeddings.keys() and not redo:
continue # skip since already procced
print(dataset)
adata = sc.read(path)
dataset_species = datasets_to_species[dataset]
spec_pe_genes = list(species_to_pe[dataset_species].keys())
offset = species_to_offsets[dataset_species]
# Get proper idxs
pe_row_idxs, dataset_chroms, dataset_pos = adata_path_to_prot_chrom_starts(adata, dataset_species, spec_pe_genes, gene_to_chrom_pos, offset)
# Add to dicts
dataset_to_chroms[dataset] = dataset_chroms
dataset_to_starts[dataset] = dataset_pos
dataset_to_protein_embeddings[dataset] = pe_row_idxs
del adata
# save Dicts and idxs
torch.save(dataset_to_protein_embeddings, f"/lfs/local/0/yanay/reduced_datasets_to_pe_chrom_{token_dim}_new.torch")
with open("/lfs/local/0/yanay/dataset_to_chroms_new.pkl", "wb+") as f:
pickle.dump(dataset_to_chroms, f)
with open("/lfs/local/0/yanay/dataset_to_starts_new.pkl", "wb+") as f:
pickle.dump(dataset_to_starts, f)
================================================
FILE: data_proc/preproc_many_dataset.py
================================================
import os
os.environ["OMP_NUM_THREADS"] = "10" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "10" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "10" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "10" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "10"
from collections import defaultdict
from typing import Dict, List, Optional, Tuple
import torch
import torch.utils.data as data
import numpy as np
import scanpy as sc
from numpy import array
import subprocess
import argparse
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
from gene_embeddings import load_gene_embeddings_adata
import pandas as pd
import numpy as np
from scanpy import AnnData
from data_utils import process_raw_anndata
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
class SincleCellDataset(data.Dataset):
def __init__(self,
expression: torch.tensor, # Subset to hv genes, count data! cells x genes
protein_embeddings: torch.tensor, # same order as expression, also subset genes x pe
labels: None, # optional, tensor of labels
covar_vals: None, # tensor of covar values or none
) -> None:
super(SincleCellDataset, self).__init__()
# Set expression
self.expression = expression
row_sums = self.expression.sum(1) # UMI Counts
log_norm_count_adj = torch.log1p(self.expression / (self.expression.sum(1)).unsqueeze(1) * torch.tensor(1000))
# Set log norm and count adjusted expression
max_vals, max_idx = torch.max(log_norm_count_adj, dim=0)
self.expression_mod = log_norm_count_adj / max_vals
        # Calculate dropout likelihoods of each gene
self.dropout_vec = (self.expression == 0).float().mean(0) # per gene dropout percentages
# Set data info
self.num_cells = self.expression.shape[0]
self.num_genes = self.expression.shape[1]
# Set optional label info, including categorical covariate index
self.covar_vals = covar_vals
self.labels = labels
# Set protein embeddings
self.protein_embeddings = protein_embeddings
self.item_mode = "expression"
if self.covar_vals is not None:
self.item_mode = "expression+covar"
def __getitem__(self, idx):
if self.item_mode == "expression":
if isinstance(idx, int):
if idx < self.num_cells:
return self.expression[idx, :]
else:
raise IndexError
else:
raise NotImplementedError
elif self.item_mode == "expression+covar":
if isinstance(idx, int):
if idx < self.num_cells:
return self.expression[idx, :], self.covar_vals[idx]
else:
raise IndexError
else:
raise NotImplementedError
def __len__(self) -> int:
return self.num_cells
def get_dim(self) -> Dict[str, int]:
return self.num_genes
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
def anndata_to_sc_dataset(adata:sc.AnnData,
species:str="human",
labels:list=[],
covar_col:str=None,
hv_genes:int=12000,
embedding_model="ESM1b",
) -> (SincleCellDataset, AnnData):
# Subset to just genes we have embeddings for
adata, protein_embeddings = load_gene_embeddings_adata(
adata=adata,
species=[species],
embedding_model=embedding_model
)
if DO_HVG:
sc.pp.highly_variable_genes(adata, flavor='seurat_v3', n_top_genes=hv_genes) # Expects Count Data
hv_index = adata.var["highly_variable"]
adata = adata[:, hv_index] # Subset to hv genes only
protein_embeddings = protein_embeddings[species][hv_index]
else:
protein_embeddings = protein_embeddings[species]
expression = data_to_torch_X(adata.X)
covar_vals = None
if len(labels) > 0:
assert covar_col is None or covar_col in labels, "Covar needs to be in labels" # make sure you keep track of covar column!
labels = adata.obs.loc[:, labels].values
if covar_col is not None:
# we have a categorical label to use as covariate
covar_vals = torch.tensor(pd.Categorical(adata.obs[covar_col]).codes)
return SincleCellDataset(
expression=expression,
protein_embeddings=protein_embeddings,
labels=labels,
covar_vals=covar_vals
), adata
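# Example (a sketch, not part of the original script): building a dataset from a raw-count
# AnnData, assuming a module-level DO_HVG flag has been set and `adata` holds counts.
#   DO_HVG = True
#   dataset, adata_hv = anndata_to_sc_dataset(adata, species="human", hv_genes=12000)
#   len(dataset)   # number of cells
#   dataset[0]     # count vector for the first cell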
def proc(args):
datasets_df = pd.read_csv(args.datasets_df)
datasets_df["covar_col"] = np.nan
skip = args.skip
additional_filter = args.filter
DO_HVG = args.DO_HVG
num_genes = {}
num_cells = {}
ir = list(datasets_df.iterrows())
    for i, row in tqdm(ir, total=len(datasets_df)):
        _, ncells, ngenes = process_raw_anndata(row, args.h5_folder_path,
                                                args.npz_folder_path, args.scp,
                                                skip, additional_filter,
                                                root=args.file_root_path)
        if (ncells is not None) and (ngenes is not None):
            num_cells[row.path] = ncells
            num_genes[row.path] = ngenes
if "num_cells" not in datasets_df.columns:
datasets_df["num_cells"] = 0
if "num_genes" not in datasets_df.columns:
datasets_df["num_genes"] = 0
for k in num_genes.keys():
ng = num_genes[k]
nc = num_cells[k]
datasets_df.loc[datasets_df["path"] == k, "num_cells"] = nc
datasets_df.loc[datasets_df["path"] == k, "num_genes"] = ng
# Write with the cells and genes info back to the original path
datasets_df.to_csv(args.datasets_df, index=False)
if __name__=="__main__":
# Parse command-line arguments
    parser = argparse.ArgumentParser(description='Preprocess h5ad datasets.')
# Define command-line arguments
parser.add_argument('--scp', type=str, default="", help='Name of a SNAP server to SCP the results to. It should have the same folders as the script is already saving to.')
parser.add_argument('--h5_folder_path', type=str, default="/lfs/local/0/yanay/uce_h5s/", help='Folder to save H5s to.')
parser.add_argument('--npz_folder_path', type=str, default="/lfs/local/0/yanay/uce_proc/", help='Folder to save NPZs to.')
parser.add_argument('--datasets_df', type=str, default="/dfs/project/uce/new_perturb_datasets.csv", help='Path to datasets csv. Will be overwritten to have the correct num cells and num genes for each dataset.')
parser.add_argument('--filter', type=bool, default=True, help='Should you do an additional gene/cell filtering? This can be a good step since even if you have already done it, subsetting to protein embeddings can make some cells sparser.')
parser.add_argument('--skip', type=bool, default=True, help='Should you skip datasets that appear to have already been created in the h5 folder?')
parser.add_argument('--DO_HVG', type=bool, default=False, help='Should a HVG subset be done.')
    # proc() expects a file_root_path argument; default to the current directory
    parser.add_argument('--file_root_path', type=str, default="./",
                        help='Root folder for the dataset paths listed in the datasets csv.')
    args = parser.parse_args()
    proc(args)
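# Example invocation (a sketch; the paths below are placeholders, not project defaults):
#   python preproc_many_dataset.py --datasets_df ./my_datasets.csv \
#       --h5_folder_path ./uce_h5s/ --npz_folder_path ./uce_proc/ \
#       --file_root_path ./raw_h5ads/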
================================================
FILE: eval_data.py
================================================
"""
Dataloaders
"""
import warnings
warnings.filterwarnings("ignore")
import sys
sys.path.append('../')
from typing import Dict, List, Optional, Tuple, Any
import torch
import numpy as np
import pickle
import torch.utils.data as data
class MultiDatasetSentences(data.Dataset):
def __init__(self, sorted_dataset_names, shapes_dict, args,
dataset_to_protein_embeddings_path= "/lfs/local/0/yanay/reduced_datasets_to_pe_chrom_5120_new.torch",
datasets_to_chroms_path="/lfs/local/0/yanay/dataset_to_chroms_new.pkl",
datasets_to_starts_path="/lfs/local/0/yanay/dataset_to_starts_new.pkl",
npzs_dir="/lfs/local/0/yanay/uce_proc/") -> None:
super(MultiDatasetSentences, self).__init__()
# self.xs = {}
self.num_cells = {}
self.num_genes = {}
self.shapes_dict = shapes_dict
self.args = args
self.total_num_cells = 0
for name in sorted_dataset_names:
num_cells, num_genes = self.shapes_dict[name]
# self.xs[name] = X
self.num_cells[name] = num_cells
self.num_genes[name] = num_genes
self.total_num_cells += num_cells
self.datasets = sorted_dataset_names
# TODO: preferably not hard-coded here
self.dataset_to_protein_embeddings = torch.load(dataset_to_protein_embeddings_path)
with open(datasets_to_chroms_path, "rb") as f:
self.dataset_to_chroms = pickle.load(f)
with open(datasets_to_starts_path, "rb") as f:
self.dataset_to_starts = pickle.load(f)
self.npzs_dir = npzs_dir
def __getitem__(self, idx):
if isinstance(idx, int):
for dataset in sorted(self.datasets):
if idx < self.num_cells[dataset]:
#cts = np.memmap(f"/lfs/local/0/yanay/cxg_npzs/" + f"{dataset}_counts.npz",
# dtype='int64', mode='r', shape=self.shapes_dict[dataset])
cts = np.memmap(self.npzs_dir + f"{dataset}_counts.npz", dtype='int64', mode='r', shape=self.shapes_dict[dataset])
counts = cts[idx]
counts = torch.tensor(counts).unsqueeze(0)
weights = torch.log1p(counts)
weights = (weights / torch.sum(weights))
batch_sentences, mask, seq_len, cell_sentences = \
sample_cell_sentences(counts, weights, dataset, self.args,
dataset_to_protein_embeddings= self.dataset_to_protein_embeddings,
dataset_to_chroms=self.dataset_to_chroms,
dataset_to_starts=self.dataset_to_starts)
return batch_sentences, mask, idx, seq_len, cell_sentences
else:
idx -= self.num_cells[dataset]
raise IndexError
else:
raise NotImplementedError
def __len__(self) -> int:
return self.total_num_cells
def get_dim(self) -> Dict[str, int]:
return self.num_genes
class MultiDatasetSentenceCollator(object):
def __init__(self, args):
self.pad_length = args.pad_length
def __call__(self, batch):
batch_size = len(batch)
batch_sentences = torch.zeros((batch_size, self.pad_length))
mask = torch.zeros((batch_size, self.pad_length))
cell_sentences = torch.zeros((batch_size, self.pad_length))
idxs = torch.zeros(batch_size)
i = 0
max_len = 0
for bs, msk, idx, seq_len, cs in batch:
batch_sentences[i, :] = bs
cell_sentences[i, :] = cs
max_len = max(max_len, seq_len)
mask[i, :] = msk
idxs[i] = idx
i += 1
return batch_sentences[:, :max_len] , mask[:, :max_len], idxs, cell_sentences
def sample_cell_sentences(counts, batch_weights, dataset, args,
dataset_to_protein_embeddings,
dataset_to_chroms,
dataset_to_starts):
dataset_idxs = dataset_to_protein_embeddings[dataset] # get the dataset specific protein embedding idxs
cell_sentences = torch.zeros((counts.shape[0], args.pad_length)) # init the cell representation as 0s
mask = torch.zeros((counts.shape[0], args.pad_length)) # start of masking the whole sequence
chroms = dataset_to_chroms[dataset] # get the dataset specific chroms for each gene
starts = dataset_to_starts[dataset] # get the dataset specific genomic start locations for each gene
longest_seq_len = 0 # we need to keep track of this so we can subset the batch at the end
for c, cell in enumerate(counts):
weights = batch_weights[c].numpy()
weights = weights / sum(weights) # RE NORM after mask
# randomly choose the genes that will make up the sample, weighted by expression, with replacement
choice_idx = np.random.choice(np.arange(len(weights)),
size=args.sample_size, p=weights,
replace=True)
choosen_chrom = chroms[choice_idx] # get the sampled genes chromosomes
# order the genes by chromosome
chrom_sort = np.argsort(choosen_chrom)
choice_idx = choice_idx[chrom_sort]
# sort the genes by start
new_chrom = chroms[choice_idx]
choosen_starts = starts[choice_idx]
ordered_choice_idx = np.full((args.pad_length),
args.cls_token_idx) # start with cls
        # position 0 already holds the CLS token, so the sequence content starts at position 1
        i = 1
# Shuffle the chroms now, there's no natural order to chromosomes
uq_chroms = np.unique(new_chrom)
np.random.shuffle(uq_chroms) # shuffle
        # Loop over the (shuffled) chromosomes sampled for this one cell
        for chrom in uq_chroms:
# Open Chrom token
ordered_choice_idx[i] = int(chrom) + args.CHROM_TOKEN_OFFSET # token of this chromosome # i = 1 next token is a chrom open
i += 1
# now sort the genes by start order within the chroms
loc = np.where(new_chrom == chrom)[0]
sort_by_start = np.argsort(
                choosen_starts[loc])  # start locations for this chromosome
to_add = choice_idx[loc[sort_by_start]]
ordered_choice_idx[i:(i + len(to_add))] = dataset_idxs[to_add]
i += len(to_add)
ordered_choice_idx[i] = args.chrom_token_right_idx # add the chrom sep again
i += 1 # add the closing token again
longest_seq_len = max(longest_seq_len, i)
remainder_len = (args.pad_length - i)
cell_mask = torch.concat((torch.ones(i),
# pay attention to all of these tokens, ignore the rest!
torch.zeros(remainder_len)))
mask[c, :] = cell_mask
ordered_choice_idx[i:] = args.pad_token_idx # the remainder of the sequence
cell_sentences[c, :] = torch.from_numpy(ordered_choice_idx)
cell_sentences_pe = cell_sentences.long() # token indices
return cell_sentences_pe, mask, longest_seq_len, cell_sentences
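# Illustrative layout of one padded cell sentence produced by sample_cell_sentences
# (a sketch based on the code above; token indices follow the defaults in eval_single_anndata.py):
#
#   [CLS] [chrom A open] [genes on A, sorted by start] [chrom close]
#         [chrom B open] [genes on B, sorted by start] [chrom close] ... [PAD ... PAD]
#
# Genes are sampled with replacement in proportion to log1p-normalized counts,
# grouped by chromosome (chromosome order is shuffled per cell), and the sequence
# is padded with pad_token_idx up to args.pad_length.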
================================================
FILE: eval_single_anndata.py
================================================
"""
Script for Evaluating a Single AnnData
Parameters:
----------
- `adata_path` (str):
Full path to the AnnData you want to embed.
- `dir` (str):
Working folder where all files will be saved.
- `species` (str):
Species of the AnnData.
- `filter` (bool):
Additional gene/cell filtering on the AnnData.
- `skip` (bool):
Skip datasets that appear to have already been created.
- `model_loc` (str):
Location of pretrained UCE model's weights in a `.torch` file.
- `batch_size` (int):
Batch size for processing.
- `CXG` (bool):
Use CXG model.
- `nlayers` (int):
Number of transformer layers.
- `output_dim` (int):
Desired output dimension.
- `d_hid` (int):
Hidden dimension for processing.
- `token_dim` (int):
Token dimension.
- `spec_chrom_csv_path` (str):
CSV file mapping genes from each species to their respective chromosomes
and genomic start positions.
- `token_file` (str):
`.torch` file containing token/protein embeddings for all tokens.
- `protein_embeddings_dir` (str):
Directory containing protein embedding `.pt` files for all species.
- `offset_pkl_path` (str):
`.pkl` file mapping between species and their gene's locations in the `token_file`.
- `pad_length` (int):
Length to pad the cell sentence to.
- `pad_token_idx` (int):
Index of the padding token in the `token_file`.
- `chrom_token_left_idx` (int):
Left chromosome token index
- `chrom_token_right_idx` (int):
Right chromosome token index
- `cls_token_idx` (int):
CLS token index in the `token_file`.
- `CHROM_TOKEN_OFFSET` (int):
Offset index, tokens after this mark are chromosome identifiers.
- `sample_size` (int):
Number of genes sampled for cell sentence.
- `multi_gpu` (bool):
Run evaluation on multiple GPUs (using accelerator)
Returns:
-------
- `dir/{dataset_name}_proc.h5ad`:
The processed AnnData. Processing involves subsetting it to genes which
have protein embeddings and then refiltering the dataset by minimum counts.
- `dir/{dataset_name}_chroms.pkl`:
File mapping the genes in the dataset to their corresponding chromosome
indices.
- `dir/{dataset_name}_counts.npz`:
File containing the counts of the AnnData in an easily accessible format.
- `dir/{dataset_name}_shapes_dict.pkl`:
File containing the shape (ncell x ngene) of the AnnData, used to read the
`.npz` file.
- `dir/{dataset_name}_pe_idx.torch`:
File mapping between the genes in the dataset and their index in the tokens file.
- `dir/{dataset_name}_starts.pkl`:
File mapping between the genes in the dataset and their genomic start locations.
"""
import argparse
from evaluate import AnndataProcessor
from accelerate import Accelerator
def main(args, accelerator):
processor = AnndataProcessor(args, accelerator)
processor.preprocess_anndata()
processor.generate_idxs()
processor.run_evaluation()
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description='Embed a single anndata using UCE.')
# Anndata Processing Arguments
parser.add_argument('--adata_path', type=str,
default=None,
help='Full path to the anndata you want to embed.')
parser.add_argument('--dir', type=str,
default="./",
help='Working folder where all files will be saved.')
parser.add_argument('--species', type=str, default="human",
help='Species of the anndata.')
parser.add_argument('--filter', type=bool, default=True,
help='Additional gene/cell filtering on the anndata.')
parser.add_argument('--skip', type=bool, default=True,
help='Skip datasets that appear to have already been created.')
# Model Arguments
parser.add_argument('--model_loc', type=str,
default=None,
help='Location of the model.')
parser.add_argument('--batch_size', type=int, default=25,
help='Batch size.')
    parser.add_argument('--pad_length', type=int, default=1536,
                        help='Padded length of the cell sentence.')
parser.add_argument("--pad_token_idx", type=int, default=0,
help="PAD token index")
parser.add_argument("--chrom_token_left_idx", type=int, default=1,
help="Chrom token left index")
parser.add_argument("--chrom_token_right_idx", type=int, default=2,
help="Chrom token right index")
parser.add_argument("--cls_token_idx", type=int, default=3,
help="CLS token index")
parser.add_argument("--CHROM_TOKEN_OFFSET", type=int, default=143574,
help="Offset index, tokens after this mark are chromosome identifiers")
parser.add_argument('--sample_size', type=int, default=1024,
help='Number of genes sampled for cell sentence')
parser.add_argument('--CXG', type=bool, default=True,
help='Use CXG model.')
parser.add_argument('--nlayers', type=int, default=4,
help='Number of transformer layers.')
parser.add_argument('--output_dim', type=int, default=1280,
help='Output dimension.')
parser.add_argument('--d_hid', type=int, default=5120,
help='Hidden dimension.')
parser.add_argument('--token_dim', type=int, default=5120,
help='Token dimension.')
parser.add_argument('--multi_gpu', type=bool, default=False,
help='Use multiple GPUs')
# Misc Arguments
parser.add_argument("--spec_chrom_csv_path",
default="./model_files/species_chrom.csv", type=str,
help="CSV Path for species genes to chromosomes and start locations.")
parser.add_argument("--token_file",
default="./model_files/all_tokens.torch", type=str,
help="Path for token embeddings.")
parser.add_argument("--protein_embeddings_dir",
default="./model_files/protein_embeddings/", type=str,
help="Directory where protein embedding .pt files are stored.")
parser.add_argument("--offset_pkl_path",
default="./model_files/species_offsets.pkl", type=str,
help="PKL file which contains offsets for each species.")
args = parser.parse_args()
accelerator = Accelerator(project_dir=args.dir)
main(args, accelerator)
================================================
FILE: evaluate.py
================================================
import os
# os.environ["NCCL_DEBUG"] = "INFO"
os.environ["OMP_NUM_THREADS"] = "12" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "12" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "12" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "12" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "12"
import warnings
warnings.filterwarnings("ignore")
import scanpy as sc
from tqdm.auto import tqdm
from torch import nn, Tensor
from model import TransformerModel
from eval_data import MultiDatasetSentences, MultiDatasetSentenceCollator
from utils import figshare_download
from torch.utils.data import DataLoader
from data_proc.data_utils import adata_path_to_prot_chrom_starts, \
get_spec_chrom_csv, process_raw_anndata, get_species_to_pe
import os
import pickle
import pandas as pd
import numpy as np
import torch
class AnndataProcessor:
def __init__(self, args, accelerator):
self.args = args
self.accelerator = accelerator
self.h5_folder_path = self.args.dir
self.npz_folder_path = self.args.dir
self.scp = ""
# Check if paths exist, if not, create them
self.check_paths()
# Set up the anndata
self.adata_name = self.args.adata_path.split("/")[-1]
self.adata_root_path = self.args.adata_path.replace(self.adata_name, "")
self.name = self.adata_name.replace(".h5ad", "")
self.proc_h5_path = self.h5_folder_path + f"{self.name}_proc.h5ad"
self.adata = None
# Set up the row
row = pd.Series()
row.path = self.adata_name
row.covar_col = np.nan
row.species = self.args.species
self.row = row
# Set paths once to be used throughout the class
self.pe_idx_path = self.args.dir + f"{self.name}_pe_idx.torch"
self.chroms_path = self.args.dir + f"{self.name}_chroms.pkl"
self.starts_path = self.args.dir + f"{self.name}_starts.pkl"
self.shapes_dict_path = self.args.dir + f"{self.name}_shapes_dict.pkl"
def check_paths(self):
"""
Check if the paths exist, if not, create them
"""
figshare_download("https://figshare.com/ndownloader/files/42706558",
self.args.spec_chrom_csv_path)
figshare_download("https://figshare.com/ndownloader/files/42706555",
self.args.offset_pkl_path)
if not os.path.exists(self.args.protein_embeddings_dir):
figshare_download("https://figshare.com/ndownloader/files/42715213",
'model_files/protein_embeddings.tar.gz')
figshare_download("https://figshare.com/ndownloader/files/42706585",
self.args.token_file)
if self.args.adata_path is None:
print("Using sample AnnData: 10k pbmcs dataset")
self.args.adata_path = "./data/10k_pbmcs_proc.h5ad"
figshare_download(
"https://figshare.com/ndownloader/files/42706966",
self.args.adata_path)
if self.args.model_loc is None:
print("Using sample 4 layer model")
self.args.model_loc = "./model_files/4layer_model.torch"
figshare_download(
"https://figshare.com/ndownloader/files/42706576",
self.args.model_loc)
def preprocess_anndata(self):
if self.accelerator.is_main_process:
self.adata, num_cells, num_genes = \
process_raw_anndata(self.row,
self.h5_folder_path,
self.npz_folder_path,
self.scp,
self.args.skip,
self.args.filter,
root=self.adata_root_path)
if (num_cells is not None) and (num_genes is not None):
self.save_shapes_dict(self.name, num_cells, num_genes,
self.shapes_dict_path)
if self.adata is None:
self.adata = sc.read(self.proc_h5_path)
def save_shapes_dict(self, name, num_cells, num_genes, shapes_dict_path):
shapes_dict = {name: (num_cells, num_genes)}
with open(shapes_dict_path, "wb+") as f:
pickle.dump(shapes_dict, f)
print("Wrote Shapes Dict")
def generate_idxs(self):
if self.accelerator.is_main_process:
if os.path.exists(self.pe_idx_path) and \
os.path.exists(self.chroms_path) and \
os.path.exists(self.starts_path):
print("PE Idx, Chrom and Starts files already created")
else:
species_to_pe = get_species_to_pe(self.args.protein_embeddings_dir)
with open(self.args.offset_pkl_path, "rb") as f:
species_to_offsets = pickle.load(f)
gene_to_chrom_pos = get_spec_chrom_csv(
self.args.spec_chrom_csv_path)
dataset_species = self.args.species
spec_pe_genes = list(species_to_pe[dataset_species].keys())
offset = species_to_offsets[dataset_species]
pe_row_idxs, dataset_chroms, dataset_pos = adata_path_to_prot_chrom_starts(
self.adata, dataset_species, spec_pe_genes, gene_to_chrom_pos, offset)
# Save to the temp dict
torch.save({self.name: pe_row_idxs}, self.pe_idx_path)
with open(self.chroms_path, "wb+") as f:
pickle.dump({self.name: dataset_chroms}, f)
with open(self.starts_path, "wb+") as f:
pickle.dump({self.name: dataset_pos}, f)
def run_evaluation(self):
self.accelerator.wait_for_everyone()
with open(self.shapes_dict_path, "rb") as f:
shapes_dict = pickle.load(f)
run_eval(self.adata, self.name, self.pe_idx_path, self.chroms_path,
self.starts_path, shapes_dict, self.accelerator, self.args)
def get_ESM2_embeddings(args):
# Load in ESM2 embeddings and special tokens
all_pe = torch.load(args.token_file)
if all_pe.shape[0] == 143574:
torch.manual_seed(23)
CHROM_TENSORS = torch.normal(mean=0, std=1, size=(1895, args.token_dim))
        # 1895 is the total number of chromosome tokens; it is hardcoded for now
all_pe = torch.vstack(
(all_pe, CHROM_TENSORS)) # Add the chrom tensors to the end
all_pe.requires_grad = False
return all_pe
def padding_tensor(sequences):
"""
:param sequences: list of tensors
:return:
"""
num = len(sequences)
max_len = max([s.size(0) for s in sequences])
out_dims = (num, max_len, 1280)
out_tensor = sequences[0].data.new(*out_dims).fill_(0)
out_dims2 = (num, max_len)
mask = sequences[0].data.new(*out_dims2).fill_(float('-inf'))
for i, tensor in enumerate(sequences):
length = tensor.size(0)
out_tensor[i, :length] = tensor
mask[i, :length] = 1
return out_tensor.permute(1, 0, 2), mask
def run_eval(adata, name, pe_idx_path, chroms_path, starts_path, shapes_dict,
accelerator, args):
#### Set up the model ####
token_dim = args.token_dim
emsize = 1280 # embedding dimension
d_hid = args.d_hid # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = args.nlayers # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 20 # number of heads in nn.MultiheadAttention
dropout = 0.05 # dropout probability
model = TransformerModel(token_dim=token_dim, d_model=emsize, nhead=nhead,
d_hid=d_hid,
nlayers=nlayers, dropout=dropout,
output_dim=args.output_dim)
if args.model_loc is None:
raise ValueError("Must provide a model location")
    # initialize the token embedding table as empty; the real embeddings are loaded below
empty_pe = torch.zeros(145469, 5120)
empty_pe.requires_grad = False
model.pe_embedding = nn.Embedding.from_pretrained(empty_pe)
model.load_state_dict(torch.load(args.model_loc, map_location="cpu"),
strict=True)
# Load in the real token embeddings
all_pe = get_ESM2_embeddings(args)
# This will make sure that you don't overwrite the tokens in case you're embedding species from the training data
# We avoid doing that just in case the random seeds are different across different versions.
if all_pe.shape[0] != 145469:
all_pe.requires_grad = False
model.pe_embedding = nn.Embedding.from_pretrained(all_pe)
print(f"Loaded model:\n{args.model_loc}")
model = model.eval()
model = accelerator.prepare(model)
batch_size = args.batch_size
#### Run the model ####
# Dataloaders
dataset = MultiDatasetSentences(sorted_dataset_names=[name],
shapes_dict=shapes_dict,
args=args, npzs_dir=args.dir,
dataset_to_protein_embeddings_path=pe_idx_path,
datasets_to_chroms_path=chroms_path,
datasets_to_starts_path=starts_path
)
multi_dataset_sentence_collator = MultiDatasetSentenceCollator(args)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False,
collate_fn=multi_dataset_sentence_collator,
num_workers=0)
dataloader = accelerator.prepare(dataloader)
pbar = tqdm(dataloader, disable=not accelerator.is_local_main_process)
dataset_embeds = []
with torch.no_grad():
for batch in pbar:
batch_sentences, mask, idxs = batch[0], batch[1], batch[2]
batch_sentences = batch_sentences.permute(1, 0)
if args.multi_gpu:
batch_sentences = model.module.pe_embedding(batch_sentences.long())
else:
batch_sentences = model.pe_embedding(batch_sentences.long())
batch_sentences = nn.functional.normalize(batch_sentences,
dim=2) # Normalize token outputs now
_, embedding = model.forward(batch_sentences, mask=mask)
# Fix for duplicates in last batch
accelerator.wait_for_everyone()
embeddings = accelerator.gather_for_metrics((embedding))
if accelerator.is_main_process:
dataset_embeds.append(embeddings.detach().cpu().numpy())
accelerator.wait_for_everyone()
if accelerator.is_main_process:
dataset_embeds = np.vstack(dataset_embeds)
adata.obsm["X_uce"] = dataset_embeds
write_path = args.dir + f"{name}_uce_adata.h5ad"
adata.write(write_path)
print("*****Wrote Anndata to:*****")
print(write_path)
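# Sketch (not part of the original script): reading the embeddings back after run_eval,
# assuming the default working directory "./" and the sample dataset name "10k_pbmcs_proc".
#   import scanpy as sc
#   adata = sc.read("./10k_pbmcs_proc_uce_adata.h5ad")
#   adata.obsm["X_uce"].shape  # (n_cells, args.output_dim) UCE cell embeddings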
================================================
FILE: examples/Benchmark Embeddings with scIB.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "6b258384-9a56-4ed0-be6f-db1c94711356",
"metadata": {},
"source": [
"# Large Scale Embedding benchmarks\n",
"\n",
"This notebook includes an example showing how to run large scale embedding benchmarks using scIB [(single-cell integration benchmark)](https://www.nature.com/articles/s41592-021-01336-8)\n",
"\n",
"We use the GPU accelerated version implemented here: https://github.com/YosefLab/scib-metrics\n",
"\n",
"Please follow installation instructions in that repo. \n",
"\n",
"*Note: installing Faiss can be difficult and may take some time*\n",
"\n",
"*Running the full benchmarking suite on many cells can take many hours, even on GPUs with large amounts of memory, such as A100s, and with many threads*"
]
},
{
"cell_type": "markdown",
"id": "ca4ba3a1-5c85-4c7b-8564-f8c5689e9345",
"metadata": {},
"source": [
"## Load Imports and define Benchmark Function"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b9d9fd58-915b-492d-9880-48c37e3859a8",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import scanpy as sc\n",
"\n",
"from scib_metrics.benchmark import Benchmarker\n",
"\n",
"import faiss\n",
"\n",
"from scib_metrics.nearest_neighbors import NeighborsResults\n",
"\n",
"# Faiss GPU accelerate nearest neighbors methods\n",
"def faiss_hnsw_nn(X: np.ndarray, k: int):\n",
" \"\"\"Gpu HNSW nearest neighbor search using faiss.\n",
"\n",
" See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md\n",
" for index param details.\n",
" \"\"\"\n",
" X = np.ascontiguousarray(X, dtype=np.float32)\n",
" res = faiss.StandardGpuResources()\n",
" M = 32\n",
" index = faiss.IndexHNSWFlat(X.shape[1], M, faiss.METRIC_L2)\n",
" gpu_index = faiss.index_cpu_to_gpu(res, 0, index)\n",
" gpu_index.add(X)\n",
" distances, indices = gpu_index.search(X, k)\n",
" del index\n",
" del gpu_index\n",
" # distances are squared\n",
" return NeighborsResults(indices=indices, distances=np.sqrt(distances))\n",
"\n",
"\n",
"def faiss_brute_force_nn(X: np.ndarray, k: int):\n",
" \"\"\"Gpu brute force nearest neighbor search using faiss.\"\"\"\n",
" X = np.ascontiguousarray(X, dtype=np.float32)\n",
" res = faiss.StandardGpuResources()\n",
" index = faiss.IndexFlatL2(X.shape[1])\n",
" gpu_index = faiss.index_cpu_to_gpu(res, 0, index)\n",
" gpu_index.add(X)\n",
" distances, indices = gpu_index.search(X, k)\n",
" del index\n",
" del gpu_index\n",
" # distances are squared\n",
" return NeighborsResults(indices=indices, distances=np.sqrt(distances))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4c5fb90f-ffa5-4cb9-bf6a-6afce956fc86",
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"from scib_metrics.benchmark import Benchmarker, BioConservation, BatchCorrection\n",
"import pandas as pd\n",
"\n",
"## Benchmarking Function, returns dataframe of scores\n",
"def benchmark(ad, label_key=\"cell_type\", batch_key=\"sample_id\", obsm_keys=[\"X_uce\", \"X_scGPT\", \"X_geneformer\"]):\n",
" print(f\"Running using CT key:\", label_key)\n",
" biocons = BioConservation()\n",
" batchcons = BatchCorrection(pcr_comparison=False)\n",
" \n",
" bm = Benchmarker(\n",
" ad,\n",
" batch_key=batch_key,\n",
" label_key=label_key,\n",
" embedding_obsm_keys=obsm_keys,\n",
" bio_conservation_metrics=biocons,\n",
" batch_correction_metrics=None,\n",
" n_jobs=48,\n",
" )\n",
" bm.prepare(neighbor_computer=faiss_brute_force_nn)\n",
" bm.benchmark()\n",
" df = bm.get_results(min_max_scale=False)\n",
" return df"
]
},
{
"cell_type": "markdown",
"id": "2f3bb257-21d4-41d5-9726-50b5e7af04b2",
"metadata": {},
"source": [
"### Load in anndata\n",
"\n",
"For this example, we will benchmark cells from developing mouse brain.\n",
"\n",
"You can download an anndata object with UCE, scGPT and Geneformer embeddings precalulated from [here](https://drive.google.com/drive/folders/1f63fh0ykgEhCrkd_EVvIootBw7LYDVI7)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "35392e93-6ffd-4df6-9609-f85ea6aad4ae",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 597668 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ad = sc.read(\"developing_mouse_brain.h5ad\", cache=True)\n",
"ad"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a4cb1a5e-1672-4ba7-b488-036de0e3ff61",
"metadata": {},
"outputs": [],
"source": [
"cell_type_column = \"supercluster\"\n",
"batch_column = \"donor_id\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "134f4e09-8e68-43fb-9d12-d87a1b5318c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"33"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ad.obs[cell_type_column].unique()) # Number of unique cell types"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ac956e69-9a66-4225-adb8-a01a2d6e23bf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"25"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ad.obs[batch_column].unique()) # Number of unique batches"
]
},
{
"cell_type": "markdown",
"id": "ee280476-4057-4051-b4f1-eb7ee0055e69",
"metadata": {},
"source": [
"# Running the Benchmark\n",
"\n",
"Running the benchmark on the full dataset can take a very long time. Instead, we can run on medium sized samples of cells."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0cae96b8-5be1-4ea5-a919-d16d2205d645",
"metadata": {},
"outputs": [],
"source": [
"sample_size = 100_000 # number of cells"
]
},
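  {
   "cell_type": "code",
   "execution_count": null,
   "id": "subsample-benchmark-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch (added for illustration): benchmark a random subsample of cells rather than the full dataset.\n",
    "# Assumes `ad`, `sample_size`, `cell_type_column`, `batch_column` and `benchmark` are defined above.\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "idx = rng.choice(ad.n_obs, size=min(sample_size, ad.n_obs), replace=False)\n",
    "df_scores = benchmark(ad[idx].copy(), label_key=cell_type_column, batch_key=batch_column)\n",
    "df_scores"
   ]
  },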
{
"cell_type": "code",
"execution_count": 8,
"id": "189ad01d-83c0-40e6-ab13-d16ed7eb0c88",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0d430c0038f84d33915a3d9b211d9608",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/10 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:02<00:04, 2.44s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:03<00:01, 1.61s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.50s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:57<08:33, 57.09s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:57<08:33, 57.09s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [01:08<04:01, 30.17s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [01:08<04:01, 30.17s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:51<04:11, 35.98s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:51<04:11, 35.98s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [01:52<02:12, 22.15s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:52<02:12, 22.15s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [02:17<01:56, 23.23s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [02:17<01:56, 23.23s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [02:17<01:01, 15.40s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [02:17<01:01, 15.40s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [03:39<01:51, 37.27s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [03:39<01:51, 37.27s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [03:40<00:50, 25.49s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [03:40<00:50, 25.49s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [03:42<07:25, 222.58s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:38, 17.58s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:38, 17.58s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:28, 11.01s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:28, 11.01s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:37, 13.94s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:37, 13.94s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [00:41<00:50, 8.48s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<00:50, 8.48s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:46<00:36, 7.39s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:46<00:36, 7.39s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:47<00:29, 7.39s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:39<00:50, 16.92s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:39<00:50, 16.92s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:39<00:24, 12.49s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:39<00:24, 12.49s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [05:23<02:30, 150.75s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:38, 17.57s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:38, 17.57s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:27, 10.95s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:27, 10.95s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:37, 13.95s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:37, 13.95s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:23, 13.95s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:42<00:32, 6.49s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:42<00:32, 6.49s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:43<00:25, 6.49s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:36<00:46, 15.63s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:36<00:46, 15.63s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:37<00:23, 11.92s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:37<00:23, 11.92s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 100%|\u001b[32m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\u001b[0m| 3/3 [07:00<00:00, 140.20s/it]\u001b[0m\u001b[A\n",
"\n",
" \u001b[A"
]
},
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 100000 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" varm: 'PCs'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:01<00:03, 1.97s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:03<00:01, 1.43s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.35s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:39<05:53, 39.31s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:39<05:53, 39.31s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:49<02:58, 22.36s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:49<02:58, 22.36s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:29<03:31, 30.17s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:29<03:31, 30.17s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [01:29<01:49, 18.30s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:29<01:49, 18.30s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [01:56<01:47, 21.46s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [01:56<01:47, 21.46s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [01:56<01:25, 21.46s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [02:56<01:17, 25.85s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [02:56<01:17, 25.85s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [02:56<00:38, 19.05s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [02:56<00:38, 19.05s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [02:58<05:56, 178.39s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:36, 17.40s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:36, 17.40s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:26, 10.83s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:26, 10.83s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:40<01:36, 13.77s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:40<01:36, 13.77s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [00:41<00:50, 8.38s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<00:50, 8.38s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:47<00:37, 7.54s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:47<00:37, 7.54s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:47<00:30, 7.54s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:47<00:57, 19.17s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:47<00:57, 19.17s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:48<00:28, 14.15s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:48<00:28, 14.15s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [04:47<02:17, 137.45s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:37, 17.50s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:37, 17.50s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:24<01:28, 11.07s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:28, 11.07s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:38, 14.04s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:38, 14.04s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [00:41<00:51, 8.54s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<00:51, 8.54s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:43<00:30, 6.03s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:43<00:30, 6.03s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:43<00:24, 6.03s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:40<00:52, 17.48s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:40<00:52, 17.48s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:40<00:25, 12.89s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:40<00:25, 12.89s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 100%|\u001b[32m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\u001b[0m| 3/3 [06:28<00:00, 129.48s/it]\u001b[0m\u001b[A\n",
"\n",
" \u001b[A"
]
},
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 100000 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" varm: 'PCs'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:01<00:03, 1.93s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:02<00:01, 1.40s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.32s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:39<05:53, 39.24s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:39<05:53, 39.24s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:50<03:03, 22.88s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:50<03:03, 22.88s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:30<03:32, 30.42s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:30<03:32, 30.42s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [01:30<01:50, 18.45s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:30<01:50, 18.45s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [01:56<01:46, 21.22s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [01:56<01:46, 21.22s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [01:56<01:24, 21.22s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [02:58<01:18, 26.20s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [02:58<01:18, 26.20s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [02:58<00:38, 19.30s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [02:58<00:38, 19.30s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [03:00<06:00, 180.00s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:36, 17.34s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:36, 17.34s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:26, 10.82s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:26, 10.82s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:36, 13.83s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:36, 13.83s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:22, 13.83s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:45<00:37, 7.40s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:45<00:37, 7.40s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:46<00:29, 7.40s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:45<00:52, 17.45s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:45<00:52, 17.45s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:45<00:26, 13.29s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:45<00:26, 13.29s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [04:46<02:16, 136.79s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:38, 17.58s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:38, 17.58s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:27, 10.88s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:27, 10.88s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:36, 13.84s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:36, 13.84s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:23, 13.84s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:42<00:31, 6.35s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:42<00:31, 6.35s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:42<00:25, 6.35s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:42<00:50, 16.92s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:42<00:50, 16.92s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:42<00:25, 12.89s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:42<00:25, 12.89s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 100%|\u001b[32m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\u001b[0m| 3/3 [06:29<00:00, 129.91s/it]\u001b[0m\u001b[A\n",
"\n",
" \u001b[A"
]
},
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 100000 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" varm: 'PCs'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:01<00:03, 1.97s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:02<00:01, 1.42s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.34s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:39<05:51, 39.11s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:39<05:51, 39.11s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:49<02:58, 22.37s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:49<02:58, 22.37s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:31<03:38, 31.18s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:31<03:38, 31.18s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:31<03:07, 31.18s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [01:58<01:45, 21.06s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [01:58<01:45, 21.06s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [01:58<01:24, 21.06s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [02:55<01:13, 24.48s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [02:55<01:13, 24.48s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [02:55<00:37, 18.63s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [02:55<00:37, 18.63s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [02:57<05:54, 177.02s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:37, 17.45s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:37, 17.45s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:24<01:28, 11.09s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:28, 11.09s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:37, 13.92s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:37, 13.92s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:23, 13.92s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:45<00:35, 7.16s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:45<00:35, 7.16s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:45<00:28, 7.16s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:39<00:47, 15.99s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:39<00:47, 15.99s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:39<00:24, 12.20s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:39<00:24, 12.20s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [04:36<02:11, 131.67s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:39, 17.76s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:39, 17.76s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:24<01:29, 11.22s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:29, 11.22s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:38, 14.11s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:38, 14.11s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [00:42<00:51, 8.58s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:42<00:51, 8.58s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:43<00:29, 5.98s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:43<00:29, 5.98s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:43<00:23, 5.98s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:37<00:50, 16.78s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:37<00:50, 16.78s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:38<00:24, 12.40s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:38<00:24, 12.40s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 100%|\u001b[32m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\u001b[0m| 3/3 [06:15<00:00, 125.25s/it]\u001b[0m\u001b[A\n",
"\n",
" \u001b[A"
]
},
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 100000 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" varm: 'PCs'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:02<00:04, 2.06s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:03<00:01, 1.54s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.48s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:40<06:01, 40.19s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:40<06:01, 40.19s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:50<03:02, 22.82s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:50<03:02, 22.82s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:30<03:33, 30.47s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:30<03:33, 30.47s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:30<03:02, 30.47s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [01:55<01:41, 20.24s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [01:55<01:41, 20.24s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [01:55<01:20, 20.24s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [02:54<01:13, 24.39s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [02:54<01:13, 24.39s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [02:54<00:37, 18.54s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [02:54<00:37, 18.54s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [02:55<05:51, 175.81s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:37, 17.52s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:37, 17.52s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:27, 10.89s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:27, 10.89s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:37, 13.93s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:37, 13.93s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:23, 13.93s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:46<00:36, 7.38s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:46<00:36, 7.38s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:46<00:29, 7.38s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:42<00:50, 16.73s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:42<00:50, 16.73s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:42<00:25, 12.75s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:42<00:25, 12.75s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [04:39<02:13, 133.22s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:38, 17.56s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:38, 17.56s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:24<01:31, 11.40s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:31, 11.40s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:42<01:39, 14.16s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:42<01:39, 14.16s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[3
================================================
SYMBOL INDEX (56 symbols across 10 files)
================================================
FILE: data_proc/data_utils.py
function data_to_torch_X (line 35) | def data_to_torch_X(X):
class SincleCellDataset (line 42) | class SincleCellDataset(data.Dataset):
method __init__ (line 43) | def __init__(self,
method __getitem__ (line 80) | def __getitem__(self, idx):
method __len__ (line 99) | def __len__(self) -> int:
method get_dim (line 102) | def get_dim(self) -> Dict[str, int]:
function data_to_torch_X (line 106) | def data_to_torch_X(X):
function anndata_to_sc_dataset (line 114) | def anndata_to_sc_dataset(adata:sc.AnnData,
function adata_path_to_prot_chrom_starts (line 155) | def adata_path_to_prot_chrom_starts(adata, dataset_species, spec_pe_gene...
function process_raw_anndata (line 173) | def process_raw_anndata(row, h5_folder_path, npz_folder_path, scp, skip,
function get_species_to_pe (line 241) | def get_species_to_pe(EMBEDDING_DIR):
function get_spec_chrom_csv (line 271) | def get_spec_chrom_csv(path="/dfs/project/cross-species/yanay/code/all_t...
FILE: data_proc/download_proc_czi_cxg.py
function data_to_torch_X (line 29) | def data_to_torch_X(X):
function istarmap (line 47) | def istarmap(self, func, iterable, chunksize=1):
function process_row (line 90) | def process_row(row, num_genes, num_cells, paths, all_species, covar_col...
FILE: data_proc/gene_embeddings.py
function load_gene_embeddings_adata (line 30) | def load_gene_embeddings_adata(adata: AnnData, species: list, embedding_...
FILE: data_proc/generate_reduced_chrom_files.py
function padding_tensor (line 53) | def padding_tensor(sequences):
FILE: data_proc/preproc_many_dataset.py
function data_to_torch_X (line 31) | def data_to_torch_X(X):
class SincleCellDataset (line 38) | class SincleCellDataset(data.Dataset):
method __init__ (line 39) | def __init__(self,
method __getitem__ (line 76) | def __getitem__(self, idx):
method __len__ (line 95) | def __len__(self) -> int:
method get_dim (line 98) | def get_dim(self) -> Dict[str, int]:
function data_to_torch_X (line 102) | def data_to_torch_X(X):
function anndata_to_sc_dataset (line 110) | def anndata_to_sc_dataset(adata:sc.AnnData,
function proc (line 151) | def proc(args):
FILE: eval_data.py
class MultiDatasetSentences (line 17) | class MultiDatasetSentences(data.Dataset):
method __init__ (line 18) | def __init__(self, sorted_dataset_names, shapes_dict, args,
method __getitem__ (line 50) | def __getitem__(self, idx):
method __len__ (line 73) | def __len__(self) -> int:
method get_dim (line 76) | def get_dim(self) -> Dict[str, int]:
class MultiDatasetSentenceCollator (line 80) | class MultiDatasetSentenceCollator(object):
method __init__ (line 81) | def __init__(self, args):
method __call__ (line 85) | def __call__(self, batch):
function sample_cell_sentences (line 108) | def sample_cell_sentences(counts, batch_weights, dataset, args,
FILE: eval_single_anndata.py
function main (line 81) | def main(args, accelerator):
FILE: evaluate.py
class AnndataProcessor (line 33) | class AnndataProcessor:
method __init__ (line 34) | def __init__(self, args, accelerator):
method check_paths (line 64) | def check_paths(self):
method preprocess_anndata (line 91) | def preprocess_anndata(self):
method save_shapes_dict (line 108) | def save_shapes_dict(self, name, num_cells, num_genes, shapes_dict_path):
method generate_idxs (line 114) | def generate_idxs(self):
method run_evaluation (line 141) | def run_evaluation(self):
function get_ESM2_embeddings (line 149) | def get_ESM2_embeddings(args):
function padding_tensor (line 163) | def padding_tensor(sequences):
function run_eval (line 183) | def run_eval(adata, name, pe_idx_path, chroms_path, starts_path, shapes_...
FILE: model.py
function full_block (line 18) | def full_block(in_features, out_features, p_drop=0.1):
class PositionalEncoding (line 27) | class PositionalEncoding(nn.Module):
method __init__ (line 29) | def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = ...
method forward (line 41) | def forward(self, x: Tensor) -> Tensor:
class TransformerModel (line 50) | class TransformerModel(nn.Module):
method __init__ (line 52) | def __init__(self, token_dim: int, d_model: int, nhead: int, d_hid: int,
method forward (line 92) | def forward(self, src: Tensor, mask: Tensor):
method predict (line 110) | def predict(self, cell_embedding, gene_embeddings):
FILE: utils.py
function get_shapes_dict (line 16) | def get_shapes_dict(dataset_path):
function figshare_download (line 72) | def figshare_download(url, save_path):
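The scIB benchmark traces in the notebook output above (from examples/Benchmark Embeddings with scIB.ipynb — Computing neighbors → Embeddings → Metrics, split into "Bio conservation" and "Batch correction") match the scib-metrics Benchmarker workflow, run over the embeddings stored in `obsm` (`X_uce`, `X_scVI`, `X_scGPT`, `X_geneformer`) with `supercluster` as the cell-type key. A minimal sketch of that setup, assuming `scib_metrics.benchmark.Benchmarker` is the tool in use; the input path and the batch key (`experiment`) are assumptions, not taken from the notebook:

```
import anndata as ad
from scib_metrics.benchmark import Benchmarker

# Hypothetical path: any AnnData with the pre-computed embeddings in .obsm will do.
adata = ad.read_h5ad("brain_100k_with_embeddings.h5ad")

bm = Benchmarker(
    adata,
    batch_key="experiment",        # assumed batch covariate
    label_key="supercluster",      # matches "Running using CT key: supercluster"
    embedding_obsm_keys=["X_uce", "X_scVI", "X_scGPT", "X_geneformer"],
    n_jobs=4,
)
bm.benchmark()                     # computes the bio-conservation and batch-correction metrics
results = bm.get_results(min_max_scale=False)
print(results)                     # one row of metric scores per embedding
```

`bm.plot_results_table()` renders the same scores as a formatted comparison table if a visual summary is preferred.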