Repository: snap-stanford/UCE
Branch: main
Commit: 8ead6e07af0c
Files: 17
Total size: 374.1 KB
Directory structure:
gitextract_vi_txbci/
├── LICENSE
├── README.md
├── data_proc/
│ ├── Create New Species Files.ipynb
│ ├── data_utils.py
│ ├── download_proc_czi_cxg.py
│ ├── gene_embeddings.py
│ ├── generate_reduced_chrom_files.py
│ └── preproc_many_dataset.py
├── eval_data.py
├── eval_single_anndata.py
├── evaluate.py
├── examples/
│ ├── Benchmark Embeddings with scIB.ipynb
│ └── Label Transfer Using Logistic Classifier.ipynb
├── model.py
├── model_files/
│ └── new_species_protein_embeddings.csv
├── requirements.txt
└── utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2023 Yanay Rosen, Yusuf Roohani, Jure Leskovec
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Universal Cell Embeddings
This repo includes a PyTorch [HuggingFace Accelerator](https://huggingface.co/docs/accelerate/package_reference/accelerator) implementation of the UCE model, to be used to embed individual anndata datasets.
## Installation
```
pip install -r requirements.txt
```
## Embedding a new dataset
To generate an embedding for a new single-cell RNA sequencing dataset in the AnnData format, use the `eval_single_anndata.py` script.
```
python eval_single_anndata.py --adata_path {path_to_anndata} --dir {output_dir} --species {species} --model_loc {model_loc} --batch_size {batch_size}
```
where
- `adata_path`: path to an h5ad file. The `.X` slot of the file should contain scRNA-seq counts. The `.var_names` slot should correspond to gene names, *not Ensembl IDs*.
- `dir`: the working directory where intermediate and final output files are saved; reusing it avoids repeated processing of the same dataset.
- `species`: the species of the dataset you are embedding.
- `model_loc`: the location of the model weights `.torch` file.
- `batch_size`: the per-GPU batch size. For the 33-layer model on an 80GB GPU, use 25. For the 4-layer model on the same GPU, you can use 100.
For a sample output on the 10k pbmc dataset, run
```
python eval_single_anndata.py
```
All necessary model files will be downloaded automatically.
**Note**: This script makes use of additional files, which are described in the code documentation. These are downloaded automatically unless already present in the working directory. The script defaults to the pretrained 4-layer model. For running the pretrained 33-layer model from the paper, please download using this [link](https://figshare.com/articles/dataset/Universal_Cell_Embedding_Model_Files/24320806?file=43423236) and set `--nlayers 33`.
## Output
Final evaluated AnnData: `dir/{dataset_name}.h5ad`. This AnnData will be
identical to the processed input AnnData, but with UCE embeddings added in the `.obsm["X_uce"]` slot.
Please see documentation for information on additional output files. All
outputs from `eval_single_anndata.py` are stored in the `dir` directory.
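For example, the embeddings can be loaded back with scanpy and used as a low-dimensional representation of the cells. The snippet below is a minimal sketch; the file name is a placeholder following the `dir/{dataset_name}.h5ad` pattern described above.
```
import scanpy as sc

adata = sc.read_h5ad("output_dir/10k_pbmcs_proc.h5ad")  # placeholder path
X_uce = adata.obsm["X_uce"]              # cells x UCE embedding dimensions
sc.pp.neighbors(adata, use_rep="X_uce")  # e.g. build a neighbor graph on UCE space
sc.tl.umap(adata)
```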
## Data
You can download processed datasets used in the paper [here](https://drive.google.com/drive/folders/1f63fh0ykgEhCrkd_EVvIootBw7LYDVI7?usp=drive_link).
**Note:** These datasets were embedded using the 33 layer model. Embeddings for the 33 layer model are not compatible with embeddings from the 4 layer model.
## Citing
If you find our paper and code useful, please consider citing the [preprint](https://www.biorxiv.org/content/10.1101/2023.11.28.568918v1):
```
@article{rosen2023universal,
title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
author={Rosen, Yanay and Roohani, Yusuf and Agrawal, Ayush and Samotorcan, Leon and Consortium, Tabula Sapiens and Quake, Stephen R and Leskovec, Jure},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
```
## Analyses
Please see the [reproduce repo](https://github.com/yhr91/uce_reproduce/tree/master) for analyses, figures, and datasets from the paper.
================================================
FILE: data_proc/Create New Species Files.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "0e4018ee",
"metadata": {},
"source": [
"# Embedding Novel Species\n",
"\n",
"This notebook will create the files you need to embed a novel species that wasn't included in the training data.\n",
"\n",
"To start, you will need to download the ESM2 protein embeddings and the reference proteome for the species.\n",
"\n",
"You can find precalculated ESM2 protein embeddings for many species [here](https://drive.google.com/drive/folders/1_Dz7HS5N3GoOAG6MdhsXWY1nwLoN13DJ?usp=drive_link)\n",
"\n",
"For reference proteomes, you can download them from [here](https://useast.ensembl.org/info/about/species.html).\n",
"\n",
"If there is no protein embedding for the species you are interested in, you can request to have it made via Github or email, or you can create it yourself following instructions [here](https://github.com/snap-stanford/SATURN/tree/main/protein_embeddings)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ab368d92",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pickle as pkl\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c9a306f3",
"metadata": {},
"outputs": [],
"source": [
"SPECIES_NAME = \"chicken\" # short hand name for this species, will be used in arguments and files\n",
"\n",
"# Path to the species proteome\n",
"SPECIES_PROTEIN_FASTA_PATH = \"../../../SATURN/protein_embeddings/data/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.pep.all.fa\"\n",
"\n",
"# Path to the ESM2 Embeddings\n",
"SPECIES_PROTEIN_EMBEDDINGS_PATH = \"../model_files/protein_embeddings/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.pep.all.gene_symbol_to_embedding_ESM2.pt\"\n",
"\n",
"# primary_assembly name, this needs to be matched to the FASTA file\n",
"ASSEMBLY_NAME = \"bGalGal1.mat.broiler.GRCg7b\"\n",
"# NCBI Taxonomy ID, please set this so that if someone else also embeds the same species,\n",
"# randomly generated chromosome tokens will be the same\n",
"TAXONOMY_ID = 9031"
]
},
{
"cell_type": "markdown",
"id": "e5d37e52",
"metadata": {},
"source": [
"You can view the FASTA format here, please confirm the primary_assembly name is correct."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2ecf1464",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">ENSGALP00010000002.1 pep primary_assembly:bGalGal1.mat.broiler.GRCg7b:MT:2824:3798:1 gene:ENSGALG00010000007.1 transcript:ENSGALT00010000007.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ND1 description:NADH dehydrogenase subunit 1 [Source:NCBI gene (formerly Entrezgene);Acc:63549479]\r\n",
"MTLPTLTNLLIMTLSYILPILIAVAFLTLVERKILSYMQARKGPNIVGPFGLLQPVADGV\r\n",
"KLFIKEPIRPSTSSPFLFIITPILALLLALTIWVPLPLPFPLADLNLGLLFLLAMSSLTV\r\n",
"YSLLWSGWASNSKYALIGALRAVAQTISYEVTLAIILLSTIMLSGNYTLSTLAITQEPIY\r\n",
"LIFSAWPLAMMWYISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFAMFFLAEYANIML\r\n",
"MNTLTTVLFLNPSFLNLPPELFPIALATKTLLLSSSFLWIRASYPRFRYDQLMHLLWKNF\r\n",
"LPLTLALCLWHTSMPISYAGLPPI\r\n",
">ENSGALP00010000003.1 pep primary_assembly:bGalGal1.mat.broiler.GRCg7b:MT:4015:5053:1 gene:ENSGALG00010000011.1 transcript:ENSGALT00010000011.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ND2 description:NADH dehydrogenase subunit 2 [Source:NCBI gene (formerly Entrezgene);Acc:63549482]\r\n",
"MNPHAKLICTVSLIMGTSITISSNHWILAWTGLEINTLAIIPLISKSHHPRAIEATIKYF\r\n",
"LTQSTASALILFSSMTNAWSTGQWDITQLNHPTSCLMLTMAIAIKLGLVPFHFWFPEVLQ\r\n"
]
}
],
"source": [
"!head {SPECIES_PROTEIN_FASTA_PATH}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "90540d0b",
"metadata": {},
"outputs": [],
"source": [
"species_to_paths = {\n",
" SPECIES_NAME: SPECIES_PROTEIN_FASTA_PATH,\n",
"}\n",
"\n",
"species_to_ids = {\n",
" SPECIES_NAME: ASSEMBLY_NAME,\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "623b99cf",
"metadata": {},
"outputs": [],
"source": [
"all_pos_def = []\n",
"\n",
"missing_genes = {}\n",
"for species in species_to_ids.keys():\n",
" missing_genes[species] = []\n",
" proteome_path = species_to_paths[species]\n",
" species_id = species_to_ids[species]\n",
"\n",
" with open(proteome_path) as f:\n",
" proteome_lines = f.readlines()\n",
"\n",
" gene_symbol_to_location = {}\n",
" gene_symbol_to_chrom = {}\n",
"\n",
" for line in proteome_lines:\n",
" if line.startswith(\">\"):\n",
" split_line = line.split()\n",
" gene_symbol = [token for token in split_line if token.startswith(\"gene_symbol\")]\n",
" if len(gene_symbol) > 0:\n",
" gene_symbol = gene_symbol[0].split(\":\")\n",
" \n",
" if len(gene_symbol) == 2:\n",
" gene_symbol = gene_symbol[1]\n",
" elif len(gene_symbol) > 2:\n",
" gene_symbol = \":\".join(gene_symbol[1:]) # fix for annoying zebrafish gene names with colons in them\n",
" else:\n",
" 1/0 # something weird happening, throw an error\n",
" \n",
" \n",
" chrom = None\n",
" \n",
" chrom_arr = [token for token in split_line if token.startswith(\"chromosome:\")]\n",
" if len(chrom_arr) > 0:\n",
" chrom = chrom_arr[0].replace(\"chromosome:\", \"\")\n",
" else:\n",
" chrom_arr = [token for token in split_line if token.startswith(\"primary_assembly:\")]\n",
" if len(chrom_arr) > 0:\n",
" chrom = chrom_arr[0].replace(\"primary_assembly:\", \"\")\n",
" else:\n",
" chrom_arr = [token for token in split_line if token.startswith(\"scaffold:\")] \n",
" if len(chrom_arr) > 0:\n",
" chrom = chrom_arr[0].replace(\"scaffold:\", \"\")\n",
" if chrom is not None:\n",
" gene_symbol_to_location[gene_symbol] = chrom.split(\":\")[2]\n",
" gene_symbol_to_chrom[gene_symbol] = chrom.split(\":\")[1]\n",
" else:\n",
" missing_genes[species].append(gene_symbol)\n",
" \n",
"\n",
" positional_df = pd.DataFrame()\n",
" positional_df[\"gene_symbol\"] = [gn.upper() for gn in list(gene_symbol_to_chrom.keys())]\n",
" positional_df[\"chromosome\"] = list(gene_symbol_to_chrom.values())\n",
" positional_df[\"start\"] = list(gene_symbol_to_location.values())\n",
" positional_df = positional_df.sort_values([\"chromosome\", \"start\"])\n",
" #positional_df = positional_df.set_index(\"gene_symbol\")\n",
" positional_df[\"species\"] = species\n",
" all_pos_def.append(positional_df)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b72887b3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>gene_symbol</th>\n",
" <th>chromosome</th>\n",
" <th>start</th>\n",
" <th>species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2327</th>\n",
" <td>GCC1</td>\n",
" <td>1</td>\n",
" <td>1006145</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2502</th>\n",
" <td>NCAM2</td>\n",
" <td>1</td>\n",
" <td>100828671</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3084</th>\n",
" <td>ENS-2</td>\n",
" <td>1</td>\n",
" <td>101147482</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2331</th>\n",
" <td>DENND6B</td>\n",
" <td>1</td>\n",
" <td>1012031</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3973</th>\n",
" <td>MRPL39</td>\n",
" <td>1</td>\n",
" <td>102578362</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4722</th>\n",
" <td>CA9</td>\n",
" <td>Z</td>\n",
" <td>9779343</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4738</th>\n",
" <td>ARHGEF39</td>\n",
" <td>Z</td>\n",
" <td>9835547</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3885</th>\n",
" <td>MRPL17</td>\n",
" <td>Z</td>\n",
" <td>9850679</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4172</th>\n",
" <td>CCBE1</td>\n",
" <td>Z</td>\n",
" <td>9852827</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3293</th>\n",
" <td>PMAIP1</td>\n",
" <td>Z</td>\n",
" <td>9998272</td>\n",
" <td>chicken</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>13271 rows × 4 columns</p>\n",
"</div>"
],
"text/plain": [
" gene_symbol chromosome start species\n",
"2327 GCC1 1 1006145 chicken\n",
"2502 NCAM2 1 100828671 chicken\n",
"3084 ENS-2 1 101147482 chicken\n",
"2331 DENND6B 1 1012031 chicken\n",
"3973 MRPL39 1 102578362 chicken\n",
"... ... ... ... ...\n",
"4722 CA9 Z 9779343 chicken\n",
"4738 ARHGEF39 Z 9835547 chicken\n",
"3885 MRPL17 Z 9850679 chicken\n",
"4172 CCBE1 Z 9852827 chicken\n",
"3293 PMAIP1 Z 9998272 chicken\n",
"\n",
"[13271 rows x 4 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"master_pos_def = pd.concat(all_pos_def)\n",
"master_pos_def"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6d9dac28",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"chicken 13271\n",
"Name: species, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"master_pos_def[\"species\"].value_counts() # double check how many genes are mapped"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4a3d45c2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"chicken: 0\n"
]
}
],
"source": [
"for k, v in missing_genes.items():\n",
" print(f\"{k}: {len(v)}\") # are any genes missing?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c59774b1",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*********\n",
"chicken\n"
]
},
{
"data": {
"text/plain": [
"1 1785\n",
"2 1169\n",
"3 1067\n",
"4 953\n",
"5 817\n",
"Z 629\n",
"6 458\n",
"8 450\n",
"7 442\n",
"9 382\n",
"10 366\n",
"14 359\n",
"11 327\n",
"15 326\n",
"13 306\n",
"20 298\n",
"12 293\n",
"19 278\n",
"18 274\n",
"17 260\n",
"26 237\n",
"28 237\n",
"27 235\n",
"21 226\n",
"23 214\n",
"25 176\n",
"34 155\n",
"24 149\n",
"22 142\n",
"16 54\n",
"30 52\n",
"38 49\n",
"31 14\n",
"MT 13\n",
"39 10\n",
"JAENSK010000484.1 7\n",
"35 6\n",
"JAENSK010000592.1 6\n",
"W 5\n",
"MU179278.1 5\n",
"MU179279.1 4\n",
"36 3\n",
"JAENSK010000483.1 3\n",
"JAENSK010000585.1 3\n",
"JAENSK010000593.1 2\n",
"MU179258.1 2\n",
"MU179272.1 2\n",
"MU179273.1 2\n",
"JAENSK010000584.1 2\n",
"JAENSK010000656.1 1\n",
"Name: chromosome, dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"*********\n"
]
}
],
"source": [
"# Count genes per chromosome\n",
"for species in species_to_ids.keys():\n",
" print(\"*********\")\n",
" print(species)\n",
" display(master_pos_def[master_pos_def[\"species\"] == species][\"chromosome\"].value_counts().head(50))\n",
" print(\"*********\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "541baded",
"metadata": {},
"outputs": [],
"source": [
"master_pos_def.to_csv(f\"{SPECIES_NAME}_to_chrom_pos.csv\", index=False) # Save the DF"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "eabd0e31",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"chicken_to_chrom_pos.csv\n"
]
}
],
"source": [
"# The chromosome file path will be:\n",
"print(f\"{SPECIES_NAME}_to_chrom_pos.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "fe1345b1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"66"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"N_UNIQ_CHROM = len(master_pos_def[master_pos_def[\"species\"] == species][\"chromosome\"].unique())\n",
"N_UNIQ_CHROM"
]
},
{
"cell_type": "markdown",
"id": "e37e277f",
"metadata": {},
"source": [
"# Generate token file"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d6904975",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import pickle\n",
"token_dim = 5120"
]
},
{
"cell_type": "markdown",
"id": "a2798848",
"metadata": {},
"source": [
"This will create the token file. Please note the offset value."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4355dabd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CHROM_TOKEN_OFFSET: 13275\n",
"Saved PE, offsets file\n"
]
}
],
"source": [
"species_to_offsets = {}\n",
"\n",
"all_pe = torch.load(\"../model_files/all_tokens.torch\")[0:4] # read in existing token file to make sure \n",
"# that special vocab tokens are the same for different seeds\n",
"\n",
"offset = len(all_pe) # special tokens at the top!\n",
"\n",
"PE = torch.load(SPECIES_PROTEIN_EMBEDDINGS_PATH)\n",
"\n",
"pe_stacked = torch.stack(list(PE.values()))\n",
"all_pe = torch.vstack((all_pe, pe_stacked))\n",
"species_to_offsets[species] = offset\n",
"\n",
"print(\"CHROM_TOKEN_OFFSET:\", all_pe.shape[0])\n",
"torch.manual_seed(TAXONOMY_ID)\n",
"CHROM_TENSORS = torch.normal(mean=0, std=1, size=(N_UNIQ_CHROM, 5120)) \n",
"# N_UNIQ_CHROM is the total number of chromosome choices, it is hardcoded for now (for species in the training data)\n",
"all_pe = torch.vstack(\n",
" (all_pe, CHROM_TENSORS)) # Add the chrom tensors to the end\n",
"all_pe.requires_grad = False\n",
"\n",
"\n",
"torch.save(all_pe, f\"{SPECIES_NAME}_pe_tokens.torch\")\n",
"\n",
"with open(f\"{SPECIES_NAME}_offsets.pkl\", \"wb+\") as f:\n",
" pickle.dump(species_to_offsets, f)\n",
"print(\"Saved PE, offsets file\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c26fe491",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([13341, 5120])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_pe.shape"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "21f937ea",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([13341, 5120])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_pe.shape"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "5faadace",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"chicken_offsets.pkl\n"
]
}
],
"source": [
"print(f\"{SPECIES_NAME}_offsets.pkl\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "6ceac20b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'../model_files/protein_embeddings/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.pep.all.gene_symbol_to_embedding_ESM2.pt'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SPECIES_PROTEIN_EMBEDDINGS_PATH"
]
},
{
"cell_type": "markdown",
"id": "e4697330",
"metadata": {},
"source": [
"# Example evaluation of new species"
]
},
{
"cell_type": "markdown",
"id": "2b72667d",
"metadata": {},
"source": [
"**Note: when you evaluate a new species, you need to change some arguments and modify some files:**\n",
"\n",
"You will need to modify the csv in `model_files/new_species_protein_embeddings.csv` to include the new protein embeddings file you downloaded.\n",
"\n",
"In the file add a row for the new species with the format:\n",
"`species name,full path to protein embedding file`\n",
"\n",
"Please also add this line to the dictionary created on line 247 in the file `data_proc/data_utils.py`.\n",
"\n",
"When you want to embed this new species, you will need to specify these newly created files as arguments.\n",
"- `CHROM_TOKEN_OFFSET`: This tells UCE when the rows corresponding to chromosome tokens starts.\n",
"- `spec_chrom_csv_path`: This is a new csv, created by this script, which maps genes to chromosomes and genomic positions\n",
"- `token_file`: This is a new token file that will work just for this species. The embeddings generated will still be universal though!\n",
"- `offset_pkl_path`: This is another file that maps genes to tokens\n",
"\n",
"\n",
"```\n",
"\n",
"accelerate launch eval_single_anndata.py chicken_heart.h5ad --species=chicken --CHROM_TOKEN_OFFSET=13275 --spec_chrom_csv_path=data_proc/chicken_to_chrom_pos.csv --token_file=data_proc/chicken_pe_tokens.torch --offset_pkl_path=data_proc/chicken_offsets.pkl --dir=... --multi_gpu=True\n",
"\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: data_proc/data_utils.py
================================================
import warnings
warnings.filterwarnings("ignore")
import scanpy as sc
import torch
from torch import nn, Tensor
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim
import numpy as np
import pickle
import os
import subprocess
import argparse
import logging
import time
from tqdm.auto import tqdm
import pandas as pd
import math
import anndata
from pathlib import Path
from torch.utils.data import dataset
from torch.utils.data import DataLoader, TensorDataset, dataset
from scipy.stats import binom
from typing import Dict, List, Optional, Tuple
from scanpy import AnnData
from data_proc.gene_embeddings import load_gene_embeddings_adata
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
class SincleCellDataset(data.Dataset):
def __init__(self,
expression: torch.tensor, # Subset to hv genes, count data! cells x genes
protein_embeddings: torch.tensor, # same order as expression, also subset genes x pe
labels: None, # optional, tensor of labels
covar_vals: None, # tensor of covar values or none
) -> None:
super(SincleCellDataset, self).__init__()
# Set expression
self.expression = expression
row_sums = self.expression.sum(1) # UMI Counts
log_norm_count_adj = torch.log1p(self.expression / (self.expression.sum(1)).unsqueeze(1) * torch.tensor(1000))
# Set log norm and count adjusted expression
max_vals, max_idx = torch.max(log_norm_count_adj, dim=0)
self.expression_mod = log_norm_count_adj / max_vals
        # Calculate dropout likelihoods of each gene
self.dropout_vec = (self.expression == 0).float().mean(0) # per gene dropout percentages
# Set data info
self.num_cells = self.expression.shape[0]
self.num_genes = self.expression.shape[1]
# Set optional label info, including categorical covariate index
self.covar_vals = covar_vals
self.labels = labels
# Set protein embeddings
self.protein_embeddings = protein_embeddings
self.item_mode = "expression"
if self.covar_vals is not None:
self.item_mode = "expression+covar"
def __getitem__(self, idx):
if self.item_mode == "expression":
if isinstance(idx, int):
if idx < self.num_cells:
return self.expression[idx, :]
else:
raise IndexError
else:
raise NotImplementedError
elif self.item_mode == "expression+covar":
if isinstance(idx, int):
if idx < self.num_cells:
return self.expression[idx, :], self.covar_vals[idx]
else:
raise IndexError
else:
raise NotImplementedError
def __len__(self) -> int:
return self.num_cells
def get_dim(self) -> Dict[str, int]:
return self.num_genes
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
def anndata_to_sc_dataset(adata:sc.AnnData,
species:str="human",
labels:list=[],
covar_col:str=None,
hv_genes=None,
embedding_model="ESM2",
) -> (SincleCellDataset, AnnData):
# Subset to just genes we have embeddings for
adata, protein_embeddings = load_gene_embeddings_adata(
adata=adata,
species=[species],
embedding_model=embedding_model
)
if hv_genes is not None:
sc.pp.highly_variable_genes(adata, flavor='seurat_v3', n_top_genes=hv_genes) # Expects Count Data
hv_index = adata.var["highly_variable"]
adata = adata[:, hv_index] # Subset to hv genes only
protein_embeddings = protein_embeddings[species][hv_index]
else:
protein_embeddings = protein_embeddings[species]
expression = data_to_torch_X(adata.X)
covar_vals = None
if len(labels) > 0:
assert covar_col is None or covar_col in labels, "Covar needs to be in labels" # make sure you keep track of covar column!
labels = adata.obs.loc[:, labels].values
if covar_col is not None:
# we have a categorical label to use as covariate
covar_vals = torch.tensor(pd.Categorical(adata.obs[covar_col]).codes)
return SincleCellDataset(
expression=expression,
protein_embeddings=protein_embeddings,
labels=labels,
covar_vals=covar_vals
), adata
def adata_path_to_prot_chrom_starts(adata, dataset_species, spec_pe_genes, gene_to_chrom_pos, offset):
"""
Given a :path: to an h5ad,
"""
pe_row_idxs = torch.tensor([spec_pe_genes.index(k.upper()) + offset for k in adata.var_names]).long()
print(len(np.unique(pe_row_idxs)))
spec_chrom = gene_to_chrom_pos[gene_to_chrom_pos["species"] == dataset_species].set_index("gene_symbol")
gene_chrom = spec_chrom.loc[[k.upper() for k in adata.var_names]]
    dataset_chroms = gene_chrom["spec_chrom"].cat.codes # now this is correctly indexed by species and chromosome
print("Max Code:", max(dataset_chroms))
dataset_pos = gene_chrom["start"].values
return pe_row_idxs, dataset_chroms, dataset_pos
def process_raw_anndata(row, h5_folder_path, npz_folder_path, scp, skip,
additional_filter, root):
path = row.path
if not os.path.isfile(root + "/" + path):
print( "**********************************")
print(f"***********{root + '/' + path} File Missing****")
print( "**********************************")
print(path, root)
        return None, None, None
name = path.replace(".h5ad", "")
proc_path = path.replace(".h5ad", "_proc.h5ad")
if skip:
if os.path.isfile(h5_folder_path + proc_path):
print(f"{name} already processed. Skipping")
return None, None, None
print(f"Proccessing {name}")
species = row.species
covar_col = row.covar_col
ad = sc.read(root + "/" + path)
labels = []
if "cell_type" in ad.obs.columns:
labels.append("cell_type")
    if pd.isna(covar_col):  # covar_col may be NaN (missing) or a column-name string
covar_col = None
else:
labels.append(covar_col)
if additional_filter:
sc.pp.filter_genes(ad, min_cells=10)
sc.pp.filter_cells(ad, min_genes=25)
dataset, adata = anndata_to_sc_dataset(ad, species=species, labels=labels, covar_col=covar_col, hv_genes=None)
adata = adata.copy()
if additional_filter:
sc.pp.filter_genes(ad, min_cells=10)
sc.pp.filter_cells(ad, min_genes=25)
num_cells = adata.X.shape[0]
num_genes = adata.X.shape[1]
adata_path = h5_folder_path + proc_path
adata.write(adata_path)
arr = data_to_torch_X(adata.X).numpy()
print(arr.max()) # this is a nice check to make sure it's counts
filename = npz_folder_path + f"{name}_counts.npz"
shape = arr.shape
print(name, shape)
fp = np.memmap(filename, dtype='int64', mode='w+', shape=shape)
fp[:] = arr[:]
fp.flush()
if scp != "":
subprocess.call(["scp", filename, f"{scp}:{filename}"])
subprocess.call(["scp", adata_path, f"{scp}:{adata_path}"])
return adata, num_cells, num_genes
def get_species_to_pe(EMBEDDING_DIR):
"""
Given an embedding directory, return all embeddings as a dictionary coded by species.
    Note: as currently written, the directory must contain embeddings for every species listed below.
"""
EMBEDDING_DIR = Path(EMBEDDING_DIR)
embeddings_paths = {
'human': EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt',
'mouse': EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt',
'frog': EMBEDDING_DIR / 'Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM2.pt',
'zebrafish': EMBEDDING_DIR / 'Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM2.pt',
"mouse_lemur": EMBEDDING_DIR / "Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM2.pt",
"pig": EMBEDDING_DIR / 'Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM2.pt',
"macaca_fascicularis": EMBEDDING_DIR / 'Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt',
"macaca_mulatta": EMBEDDING_DIR / 'Macaca_mulatta.Mmul_10.gene_symbol_to_embedding_ESM2.pt',
}
extra_species = pd.read_csv("./model_files/new_species_protein_embeddings.csv").set_index("species").to_dict()["path"]
embeddings_paths.update(extra_species) # adds new species
species_to_pe = {
species:torch.load(pe_dir) for species, pe_dir in embeddings_paths.items()
}
species_to_pe = {species:{k.upper(): v for k,v in pe.items()} for species, pe in species_to_pe.items()}
return species_to_pe
def get_spec_chrom_csv(path="/dfs/project/cross-species/yanay/code/all_to_chrom_pos.csv"):
"""
Get the species to chrom csv file
"""
gene_to_chrom_pos = pd.read_csv(path)
gene_to_chrom_pos["spec_chrom"] = pd.Categorical(gene_to_chrom_pos["species"] + "_" + gene_to_chrom_pos["chromosome"]) # add the spec_chrom list
return gene_to_chrom_pos
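# Illustrative usage sketch (an assumption, not part of the original module): how the helpers
# above are typically combined to map an AnnData's genes to protein-embedding token rows,
# chromosome codes and genomic start positions. All paths are placeholders, and the AnnData is
# assumed to already be subset to genes with protein embeddings (e.g. via load_gene_embeddings_adata).
if __name__ == "__main__":
    species = "human"
    species_to_pe = get_species_to_pe("./model_files/protein_embeddings")
    gene_to_chrom_pos = get_spec_chrom_csv("./model_files/species_chrom.csv")  # placeholder path
    adata = sc.read("example.h5ad")  # placeholder; counts with gene symbols in .var_names
    spec_pe_genes = list(species_to_pe[species].keys())
    pe_row_idxs, dataset_chroms, dataset_pos = adata_path_to_prot_chrom_starts(
        adata, species, spec_pe_genes, gene_to_chrom_pos,
        offset=4,  # normally looked up from the saved species->offset pickle; 4 = number of special tokens
    )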
================================================
FILE: data_proc/download_proc_czi_cxg.py
================================================
import os
os.environ["OMP_NUM_THREADS"] = "20" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "20" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "20" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "20" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "20"
import warnings
warnings.filterwarnings('ignore')
import cellxgene_census
from tqdm import tqdm
import scanpy as sc
from collections import defaultdict
from typing import Dict, List, Optional, Tuple
import torch
import torch.utils.data as data
import numpy as np
from numpy import array
import pickle as pkl
import glob
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
import sys
sys.path.append('../')
from gene_embeddings import load_gene_embeddings_adata
import pandas as pd
from scanpy import AnnData
from multiprocessing import Pool, Process, Manager
import multiprocessing.pool as mpp
# https://stackoverflow.com/questions/57354700/starmap-combined-with-tqdm
def istarmap(self, func, iterable, chunksize=1):
"""starmap-version of imap
"""
if self._state != mpp.RUN:
raise ValueError("Pool not running")
if chunksize < 1:
raise ValueError(
"Chunksize must be 1+, not {0:n}".format(
chunksize))
task_batches = mpp.Pool._get_tasks(func, iterable, chunksize)
result = mpp.IMapIterator(self._cache)
self._taskqueue.put(
(
self._guarded_task_generation(result._job,
mpp.starmapstar,
task_batches),
result._set_length
))
return (item for chunk in result for item in chunk)
mpp.Pool.istarmap = istarmap
VERSION = "2023-04-25"
N_TOP_GENES = 12000
print(cellxgene_census.get_census_version_description(VERSION))
census = cellxgene_census.open_soma(census_version=VERSION)
census_datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# for convenience, indexing on the soma_joinid which links this to other census data.
census_datasets = census_datasets.set_index("soma_joinid")
species_to_readable = {
"Homo sapiens":"human",
"Mus musculus":"mouse"
}
def process_row(row, num_genes, num_cells, paths, all_species, covar_cols, dataset_title, h5_root="/dfs/project/uce/cxg_data/anndatas/", npz_root="/dfs/project/uce/cxg_data/npzs/"):
dataset_id = row[1].dataset_id
#dataset_title = row[1].dataset_title.lower().replace(' ', '_').replace(",", "").replace("/", "")
save_path = h5_root + f"{dataset_title}.h5ad"
no_primary_path = save_path.replace(".h5ad", "_no_primary.h5ad")
proc_path = save_path.replace(".h5ad", "_proc.h5ad")
npz_path = npz_root + f"{dataset_title}_counts.npz"
# Download the anndata
if os.path.exists(no_primary_path):
print("No Primary, skipping")
return
if not os.path.exists(save_path) and not os.path.exists(no_primary_path):
cellxgene_census.download_source_h5ad(
dataset_id, to_path=save_path
)
if os.path.exists(proc_path) and os.path.exists(npz_path):
print("Already Proc")
try:
ad = sc.read(proc_path)
except:
print()
print()
print("Error reading on:", dataset_title)
print()
print()
return
# Get organism
if "organism" in ad.obs.columns:
unique_organisms = list(ad.obs.organism.unique().categories)
unique_organism_str = ", ".join(unique_organisms)
else:
unique_organism_str = "human"
species = species_to_readable.get(unique_organism_str, "human")
# don't need to do hv if already proc
if "sample" in ad.obs.columns:
covar_cols[dataset_title] = "sample"
elif "batch" in ad.obs.columns:
covar_cols[dataset_title] = "batch"
else:
covar_cols[dataset_title] = ""
num_genes[dataset_title] = ad.X.shape[1]
num_cells[dataset_title] = ad.X.shape[0]
paths[dataset_title] = f"{dataset_title}.h5ad"
all_species[dataset_title] = species
return # Skip everything else
# Read the raw AD
ad = sc.read(save_path)
# Change to counts
if not sc._utils.check_nonnegative_integers(ad.X):
# don't have counts yet, need raw
if ad.raw is None:
print("Skipped, no counts")
return
ad.X = ad.raw.X.toarray()
if not sc._utils.check_nonnegative_integers(ad.X):
print("Skipped, no counts")
return
# SUBSET TO primary data
if len(np.unique(ad.obs["is_primary_data"])) >= 1:
primary_data = ad.obs.is_primary_data.value_counts()
ad = ad[ad.obs.is_primary_data]
if ad.X.shape[0] == 0:
print("no primary data")
print(primary_data)
os.rename(save_path, no_primary_path)
return # No primary data
print("has primary data")
# Switch to gene symbols
ad.var["feature_id_orig"] = list(ad.var.index)
ad.var_names = list(ad.var.feature_name)
# Get organism
if "organism" in ad.obs.columns:
unique_organisms = list(ad.obs.organism.unique().categories)
unique_organism_str = ", ".join(unique_organisms)
else:
unique_organism_str = "human"
species = species_to_readable.get(unique_organism_str, "human")
# Filter to gene symbols with protein embeddings
ad, _ = load_gene_embeddings_adata(
adata=ad,
species=[species],
embedding_model="ESM2"
)
ad = ad.copy()
# Simple filtering by counts
sc.pp.filter_cells(ad, min_genes=200)
sc.pp.filter_genes(ad, min_cells=10)
#print(ad)
if "sample" in ad.obs.columns:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, batch_key="sample")
except:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, batch_key="sample", span=1)
except:
print(f"can't hv gene subset {dataset_title}")
covar_cols[dataset_title] = "sample"
elif "batch" in ad.obs.columns:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, batch_key="batch")
except:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, batch_key="batch", span=1)
except:
print(f"can't hv gene subset {dataset_title}")
covar_cols[dataset_title] = "batch"
else:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True)
except:
try:
sc.pp.highly_variable_genes(ad, flavor="seurat_v3", n_top_genes=N_TOP_GENES, subset=True, span=1)
except:
print(f"can't hv gene subset {dataset_title}")
covar_cols[dataset_title] = ""
num_genes[dataset_title] = ad.X.shape[1]
num_cells[dataset_title] = ad.X.shape[0]
paths[dataset_title] = f"{dataset_title}.h5ad"
all_species[dataset_title] = species
print("writing proc")
ad.write(proc_path)
arr = data_to_torch_X(ad.X).numpy()
shape = arr.shape
fp = np.memmap(npz_path, dtype='int64', mode='w+', shape=shape)
fp[:] = arr[:]
fp.flush()
return
if __name__ == '__main__':
'''
manager = Manager()
num_genes = manager.dict()
num_cells = manager.dict()
paths = manager.dict()
all_species = manager.dict()
covar_cols = manager.dict()
'''
num_genes = {}
num_cells = {}
paths = {}
all_species = {}
covar_cols = {}
df = pd.DataFrame()
# Shuffle the dataset
census_datasets = census_datasets#.iloc[270:]
iterrows = list(census_datasets.iterrows())
#p = Pool(8)
#for row in tqdm(iterrows, total=len(census_datasets)):
# p.apply_async(process_row, args=(row, num_genes, num_cells, paths, all_species, covar_cols))
#p.close()
#p.join()
'''
with Pool(1) as p:
nrows = len(iterrows)
inputs = zip(iterrows, [num_genes]*nrows, [num_cells]*nrows, [paths]*nrows, [all_species]*nrows, [covar_cols]*nrows)
for _ in tqdm(p.istarmap(process_row, inputs),
total=nrows):
pass
'''
if os.path.exists("dataset_rows_mouse_fixed.pkl"):
dataset_rows = {}
for path in glob.glob("dataset_rows_mouse_fixed*.pkl"):
with open(path, "rb") as f:
dataset_rows_path = pkl.load(f)
dataset_rows.update(dataset_rows_path)
print(f"{len(dataset_rows)} already counted")
else:
dataset_rows = {}
pbar = tqdm(iterrows)
all_errors = []
total_number_of_cells = 0
duplicate_titles = ['Dissection: Body of hippocampus (HiB) - Rostral DG-CA4', 'Retina',
'Colon', 'Myeloid cells', 'Ileum', 'Airway']
duplicate_titles_2 = ['retina', 'airway', 'myeloid_cells', 'colon', 'ileum', 'immune_cells']
for row in pbar:
dataset_title = row[1].dataset_title
if dataset_title in duplicate_titles:
dataset_title = row[1].collection_name + row[1].dataset_title
dataset_title = dataset_title.lower().replace(' ', '_').replace(",", "").replace("/", "")
if dataset_title in duplicate_titles_2:
dataset_title = (row[1].collection_name + "_" + dataset_title).lower().replace(' ', '_').replace(",", "").replace("/", "")
print(f"{total_number_of_cells} cells done")
if dataset_title in dataset_rows:
paths[dataset_title] = dataset_rows[dataset_title][0]
all_species[dataset_title] = dataset_rows[dataset_title][1]
covar_cols[dataset_title] = dataset_rows[dataset_title][2]
num_cells[dataset_title] = dataset_rows[dataset_title][3]
num_genes[dataset_title] = dataset_rows[dataset_title][4]
#print("skipped read of proc")
total_number_of_cells += dataset_rows[dataset_title][3]
continue # Skip!
else:
pbar.set_description(f"{dataset_title} proc")
try:
process_row(row, num_genes, num_cells, paths, all_species, covar_cols, dataset_title=dataset_title)
except:
print(f"****{dataset_title} ERROR****")
all_errors.append(dataset_title)
pbar.set_description(f"{dataset_title} done")
if dataset_title in paths:
dataset_rows[dataset_title] = [paths[dataset_title], all_species[dataset_title], covar_cols[dataset_title], num_cells[dataset_title], num_genes[dataset_title], dataset_title]
total_number_of_cells += dataset_rows[dataset_title][3]
with open("dataset_rows_mouse_fixed.pkl", "wb") as f:
pkl.dump(dataset_rows, f)
print("wrote pkl")
# path,species,covar_col,num_cells,names
df["path"] = list(paths.values())
df["species"] = list(all_species.values())
df["covar_col"] = list(covar_cols.values())
df["num_cells"] = list(num_cells.values())
df["num_genes"] = list(num_genes.values())
df["names"] = list(paths.keys())
print(df.head(20))
print()
print("Errors:")
print(all_errors)
df.to_csv("cxg_datasets.csv", index=False)
================================================
FILE: data_proc/gene_embeddings.py
================================================
"""Helper functions for loading pretrained gene embeddings."""
from pathlib import Path
from typing import Dict, Tuple
import torch
from scanpy import AnnData
import numpy as np
import pandas as pd
EMBEDDING_DIR = Path('model_files/protein_embeddings')
MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH = {
'ESM2': {
'human': EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt',
'mouse': EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt',
'frog': EMBEDDING_DIR / 'Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM2.pt',
'zebrafish': EMBEDDING_DIR / 'Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM2.pt',
"mouse_lemur": EMBEDDING_DIR / "Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM2.pt",
"pig": EMBEDDING_DIR / 'Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM2.pt',
"macaca_fascicularis": EMBEDDING_DIR / 'Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt',
"macaca_mulatta": EMBEDDING_DIR / 'Macaca_mulatta.Mmul_10.gene_symbol_to_embedding_ESM2.pt',
}
}
extra_species = pd.read_csv("./model_files/new_species_protein_embeddings.csv").set_index("species").to_dict()["path"]
MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH["ESM2"].update(extra_species) # adds new species
def load_gene_embeddings_adata(adata: AnnData, species: list, embedding_model: str) -> Tuple[AnnData, Dict[str, torch.FloatTensor]]:
"""Loads gene embeddings for all the species/genes in the provided data.
    :param adata: An AnnData object containing gene expression data for cells.
:param species: Species corresponding to this adata
:param embedding_model: The gene embedding model whose embeddings will be loaded.
:return: A tuple containing:
- A subset of the data only containing the gene expression for genes with embeddings in all species.
- A dictionary mapping species name to the corresponding gene embedding matrix (num_genes, embedding_dim).
"""
# Get species names
species_names = species
species_names_set = set(species_names)
# Get embedding paths for the model
species_to_gene_embedding_path = MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH[embedding_model]
available_species = set(species_to_gene_embedding_path)
# Ensure embeddings are available for all species
if not (species_names_set <= available_species):
raise ValueError(f'The following species do not have gene embeddings: {species_names_set - available_species}')
# Load gene embeddings for desired species (and convert gene symbols to lower case)
species_to_gene_symbol_to_embedding = {
species: {
gene_symbol.lower(): gene_embedding
for gene_symbol, gene_embedding in torch.load(species_to_gene_embedding_path[species]).items()
}
for species in species_names
}
# Determine which genes to include based on gene expression and embedding availability
genes_with_embeddings = set.intersection(*[
set(gene_symbol_to_embedding)
for gene_symbol_to_embedding in species_to_gene_symbol_to_embedding.values()
])
genes_to_use = {gene for gene in adata.var_names if gene.lower() in genes_with_embeddings}
# Subset data to only use genes with embeddings
adata = adata[:, adata.var_names.isin(genes_to_use)]
# Set up dictionary mapping species to gene embedding matrix (num_genes, embedding_dim)
species_to_gene_embeddings = {
species_name: torch.stack([
species_to_gene_symbol_to_embedding[species_name][gene_symbol.lower()]
for gene_symbol in adata.var_names
])
for species_name in species_names
}
return adata, species_to_gene_embeddings
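# Illustrative usage sketch (an assumption, not part of the original module): subset an AnnData
# to the genes that have ESM2 protein embeddings and fetch the matching embedding matrix.
# The h5ad path is a placeholder and is assumed to hold counts with gene symbols in .var_names.
if __name__ == "__main__":
    import scanpy as sc

    adata = sc.read_h5ad("example_human.h5ad")  # placeholder input file
    adata, species_to_embeddings = load_gene_embeddings_adata(
        adata=adata,
        species=["human"],
        embedding_model="ESM2",
    )
    # species_to_embeddings["human"] is a (num_kept_genes, embedding_dim) tensor whose rows
    # are aligned with adata.var_names after subsetting.
    print(adata.shape, species_to_embeddings["human"].shape)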
================================================
FILE: data_proc/generate_reduced_chrom_files.py
================================================
import os
os.environ["OMP_NUM_THREADS"] = "4" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "4" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "4" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "4" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "4"
import warnings
warnings.filterwarnings("ignore")
import scanpy as sc
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pickle
import os
import argparse
import logging
import time
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import pandas as pd
#sc._settings.ScanpyConfig.n_jobs = 6
import math
from typing import Tuple
import torch
from torch import nn, Tensor
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset
from accelerate import Accelerator
import anndata
from data_utils import adata_path_to_prot_chrom_starts, get_spec_chrom_csv
from torch.utils.data import dataset
from torch.utils.data import DataLoader, TensorDataset
from scipy.stats import binom
def padding_tensor(sequences):
"""
:param sequences: list of tensors
    :return: padded tensor of shape (max_len, num, 1280) and a (num, max_len) mask (1 for valid positions, -inf for padding)
"""
num = len(sequences)
max_len = max([s.size(0) for s in sequences])
out_dims = (num, max_len, 1280)
out_tensor = sequences[0].data.new(*out_dims).fill_(0)
out_dims2 = (num, max_len)
mask = sequences[0].data.new(*out_dims2).fill_(float('-inf'))
for i, tensor in enumerate(sequences):
length = tensor.size(0)
out_tensor[i, :length] = tensor
mask[i, :length] = 1
return out_tensor.permute(1, 0, 2), mask
from pathlib import Path
# ESM1b
'''
EMBEDDING_DIR = Path('/dfs/project/cross-species/data/proteome/embeddings')
human_pe_dir = EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt'
mouse_pe_dir = EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt'
lemur_pe_dir = Path("/dfs/project/cross-species/yanay/data/proteome/embeddings/") / 'Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM1b.pt'
'''
# Upgrade to ESM2
EMBEDDING_DIR = Path('/dfs/project/cross-species/data/proteome/embeddings')
EMBEDDING_DIR = Path('/dfs/project/cross-species/yanay/data/proteome/embeddings')
embeddings_paths = {
'human': EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt',
'mouse': EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt',
'frog': EMBEDDING_DIR / 'Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM2.pt',
'zebrafish': EMBEDDING_DIR / 'Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM2.pt',
"mouse_lemur": EMBEDDING_DIR / "Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM2.pt",
"pig": EMBEDDING_DIR / 'Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM2.pt',
"macaca_fascicularis": EMBEDDING_DIR / 'Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt',
"macaca_mulatta": EMBEDDING_DIR / 'Macaca_mulatta.Mmul_10.gene_symbol_to_embedding_ESM2.pt',
}
species_to_pe = {
species:torch.load(pe_dir) for species, pe_dir in embeddings_paths.items()
}
species_to_pe = {species:{k.upper(): v for k,v in pe.items()} for species, pe in species_to_pe.items()}
#species_to_keys = {species:list(pe.keys()) for species, pe in species_to_pe.items()}
#species_to_keys = {species:dict(zip(keys, np.arange(len(keys)))) for species, keys in species_to_keys.items()}
#datasets_df = pd.read_csv("/dfs/project/cross-species/yanay/code/UCE/data_proc/full_train_datasets.csv")
datasets_df = pd.read_csv("tissue_datasets.csv")
datasets_df = pd.read_csv("perturb_datasets.csv")
datasets_df = pd.read_csv("../new_perturb_datasets.csv")
#pd.concat((#pd.read_csv("new_datasets.csv"),
#pd.read_csv("pbmcs_nohvg.csv"),
#pd.read_csv("lung_nohvg.csv"),
#pd.read_csv("new_tabula_datasets.csv"),
#pd.read_csv("updated_datasets.csv"),
# #pd.read_csv("sanger_heart_atlas_datasets.csv"),
# pd.read_csv("tissue_datasets.csv")
# ))
#datasets_df = pd.read_csv("cell_cycle_datasets.csv")
#datasets_df = pd.read_csv("spatial_datasets.csv")
#datasets_df = pd.read_csv("perturb_datasets.csv")
#datasets_df = pd.read_csv("ccle_datasets.csv")
#datasets_df = pd.read_csv("pancreas_datasets.csv")
sorted_dataset_names = sorted(datasets_df["names"])
with open("dataset_shapes.pkl", "rb") as f:
shapes_dict = pickle.load(f)
shapes_dict.update({
"madissoon_novel_lung":(190728, 8000),
'flores_cerebellum_human': (20232, 8000),
'osuch_gut_human': (272310, 8000),
'msk_ovarian_human': (929690, 8000),
'htan_vmuc_dis_epi_human': (65084, 8000),
'htan_vmuc_val_epi_human': (57564, 8000),
'htan_vmuc_non_epi_human': (9099, 8000),
'hao_pbmc_3p_human': (161764, 8000),
'hao_pbmc_5p_human': (49147, 8000),
'gao_tumors_human': (36111, 8000),
'swabrick_breast_human': (92427, 8000),
'wu_cryo_tumors_human': (105662, 8000),
'cell_line_het_human': (53513, 8000),
'bi_allen_metastasis_human': (27787, 8000),
'zheng68k_human': (68579, 8000),
'zheng68k_12k_human': (68579, 12000),
'mouse_embryo_ct': (153597, 12000),
"regev_gtex_heart": (36574, 8000),
"tabula_sapiens_heart": (11505, 8000),
"10k_pbmcs":(11990, 12000),
"epo_ido":(35834,12000),
'tabula_sapiens_kidney': (9641, 8000),
'tabula_microcebus_kidney': (14592, 8000),
'tabula_muris_kidney': (2781, 8000),
'tabula_muris_senis_kidney': (19610, 8000),
'immune_human': (33506, 8000)
})
for row in datasets_df.iterrows():
ngenes = row[1].num_genes
ncells = row[1].num_cells
name = row[1].names
if not np.isnan(ngenes):
shapes_dict[name] = (int(ncells), int(ngenes))
#with open("dataset_shapes.pkl", "wb") as f:
# pickle.dump(shapes_dict, f)
token_dim = 5120
mmap_dict = {}
root_dir = "/lfs/local/0/yanay/uce_h5s/"
root_dir_census = "/lfs/local/0/yanay/cxg_h5s/"
dataset_to_paths = {r[1]["names"]:root_dir + r[1]["path"].replace(".h5ad", "_proc.h5ad") for r in datasets_df.iterrows()}
for row in datasets_df.iterrows():
name = row[1].names
census = row[1].census
if census == "yes":
dataset_to_paths[name] = dataset_to_paths[name].replace(root_dir, root_dir_census)
datasets_to_species = {r[1]["names"]:r[1]["species"] for r in datasets_df.iterrows()}
#species_to_pe = {"mouse":mouse_pe, "human":human_pe, "mouse_lemur":lemur_pe}
#dataset_to_protein_embeddings_all = {k:species_to_pe[v] for k, v in datasets_to_species.items()}
dataset_to_protein_embeddings = {}
#dataset_to_protein_embeddings_all["madissoon_novel_lung"] = species_to_pe["human"]
datasets_to_species["madissoon_novel_lung"] = "human"
#dataset_to_paths["madissoon_novel_lung"] = "/lfs/local/0/yanay/uce_h5s/madissoon_novel_lung_proc.h5ad"
# New Chrom Based Code
gene_to_chrom_pos = get_spec_chrom_csv()
species_to_chrom_categories = {}
for species in np.unique(gene_to_chrom_pos["species"]):
species_to_chrom_categories[species] = pd.Categorical(gene_to_chrom_pos["chromosome"]).categories
dataset_to_chroms = {}
dataset_to_starts = {}
sorted_species_names = sorted(species_to_pe.keys())
print(sorted_species_names)
if os.path.exists(f"/dfs/project/uce/all_species_pe_tokens.torch"):
all_pe = torch.load(f"/dfs/project/uce/all_species_pe_tokens.torch")
with open("/dfs/project/uce/all_species_offsets.pkl", "rb") as f:
species_to_offsets = pickle.load(f)
print("Loaded PE", all_pe.shape)
else:
torch.manual_seed(8)
MASK_TENSOR = torch.zeros((1, token_dim)) # this is the padding token
CHROM_TENSOR_LEFT = torch.normal(mean=0, std=1, size=(1, token_dim))
CHROM_TENSOR_RIGHT = torch.normal(mean=0, std=1, size=(1, token_dim))
CLS_TENSOR = torch.normal(mean=0, std=1, size=(1, token_dim))
species_to_offsets = {}
all_pe = [MASK_TENSOR, CHROM_TENSOR_LEFT, CHROM_TENSOR_RIGHT, CLS_TENSOR]
offset = len(all_pe) # special tokens at the top!
for species in sorted_species_names:
pe_stacked = torch.stack(list(species_to_pe[species].values()))
all_pe.append(pe_stacked)
species_to_offsets[species] = offset
offset += pe_stacked.shape[0]
all_pe = torch.vstack(all_pe)
print(all_pe.shape)
torch.save(all_pe, f"/dfs/project/uce/all_species_pe_tokens.torch")
with open("/dfs/project/uce/all_species_offsets.pkl", "wb+") as f:
pickle.dump(species_to_offsets, f)
print("Saved PE")
# Load in already saved!
if os.path.exists(f"/lfs/local/0/yanay/reduced_datasets_to_pe_chrom_{token_dim}_new.torch"):
dataset_to_protein_embeddings = torch.load(f"/lfs/local/0/yanay/reduced_datasets_to_pe_chrom_{token_dim}_new.torch")
with open("/lfs/local/0/yanay/dataset_to_chroms_new.pkl", "rb") as f:
dataset_to_chroms = pickle.load(f)
with open("/lfs/local/0/yanay/dataset_to_starts_new.pkl", "rb") as f:
dataset_to_starts = pickle.load(f)
else:
dataset_to_protein_embeddings = {}
dataset_to_chroms = {}
dataset_to_starts = {}
# Add the new ones
print("creating reduced size protein embeddings file")
redo = True
for dataset, path in tqdm(list(dataset_to_paths.items())):
if dataset in dataset_to_protein_embeddings.keys() and not redo:
continue # skip since already procced
print(dataset)
adata = sc.read(path)
dataset_species = datasets_to_species[dataset]
spec_pe_genes = list(species_to_pe[dataset_species].keys())
offset = species_to_offsets[dataset_species]
# Get proper idxs
pe_row_idxs, dataset_chroms, dataset_pos = adata_path_to_prot_chrom_starts(adata, dataset_species, spec_pe_genes, gene_to_chrom_pos, offset)
# Add to dicts
dataset_to_chroms[dataset] = dataset_chroms
dataset_to_starts[dataset] = dataset_pos
dataset_to_protein_embeddings[dataset] = pe_row_idxs
del adata
# save Dicts and idxs
torch.save(dataset_to_protein_embeddings, f"/lfs/local/0/yanay/reduced_datasets_to_pe_chrom_{token_dim}_new.torch")
with open("/lfs/local/0/yanay/dataset_to_chroms_new.pkl", "wb+") as f:
pickle.dump(dataset_to_chroms, f)
with open("/lfs/local/0/yanay/dataset_to_starts_new.pkl", "wb+") as f:
pickle.dump(dataset_to_starts, f)
================================================
FILE: data_proc/preproc_many_dataset.py
================================================
import os
os.environ["OMP_NUM_THREADS"] = "10" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "10" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "10" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "10" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "10"
from collections import defaultdict
from typing import Dict, List, Optional, Tuple
import torch
import torch.utils.data as data
import numpy as np
import scanpy as sc
from numpy import array
import subprocess
import argparse
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
from gene_embeddings import load_gene_embeddings_adata
import pandas as pd
import numpy as np
from scanpy import AnnData
from data_utils import process_raw_anndata
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
class SincleCellDataset(data.Dataset):
def __init__(self,
expression: torch.tensor, # Subset to hv genes, count data! cells x genes
protein_embeddings: torch.tensor, # same order as expression, also subset genes x pe
labels: None, # optional, tensor of labels
covar_vals: None, # tensor of covar values or none
) -> None:
super(SincleCellDataset, self).__init__()
# Set expression
self.expression = expression
row_sums = self.expression.sum(1) # UMI Counts
log_norm_count_adj = torch.log1p(self.expression / (self.expression.sum(1)).unsqueeze(1) * torch.tensor(1000))
# Set log norm and count adjusted expression
max_vals, max_idx = torch.max(log_norm_count_adj, dim=0)
self.expression_mod = log_norm_count_adj / max_vals
        # Calculate dropout likelihoods of each gene
self.dropout_vec = (self.expression == 0).float().mean(0) # per gene dropout percentages
# Set data info
self.num_cells = self.expression.shape[0]
self.num_genes = self.expression.shape[1]
# Set optional label info, including categorical covariate index
self.covar_vals = covar_vals
self.labels = labels
# Set protein embeddings
self.protein_embeddings = protein_embeddings
self.item_mode = "expression"
if self.covar_vals is not None:
self.item_mode = "expression+covar"
def __getitem__(self, idx):
if self.item_mode == "expression":
if isinstance(idx, int):
if idx < self.num_cells:
return self.expression[idx, :]
else:
raise IndexError
else:
raise NotImplementedError
elif self.item_mode == "expression+covar":
if isinstance(idx, int):
if idx < self.num_cells:
return self.expression[idx, :], self.covar_vals[idx]
else:
raise IndexError
else:
raise NotImplementedError
def __len__(self) -> int:
return self.num_cells
def get_dim(self) -> Dict[str, int]:
return self.num_genes
def data_to_torch_X(X):
if isinstance(X, sc.AnnData):
X = X.X
if not isinstance(X, np.ndarray):
X = X.toarray()
return torch.from_numpy(X).float()
def anndata_to_sc_dataset(adata:sc.AnnData,
species:str="human",
labels:list=[],
covar_col:str=None,
hv_genes:int=12000,
embedding_model="ESM1b",
) -> (SincleCellDataset, AnnData):
# Subset to just genes we have embeddings for
adata, protein_embeddings = load_gene_embeddings_adata(
adata=adata,
species=[species],
embedding_model=embedding_model
)
if DO_HVG:
sc.pp.highly_variable_genes(adata, flavor='seurat_v3', n_top_genes=hv_genes) # Expects Count Data
hv_index = adata.var["highly_variable"]
adata = adata[:, hv_index] # Subset to hv genes only
protein_embeddings = protein_embeddings[species][hv_index]
else:
protein_embeddings = protein_embeddings[species]
expression = data_to_torch_X(adata.X)
covar_vals = None
if len(labels) > 0:
assert covar_col is None or covar_col in labels, "Covar needs to be in labels" # make sure you keep track of covar column!
labels = adata.obs.loc[:, labels].values
if covar_col is not None:
# we have a categorical label to use as covariate
covar_vals = torch.tensor(pd.Categorical(adata.obs[covar_col]).codes)
return SincleCellDataset(
expression=expression,
protein_embeddings=protein_embeddings,
labels=labels,
covar_vals=covar_vals
), adata
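# Example (a sketch, not part of the original script): building a dataset from a raw-count
# AnnData, assuming a module-level DO_HVG flag has been set and `adata` holds counts.
#   DO_HVG = True
#   dataset, adata_hv = anndata_to_sc_dataset(adata, species="human", hv_genes=12000)
#   len(dataset)   # number of cells
#   dataset[0]     # count vector for the first cell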
def proc(args):
datasets_df = pd.read_csv(args.datasets_df)
datasets_df["covar_col"] = np.nan
skip = args.skip
additional_filter = args.filter
DO_HVG = args.DO_HVG
num_genes = {}
num_cells = {}
ir = list(datasets_df.iterrows())
    for i, row in tqdm(ir, total=len(datasets_df)):
        _, ncells, ngenes = process_raw_anndata(row, args.h5_folder_path,
                                                args.npz_folder_path, args.scp,
                                                skip, additional_filter,
                                                root=args.file_root_path)
        if (ncells is not None) and (ngenes is not None):
            num_cells[row.path] = ncells
            num_genes[row.path] = ngenes
if "num_cells" not in datasets_df.columns:
datasets_df["num_cells"] = 0
if "num_genes" not in datasets_df.columns:
datasets_df["num_genes"] = 0
for k in num_genes.keys():
ng = num_genes[k]
nc = num_cells[k]
datasets_df.loc[datasets_df["path"] == k, "num_cells"] = nc
datasets_df.loc[datasets_df["path"] == k, "num_genes"] = ng
# Write with the cells and genes info back to the original path
datasets_df.to_csv(args.datasets_df, index=False)
if __name__=="__main__":
# Parse command-line arguments
    parser = argparse.ArgumentParser(description='Preprocess h5ad datasets.')
# Define command-line arguments
parser.add_argument('--scp', type=str, default="", help='Name of a SNAP server to SCP the results to. It should have the same folders as the script is already saving to.')
parser.add_argument('--h5_folder_path', type=str, default="/lfs/local/0/yanay/uce_h5s/", help='Folder to save H5s to.')
parser.add_argument('--npz_folder_path', type=str, default="/lfs/local/0/yanay/uce_proc/", help='Folder to save NPZs to.')
parser.add_argument('--datasets_df', type=str, default="/dfs/project/uce/new_perturb_datasets.csv", help='Path to datasets csv. Will be overwritten to have the correct num cells and num genes for each dataset.')
parser.add_argument('--filter', type=bool, default=True, help='Should you do an additional gene/cell filtering? This can be a good step since even if you have already done it, subsetting to protein embeddings can make some cells sparser.')
parser.add_argument('--skip', type=bool, default=True, help='Should you skip datasets that appear to have already been created in the h5 folder?')
parser.add_argument('--DO_HVG', type=bool, default=False, help='Should a HVG subset be done.')
    # proc() expects a file_root_path argument; default to the current directory
    parser.add_argument('--file_root_path', type=str, default="./",
                        help='Root folder for the dataset paths listed in the datasets csv.')
    args = parser.parse_args()
    proc(args)
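# Example invocation (a sketch; the paths below are placeholders, not project defaults):
#   python preproc_many_dataset.py --datasets_df ./my_datasets.csv \
#       --h5_folder_path ./uce_h5s/ --npz_folder_path ./uce_proc/ \
#       --file_root_path ./raw_h5ads/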
================================================
FILE: eval_data.py
================================================
"""
Dataloaders
"""
import warnings
warnings.filterwarnings("ignore")
import sys
sys.path.append('../')
from typing import Dict, List, Optional, Tuple, Any
import torch
import numpy as np
import pickle
import torch.utils.data as data
class MultiDatasetSentences(data.Dataset):
def __init__(self, sorted_dataset_names, shapes_dict, args,
dataset_to_protein_embeddings_path= "/lfs/local/0/yanay/reduced_datasets_to_pe_chrom_5120_new.torch",
datasets_to_chroms_path="/lfs/local/0/yanay/dataset_to_chroms_new.pkl",
datasets_to_starts_path="/lfs/local/0/yanay/dataset_to_starts_new.pkl",
npzs_dir="/lfs/local/0/yanay/uce_proc/") -> None:
super(MultiDatasetSentences, self).__init__()
# self.xs = {}
self.num_cells = {}
self.num_genes = {}
self.shapes_dict = shapes_dict
self.args = args
self.total_num_cells = 0
for name in sorted_dataset_names:
num_cells, num_genes = self.shapes_dict[name]
# self.xs[name] = X
self.num_cells[name] = num_cells
self.num_genes[name] = num_genes
self.total_num_cells += num_cells
self.datasets = sorted_dataset_names
# TODO: preferably not hard-coded here
self.dataset_to_protein_embeddings = torch.load(dataset_to_protein_embeddings_path)
with open(datasets_to_chroms_path, "rb") as f:
self.dataset_to_chroms = pickle.load(f)
with open(datasets_to_starts_path, "rb") as f:
self.dataset_to_starts = pickle.load(f)
self.npzs_dir = npzs_dir
def __getitem__(self, idx):
if isinstance(idx, int):
for dataset in sorted(self.datasets):
if idx < self.num_cells[dataset]:
#cts = np.memmap(f"/lfs/local/0/yanay/cxg_npzs/" + f"{dataset}_counts.npz",
# dtype='int64', mode='r', shape=self.shapes_dict[dataset])
cts = np.memmap(self.npzs_dir + f"{dataset}_counts.npz", dtype='int64', mode='r', shape=self.shapes_dict[dataset])
counts = cts[idx]
counts = torch.tensor(counts).unsqueeze(0)
weights = torch.log1p(counts)
weights = (weights / torch.sum(weights))
batch_sentences, mask, seq_len, cell_sentences = \
sample_cell_sentences(counts, weights, dataset, self.args,
dataset_to_protein_embeddings= self.dataset_to_protein_embeddings,
dataset_to_chroms=self.dataset_to_chroms,
dataset_to_starts=self.dataset_to_starts)
return batch_sentences, mask, idx, seq_len, cell_sentences
else:
idx -= self.num_cells[dataset]
raise IndexError
else:
raise NotImplementedError
def __len__(self) -> int:
return self.total_num_cells
def get_dim(self) -> Dict[str, int]:
return self.num_genes
class MultiDatasetSentenceCollator(object):
def __init__(self, args):
self.pad_length = args.pad_length
def __call__(self, batch):
batch_size = len(batch)
batch_sentences = torch.zeros((batch_size, self.pad_length))
mask = torch.zeros((batch_size, self.pad_length))
cell_sentences = torch.zeros((batch_size, self.pad_length))
idxs = torch.zeros(batch_size)
i = 0
max_len = 0
for bs, msk, idx, seq_len, cs in batch:
batch_sentences[i, :] = bs
cell_sentences[i, :] = cs
max_len = max(max_len, seq_len)
mask[i, :] = msk
idxs[i] = idx
i += 1
return batch_sentences[:, :max_len] , mask[:, :max_len], idxs, cell_sentences
def sample_cell_sentences(counts, batch_weights, dataset, args,
dataset_to_protein_embeddings,
dataset_to_chroms,
dataset_to_starts):
dataset_idxs = dataset_to_protein_embeddings[dataset] # get the dataset specific protein embedding idxs
cell_sentences = torch.zeros((counts.shape[0], args.pad_length)) # init the cell representation as 0s
mask = torch.zeros((counts.shape[0], args.pad_length)) # start of masking the whole sequence
chroms = dataset_to_chroms[dataset] # get the dataset specific chroms for each gene
starts = dataset_to_starts[dataset] # get the dataset specific genomic start locations for each gene
longest_seq_len = 0 # we need to keep track of this so we can subset the batch at the end
for c, cell in enumerate(counts):
weights = batch_weights[c].numpy()
weights = weights / sum(weights) # RE NORM after mask
# randomly choose the genes that will make up the sample, weighted by expression, with replacement
choice_idx = np.random.choice(np.arange(len(weights)),
size=args.sample_size, p=weights,
replace=True)
choosen_chrom = chroms[choice_idx] # get the sampled genes chromosomes
# order the genes by chromosome
chrom_sort = np.argsort(choosen_chrom)
choice_idx = choice_idx[chrom_sort]
# sort the genes by start
new_chrom = chroms[choice_idx]
choosen_starts = starts[choice_idx]
ordered_choice_idx = np.full((args.pad_length),
args.cls_token_idx) # start with cls
        # position 0 already holds the CLS token, so the sequence content starts at position 1
        i = 1
# Shuffle the chroms now, there's no natural order to chromosomes
uq_chroms = np.unique(new_chrom)
np.random.shuffle(uq_chroms) # shuffle
        # Loop over the (shuffled) chromosomes sampled for this one cell
        for chrom in uq_chroms:
# Open Chrom token
ordered_choice_idx[i] = int(chrom) + args.CHROM_TOKEN_OFFSET # token of this chromosome # i = 1 next token is a chrom open
i += 1
# now sort the genes by start order within the chroms
loc = np.where(new_chrom == chrom)[0]
sort_by_start = np.argsort(
                choosen_starts[loc])  # start locations for this chromosome
to_add = choice_idx[loc[sort_by_start]]
ordered_choice_idx[i:(i + len(to_add))] = dataset_idxs[to_add]
i += len(to_add)
ordered_choice_idx[i] = args.chrom_token_right_idx # add the chrom sep again
i += 1 # add the closing token again
longest_seq_len = max(longest_seq_len, i)
remainder_len = (args.pad_length - i)
cell_mask = torch.concat((torch.ones(i),
# pay attention to all of these tokens, ignore the rest!
torch.zeros(remainder_len)))
mask[c, :] = cell_mask
ordered_choice_idx[i:] = args.pad_token_idx # the remainder of the sequence
cell_sentences[c, :] = torch.from_numpy(ordered_choice_idx)
cell_sentences_pe = cell_sentences.long() # token indices
return cell_sentences_pe, mask, longest_seq_len, cell_sentences
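# Illustrative layout of one padded cell sentence produced by sample_cell_sentences
# (a sketch based on the code above; token indices follow the defaults in eval_single_anndata.py):
#
#   [CLS] [chrom A open] [genes on A, sorted by start] [chrom close]
#         [chrom B open] [genes on B, sorted by start] [chrom close] ... [PAD ... PAD]
#
# Genes are sampled with replacement in proportion to log1p-normalized counts,
# grouped by chromosome (chromosome order is shuffled per cell), and the sequence
# is padded with pad_token_idx up to args.pad_length.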
================================================
FILE: eval_single_anndata.py
================================================
"""
Script for Evaluating a Single AnnData
Parameters:
----------
- `adata_path` (str):
Full path to the AnnData you want to embed.
- `dir` (str):
Working folder where all files will be saved.
- `species` (str):
Species of the AnnData.
- `filter` (bool):
Additional gene/cell filtering on the AnnData.
- `skip` (bool):
Skip datasets that appear to have already been created.
- `model_loc` (str):
Location of pretrained UCE model's weights in a `.torch` file.
- `batch_size` (int):
Batch size for processing.
- `CXG` (bool):
Use CXG model.
- `nlayers` (int):
Number of transformer layers.
- `output_dim` (int):
Desired output dimension.
- `d_hid` (int):
Hidden dimension for processing.
- `token_dim` (int):
Token dimension.
- `spec_chrom_csv_path` (str):
CSV file mapping genes from each species to their respective chromosomes
and genomic start positions.
- `token_file` (str):
`.torch` file containing token/protein embeddings for all tokens.
- `protein_embeddings_dir` (str):
Directory containing protein embedding `.pt` files for all species.
- `offset_pkl_path` (str):
`.pkl` file mapping between species and their gene's locations in the `token_file`.
- `pad_length` (int):
Length to pad the cell sentence to.
- `pad_token_idx` (int):
Index of the padding token in the `token_file`.
- `chrom_token_left_idx` (int):
Left chromosome token index
- `chrom_token_right_idx` (int):
Right chromosome token index
- `cls_token_idx` (int):
CLS token index in the `token_file`.
- `CHROM_TOKEN_OFFSET` (int):
Offset index, tokens after this mark are chromosome identifiers.
- `sample_size` (int):
Number of genes sampled for cell sentence.
- `multi_gpu` (bool):
Run evaluation on multiple GPUs (using accelerator)
Returns:
-------
- `dir/{dataset_name}_proc.h5ad`:
The processed AnnData. Processing involves subsetting it to genes which
have protein embeddings and then refiltering the dataset by minimum counts.
- `dir/{dataset_name}_chroms.pkl`:
File mapping the genes in the dataset to their corresponding chromosome
indices.
- `dir/{dataset_name}_counts.npz`:
File containing the counts of the AnnData in an easily accessible format.
- `dir/{dataset_name}_shapes_dict.pkl`:
File containing the shape (ncell x ngene) of the AnnData, used to read the
`.npz` file.
- `dir/{dataset_name}_pe_idx.torch`:
File mapping between the genes in the dataset and their index in the tokens file.
- `dir/{dataset_name}_starts.pkl`:
File mapping between the genes in the dataset and their genomic start locations.
"""
import argparse
from evaluate import AnndataProcessor
from accelerate import Accelerator
def main(args, accelerator):
processor = AnndataProcessor(args, accelerator)
processor.preprocess_anndata()
processor.generate_idxs()
processor.run_evaluation()
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description='Embed a single anndata using UCE.')
# Anndata Processing Arguments
parser.add_argument('--adata_path', type=str,
default=None,
help='Full path to the anndata you want to embed.')
parser.add_argument('--dir', type=str,
default="./",
help='Working folder where all files will be saved.')
parser.add_argument('--species', type=str, default="human",
help='Species of the anndata.')
parser.add_argument('--filter', type=bool, default=True,
help='Additional gene/cell filtering on the anndata.')
parser.add_argument('--skip', type=bool, default=True,
help='Skip datasets that appear to have already been created.')
# Model Arguments
parser.add_argument('--model_loc', type=str,
default=None,
help='Location of the model.')
parser.add_argument('--batch_size', type=int, default=25,
help='Batch size.')
    parser.add_argument('--pad_length', type=int, default=1536,
                        help='Padded length of the cell sentence.')
parser.add_argument("--pad_token_idx", type=int, default=0,
help="PAD token index")
parser.add_argument("--chrom_token_left_idx", type=int, default=1,
help="Chrom token left index")
parser.add_argument("--chrom_token_right_idx", type=int, default=2,
help="Chrom token right index")
parser.add_argument("--cls_token_idx", type=int, default=3,
help="CLS token index")
parser.add_argument("--CHROM_TOKEN_OFFSET", type=int, default=143574,
help="Offset index, tokens after this mark are chromosome identifiers")
parser.add_argument('--sample_size', type=int, default=1024,
help='Number of genes sampled for cell sentence')
parser.add_argument('--CXG', type=bool, default=True,
help='Use CXG model.')
parser.add_argument('--nlayers', type=int, default=4,
help='Number of transformer layers.')
parser.add_argument('--output_dim', type=int, default=1280,
help='Output dimension.')
parser.add_argument('--d_hid', type=int, default=5120,
help='Hidden dimension.')
parser.add_argument('--token_dim', type=int, default=5120,
help='Token dimension.')
parser.add_argument('--multi_gpu', type=bool, default=False,
help='Use multiple GPUs')
# Misc Arguments
parser.add_argument("--spec_chrom_csv_path",
default="./model_files/species_chrom.csv", type=str,
help="CSV Path for species genes to chromosomes and start locations.")
parser.add_argument("--token_file",
default="./model_files/all_tokens.torch", type=str,
help="Path for token embeddings.")
parser.add_argument("--protein_embeddings_dir",
default="./model_files/protein_embeddings/", type=str,
help="Directory where protein embedding .pt files are stored.")
parser.add_argument("--offset_pkl_path",
default="./model_files/species_offsets.pkl", type=str,
help="PKL file which contains offsets for each species.")
args = parser.parse_args()
accelerator = Accelerator(project_dir=args.dir)
main(args, accelerator)
================================================
FILE: evaluate.py
================================================
import os
# os.environ["NCCL_DEBUG"] = "INFO"
os.environ["OMP_NUM_THREADS"] = "12" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "12" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "12" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "12" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "12"
import warnings
warnings.filterwarnings("ignore")
import scanpy as sc
from tqdm.auto import tqdm
from torch import nn, Tensor
from model import TransformerModel
from eval_data import MultiDatasetSentences, MultiDatasetSentenceCollator
from utils import figshare_download
from torch.utils.data import DataLoader
from data_proc.data_utils import adata_path_to_prot_chrom_starts, \
get_spec_chrom_csv, process_raw_anndata, get_species_to_pe
import os
import pickle
import pandas as pd
import numpy as np
import torch
class AnndataProcessor:
def __init__(self, args, accelerator):
self.args = args
self.accelerator = accelerator
self.h5_folder_path = self.args.dir
self.npz_folder_path = self.args.dir
self.scp = ""
# Check if paths exist, if not, create them
self.check_paths()
# Set up the anndata
self.adata_name = self.args.adata_path.split("/")[-1]
self.adata_root_path = self.args.adata_path.replace(self.adata_name, "")
self.name = self.adata_name.replace(".h5ad", "")
self.proc_h5_path = self.h5_folder_path + f"{self.name}_proc.h5ad"
self.adata = None
# Set up the row
row = pd.Series()
row.path = self.adata_name
row.covar_col = np.nan
row.species = self.args.species
self.row = row
# Set paths once to be used throughout the class
self.pe_idx_path = self.args.dir + f"{self.name}_pe_idx.torch"
self.chroms_path = self.args.dir + f"{self.name}_chroms.pkl"
self.starts_path = self.args.dir + f"{self.name}_starts.pkl"
self.shapes_dict_path = self.args.dir + f"{self.name}_shapes_dict.pkl"
def check_paths(self):
"""
Check if the paths exist, if not, create them
"""
figshare_download("https://figshare.com/ndownloader/files/42706558",
self.args.spec_chrom_csv_path)
figshare_download("https://figshare.com/ndownloader/files/42706555",
self.args.offset_pkl_path)
if not os.path.exists(self.args.protein_embeddings_dir):
figshare_download("https://figshare.com/ndownloader/files/42715213",
'model_files/protein_embeddings.tar.gz')
figshare_download("https://figshare.com/ndownloader/files/42706585",
self.args.token_file)
if self.args.adata_path is None:
print("Using sample AnnData: 10k pbmcs dataset")
self.args.adata_path = "./data/10k_pbmcs_proc.h5ad"
figshare_download(
"https://figshare.com/ndownloader/files/42706966",
self.args.adata_path)
if self.args.model_loc is None:
print("Using sample 4 layer model")
self.args.model_loc = "./model_files/4layer_model.torch"
figshare_download(
"https://figshare.com/ndownloader/files/42706576",
self.args.model_loc)
def preprocess_anndata(self):
if self.accelerator.is_main_process:
self.adata, num_cells, num_genes = \
process_raw_anndata(self.row,
self.h5_folder_path,
self.npz_folder_path,
self.scp,
self.args.skip,
self.args.filter,
root=self.adata_root_path)
if (num_cells is not None) and (num_genes is not None):
self.save_shapes_dict(self.name, num_cells, num_genes,
self.shapes_dict_path)
if self.adata is None:
self.adata = sc.read(self.proc_h5_path)
def save_shapes_dict(self, name, num_cells, num_genes, shapes_dict_path):
shapes_dict = {name: (num_cells, num_genes)}
with open(shapes_dict_path, "wb+") as f:
pickle.dump(shapes_dict, f)
print("Wrote Shapes Dict")
def generate_idxs(self):
if self.accelerator.is_main_process:
if os.path.exists(self.pe_idx_path) and \
os.path.exists(self.chroms_path) and \
os.path.exists(self.starts_path):
print("PE Idx, Chrom and Starts files already created")
else:
species_to_pe = get_species_to_pe(self.args.protein_embeddings_dir)
with open(self.args.offset_pkl_path, "rb") as f:
species_to_offsets = pickle.load(f)
gene_to_chrom_pos = get_spec_chrom_csv(
self.args.spec_chrom_csv_path)
dataset_species = self.args.species
spec_pe_genes = list(species_to_pe[dataset_species].keys())
offset = species_to_offsets[dataset_species]
pe_row_idxs, dataset_chroms, dataset_pos = adata_path_to_prot_chrom_starts(
self.adata, dataset_species, spec_pe_genes, gene_to_chrom_pos, offset)
# Save to the temp dict
torch.save({self.name: pe_row_idxs}, self.pe_idx_path)
with open(self.chroms_path, "wb+") as f:
pickle.dump({self.name: dataset_chroms}, f)
with open(self.starts_path, "wb+") as f:
pickle.dump({self.name: dataset_pos}, f)
def run_evaluation(self):
self.accelerator.wait_for_everyone()
with open(self.shapes_dict_path, "rb") as f:
shapes_dict = pickle.load(f)
run_eval(self.adata, self.name, self.pe_idx_path, self.chroms_path,
self.starts_path, shapes_dict, self.accelerator, self.args)
def get_ESM2_embeddings(args):
# Load in ESM2 embeddings and special tokens
all_pe = torch.load(args.token_file)
if all_pe.shape[0] == 143574:
torch.manual_seed(23)
CHROM_TENSORS = torch.normal(mean=0, std=1, size=(1895, args.token_dim))
        # 1895 is the total number of chromosome tokens; it is hardcoded for now
all_pe = torch.vstack(
(all_pe, CHROM_TENSORS)) # Add the chrom tensors to the end
all_pe.requires_grad = False
return all_pe
def padding_tensor(sequences):
"""
:param sequences: list of tensors
:return:
"""
num = len(sequences)
max_len = max([s.size(0) for s in sequences])
out_dims = (num, max_len, 1280)
out_tensor = sequences[0].data.new(*out_dims).fill_(0)
out_dims2 = (num, max_len)
mask = sequences[0].data.new(*out_dims2).fill_(float('-inf'))
for i, tensor in enumerate(sequences):
length = tensor.size(0)
out_tensor[i, :length] = tensor
mask[i, :length] = 1
return out_tensor.permute(1, 0, 2), mask
def run_eval(adata, name, pe_idx_path, chroms_path, starts_path, shapes_dict,
accelerator, args):
#### Set up the model ####
token_dim = args.token_dim
emsize = 1280 # embedding dimension
d_hid = args.d_hid # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = args.nlayers # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 20 # number of heads in nn.MultiheadAttention
dropout = 0.05 # dropout probability
model = TransformerModel(token_dim=token_dim, d_model=emsize, nhead=nhead,
d_hid=d_hid,
nlayers=nlayers, dropout=dropout,
output_dim=args.output_dim)
if args.model_loc is None:
raise ValueError("Must provide a model location")
    # initialize the token embedding table as empty; the real embeddings are loaded below
empty_pe = torch.zeros(145469, 5120)
empty_pe.requires_grad = False
model.pe_embedding = nn.Embedding.from_pretrained(empty_pe)
model.load_state_dict(torch.load(args.model_loc, map_location="cpu"),
strict=True)
# Load in the real token embeddings
all_pe = get_ESM2_embeddings(args)
# This will make sure that you don't overwrite the tokens in case you're embedding species from the training data
# We avoid doing that just in case the random seeds are different across different versions.
if all_pe.shape[0] != 145469:
all_pe.requires_grad = False
model.pe_embedding = nn.Embedding.from_pretrained(all_pe)
print(f"Loaded model:\n{args.model_loc}")
model = model.eval()
model = accelerator.prepare(model)
batch_size = args.batch_size
#### Run the model ####
# Dataloaders
dataset = MultiDatasetSentences(sorted_dataset_names=[name],
shapes_dict=shapes_dict,
args=args, npzs_dir=args.dir,
dataset_to_protein_embeddings_path=pe_idx_path,
datasets_to_chroms_path=chroms_path,
datasets_to_starts_path=starts_path
)
multi_dataset_sentence_collator = MultiDatasetSentenceCollator(args)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False,
collate_fn=multi_dataset_sentence_collator,
num_workers=0)
dataloader = accelerator.prepare(dataloader)
pbar = tqdm(dataloader, disable=not accelerator.is_local_main_process)
dataset_embeds = []
with torch.no_grad():
for batch in pbar:
batch_sentences, mask, idxs = batch[0], batch[1], batch[2]
batch_sentences = batch_sentences.permute(1, 0)
if args.multi_gpu:
batch_sentences = model.module.pe_embedding(batch_sentences.long())
else:
batch_sentences = model.pe_embedding(batch_sentences.long())
batch_sentences = nn.functional.normalize(batch_sentences,
dim=2) # Normalize token outputs now
_, embedding = model.forward(batch_sentences, mask=mask)
# Fix for duplicates in last batch
accelerator.wait_for_everyone()
embeddings = accelerator.gather_for_metrics((embedding))
if accelerator.is_main_process:
dataset_embeds.append(embeddings.detach().cpu().numpy())
accelerator.wait_for_everyone()
if accelerator.is_main_process:
dataset_embeds = np.vstack(dataset_embeds)
adata.obsm["X_uce"] = dataset_embeds
write_path = args.dir + f"{name}_uce_adata.h5ad"
adata.write(write_path)
print("*****Wrote Anndata to:*****")
print(write_path)
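# Sketch (not part of the original script): reading the embeddings back after run_eval,
# assuming the default working directory "./" and the sample dataset name "10k_pbmcs_proc".
#   import scanpy as sc
#   adata = sc.read("./10k_pbmcs_proc_uce_adata.h5ad")
#   adata.obsm["X_uce"].shape  # (n_cells, args.output_dim) UCE cell embeddings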
================================================
FILE: examples/Benchmark Embeddings with scIB.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "6b258384-9a56-4ed0-be6f-db1c94711356",
"metadata": {},
"source": [
"# Large Scale Embedding benchmarks\n",
"\n",
"This notebook includes an example showing how to run large scale embedding benchmarks using scIB [(single-cell integration benchmark)](https://www.nature.com/articles/s41592-021-01336-8)\n",
"\n",
"We use the GPU accelerated version implemented here: https://github.com/YosefLab/scib-metrics\n",
"\n",
"Please follow installation instructions in that repo. \n",
"\n",
"*Note: installing Faiss can be difficult and may take some time*\n",
"\n",
"*Running the full benchmarking suite on many cells can take many hours, even on GPUs with large amounts of memory, such as A100s, and with many threads*"
]
},
{
"cell_type": "markdown",
"id": "ca4ba3a1-5c85-4c7b-8564-f8c5689e9345",
"metadata": {},
"source": [
"## Load Imports and define Benchmark Function"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b9d9fd58-915b-492d-9880-48c37e3859a8",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import scanpy as sc\n",
"\n",
"from scib_metrics.benchmark import Benchmarker\n",
"\n",
"import faiss\n",
"\n",
"from scib_metrics.nearest_neighbors import NeighborsResults\n",
"\n",
"# Faiss GPU accelerate nearest neighbors methods\n",
"def faiss_hnsw_nn(X: np.ndarray, k: int):\n",
" \"\"\"Gpu HNSW nearest neighbor search using faiss.\n",
"\n",
" See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md\n",
" for index param details.\n",
" \"\"\"\n",
" X = np.ascontiguousarray(X, dtype=np.float32)\n",
" res = faiss.StandardGpuResources()\n",
" M = 32\n",
" index = faiss.IndexHNSWFlat(X.shape[1], M, faiss.METRIC_L2)\n",
" gpu_index = faiss.index_cpu_to_gpu(res, 0, index)\n",
" gpu_index.add(X)\n",
" distances, indices = gpu_index.search(X, k)\n",
" del index\n",
" del gpu_index\n",
" # distances are squared\n",
" return NeighborsResults(indices=indices, distances=np.sqrt(distances))\n",
"\n",
"\n",
"def faiss_brute_force_nn(X: np.ndarray, k: int):\n",
" \"\"\"Gpu brute force nearest neighbor search using faiss.\"\"\"\n",
" X = np.ascontiguousarray(X, dtype=np.float32)\n",
" res = faiss.StandardGpuResources()\n",
" index = faiss.IndexFlatL2(X.shape[1])\n",
" gpu_index = faiss.index_cpu_to_gpu(res, 0, index)\n",
" gpu_index.add(X)\n",
" distances, indices = gpu_index.search(X, k)\n",
" del index\n",
" del gpu_index\n",
" # distances are squared\n",
" return NeighborsResults(indices=indices, distances=np.sqrt(distances))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4c5fb90f-ffa5-4cb9-bf6a-6afce956fc86",
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"from scib_metrics.benchmark import Benchmarker, BioConservation, BatchCorrection\n",
"import pandas as pd\n",
"\n",
"## Benchmarking Function, returns dataframe of scores\n",
"def benchmark(ad, label_key=\"cell_type\", batch_key=\"sample_id\", obsm_keys=[\"X_uce\", \"X_scGPT\", \"X_geneformer\"]):\n",
" print(f\"Running using CT key:\", label_key)\n",
" biocons = BioConservation()\n",
" batchcons = BatchCorrection(pcr_comparison=False)\n",
" \n",
" bm = Benchmarker(\n",
" ad,\n",
" batch_key=batch_key,\n",
" label_key=label_key,\n",
" embedding_obsm_keys=obsm_keys,\n",
" bio_conservation_metrics=biocons,\n",
" batch_correction_metrics=None,\n",
" n_jobs=48,\n",
" )\n",
" bm.prepare(neighbor_computer=faiss_brute_force_nn)\n",
" bm.benchmark()\n",
" df = bm.get_results(min_max_scale=False)\n",
" return df"
]
},
{
"cell_type": "markdown",
"id": "2f3bb257-21d4-41d5-9726-50b5e7af04b2",
"metadata": {},
"source": [
"### Load in anndata\n",
"\n",
"For this example, we will benchmark cells from developing mouse brain.\n",
"\n",
"You can download an anndata object with UCE, scGPT and Geneformer embeddings precalulated from [here](https://drive.google.com/drive/folders/1f63fh0ykgEhCrkd_EVvIootBw7LYDVI7)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "35392e93-6ffd-4df6-9609-f85ea6aad4ae",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 597668 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ad = sc.read(\"developing_mouse_brain.h5ad\", cache=True)\n",
"ad"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a4cb1a5e-1672-4ba7-b488-036de0e3ff61",
"metadata": {},
"outputs": [],
"source": [
"cell_type_column = \"supercluster\"\n",
"batch_column = \"donor_id\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "134f4e09-8e68-43fb-9d12-d87a1b5318c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"33"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ad.obs[cell_type_column].unique()) # Number of unique cell types"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ac956e69-9a66-4225-adb8-a01a2d6e23bf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"25"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ad.obs[batch_column].unique()) # Number of unique batches"
]
},
{
"cell_type": "markdown",
"id": "ee280476-4057-4051-b4f1-eb7ee0055e69",
"metadata": {},
"source": [
"# Running the Benchmark\n",
"\n",
"Running the benchmark on the full dataset can take a very long time. Instead, we can run on medium sized samples of cells."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0cae96b8-5be1-4ea5-a919-d16d2205d645",
"metadata": {},
"outputs": [],
"source": [
"sample_size = 100_000 # number of cells"
]
},
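  {
   "cell_type": "code",
   "execution_count": null,
   "id": "subsample-benchmark-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch (added for illustration): benchmark a random subsample of cells rather than the full dataset.\n",
    "# Assumes `ad`, `sample_size`, `cell_type_column`, `batch_column` and `benchmark` are defined above.\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "idx = rng.choice(ad.n_obs, size=min(sample_size, ad.n_obs), replace=False)\n",
    "df_scores = benchmark(ad[idx].copy(), label_key=cell_type_column, batch_key=batch_column)\n",
    "df_scores"
   ]
  },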
{
"cell_type": "code",
"execution_count": 8,
"id": "189ad01d-83c0-40e6-ab13-d16ed7eb0c88",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0d430c0038f84d33915a3d9b211d9608",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/10 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:02<00:04, 2.44s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:03<00:01, 1.61s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.50s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:57<08:33, 57.09s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:57<08:33, 57.09s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [01:08<04:01, 30.17s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [01:08<04:01, 30.17s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:51<04:11, 35.98s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:51<04:11, 35.98s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [01:52<02:12, 22.15s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:52<02:12, 22.15s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [02:17<01:56, 23.23s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [02:17<01:56, 23.23s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [02:17<01:01, 15.40s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [02:17<01:01, 15.40s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [03:39<01:51, 37.27s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [03:39<01:51, 37.27s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [03:40<00:50, 25.49s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [03:40<00:50, 25.49s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [03:42<07:25, 222.58s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:38, 17.58s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:38, 17.58s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:28, 11.01s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:28, 11.01s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:37, 13.94s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:37, 13.94s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [00:41<00:50, 8.48s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<00:50, 8.48s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:46<00:36, 7.39s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:46<00:36, 7.39s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:47<00:29, 7.39s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:39<00:50, 16.92s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:39<00:50, 16.92s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:39<00:24, 12.49s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:39<00:24, 12.49s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [05:23<02:30, 150.75s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:38, 17.57s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:38, 17.57s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:27, 10.95s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:27, 10.95s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:37, 13.95s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:37, 13.95s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:23, 13.95s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:42<00:32, 6.49s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:42<00:32, 6.49s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:43<00:25, 6.49s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:36<00:46, 15.63s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:36<00:46, 15.63s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:37<00:23, 11.92s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:37<00:23, 11.92s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 100%|\u001b[32m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\u001b[0m| 3/3 [07:00<00:00, 140.20s/it]\u001b[0m\u001b[A\n",
"\n",
" \u001b[A"
]
},
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 100000 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" varm: 'PCs'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:01<00:03, 1.97s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:03<00:01, 1.43s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.35s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:39<05:53, 39.31s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:39<05:53, 39.31s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:49<02:58, 22.36s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:49<02:58, 22.36s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:29<03:31, 30.17s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:29<03:31, 30.17s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [01:29<01:49, 18.30s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:29<01:49, 18.30s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [01:56<01:47, 21.46s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [01:56<01:47, 21.46s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [01:56<01:25, 21.46s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [02:56<01:17, 25.85s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [02:56<01:17, 25.85s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [02:56<00:38, 19.05s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [02:56<00:38, 19.05s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [02:58<05:56, 178.39s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:36, 17.40s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:36, 17.40s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:26, 10.83s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:26, 10.83s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:40<01:36, 13.77s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:40<01:36, 13.77s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [00:41<00:50, 8.38s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<00:50, 8.38s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:47<00:37, 7.54s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:47<00:37, 7.54s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:47<00:30, 7.54s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:47<00:57, 19.17s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:47<00:57, 19.17s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:48<00:28, 14.15s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:48<00:28, 14.15s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [04:47<02:17, 137.45s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:37, 17.50s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:37, 17.50s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:24<01:28, 11.07s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:28, 11.07s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:38, 14.04s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:38, 14.04s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [00:41<00:51, 8.54s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<00:51, 8.54s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:43<00:30, 6.03s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:43<00:30, 6.03s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:43<00:24, 6.03s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:40<00:52, 17.48s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:40<00:52, 17.48s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:40<00:25, 12.89s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:40<00:25, 12.89s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 100%|\u001b[32m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\u001b[0m| 3/3 [06:28<00:00, 129.48s/it]\u001b[0m\u001b[A\n",
"\n",
" \u001b[A"
]
},
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 100000 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" varm: 'PCs'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:01<00:03, 1.93s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:02<00:01, 1.40s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.32s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:39<05:53, 39.24s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:39<05:53, 39.24s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:50<03:03, 22.88s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:50<03:03, 22.88s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:30<03:32, 30.42s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:30<03:32, 30.42s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [01:30<01:50, 18.45s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:30<01:50, 18.45s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [01:56<01:46, 21.22s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [01:56<01:46, 21.22s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [01:56<01:24, 21.22s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [02:58<01:18, 26.20s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [02:58<01:18, 26.20s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [02:58<00:38, 19.30s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [02:58<00:38, 19.30s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [03:00<06:00, 180.00s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:36, 17.34s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:36, 17.34s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:26, 10.82s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:26, 10.82s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:36, 13.83s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:36, 13.83s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:22, 13.83s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:45<00:37, 7.40s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:45<00:37, 7.40s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:46<00:29, 7.40s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:45<00:52, 17.45s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:45<00:52, 17.45s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:45<00:26, 13.29s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:45<00:26, 13.29s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [04:46<02:16, 136.79s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:38, 17.58s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:38, 17.58s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:27, 10.88s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:27, 10.88s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:36, 13.84s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:36, 13.84s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:23, 13.84s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:42<00:31, 6.35s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:42<00:31, 6.35s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:42<00:25, 6.35s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:42<00:50, 16.92s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:42<00:50, 16.92s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:42<00:25, 12.89s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:42<00:25, 12.89s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 100%|\u001b[32m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\u001b[0m| 3/3 [06:29<00:00, 129.91s/it]\u001b[0m\u001b[A\n",
"\n",
" \u001b[A"
]
},
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 100000 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" varm: 'PCs'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:01<00:03, 1.97s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:02<00:01, 1.42s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.34s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:39<05:51, 39.11s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:39<05:51, 39.11s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:49<02:58, 22.37s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:49<02:58, 22.37s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:31<03:38, 31.18s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:31<03:38, 31.18s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:31<03:07, 31.18s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [01:58<01:45, 21.06s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [01:58<01:45, 21.06s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [01:58<01:24, 21.06s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [02:55<01:13, 24.48s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [02:55<01:13, 24.48s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [02:55<00:37, 18.63s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [02:55<00:37, 18.63s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [02:57<05:54, 177.02s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:37, 17.45s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:37, 17.45s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:24<01:28, 11.09s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:28, 11.09s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:37, 13.92s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:37, 13.92s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:23, 13.92s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:45<00:35, 7.16s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:45<00:35, 7.16s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:45<00:28, 7.16s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:39<00:47, 15.99s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:39<00:47, 15.99s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:39<00:24, 12.20s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:39<00:24, 12.20s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [04:36<02:11, 131.67s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:39, 17.76s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:39, 17.76s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:24<01:29, 11.22s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:29, 11.22s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:38, 14.11s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:38, 14.11s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m████████████████████████████████████████████████████████████████████████ \u001b[0m| 4/10 [00:42<00:51, 8.58s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:42<00:51, 8.58s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:43<00:29, 5.98s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:43<00:29, 5.98s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:43<00:23, 5.98s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:37<00:50, 16.78s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:37<00:50, 16.78s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:38<00:24, 12.40s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:38<00:24, 12.40s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 100%|\u001b[32m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\u001b[0m| 3/3 [06:15<00:00, 125.25s/it]\u001b[0m\u001b[A\n",
"\n",
" \u001b[A"
]
},
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 100000 × 18285\n",
" obs: 'n_counts', 'n_genes', 'region', 'age', 'experiment', 'species', 'sex', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'cell_type', 'sex_old', 'abca_class', 'abca_subclass', 'abca_supertype', 'abca_cluster', 'abca_region', 'leiden_old', 'region_dissected', 'biosample_id', 'donor_id', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'cell_type_author', 'cell_type__ontology_label', 'supercluster'\n",
" var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'feature_name'\n",
" uns: '10x_batch_colors', '_scvi_manager_uuid', '_scvi_uuid', 'age_colors', 'ages_ordered_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'region_colors', 'region_dissected_colors', 'regions_ordered_colors', 'replicate_colors', 'sex_colors', 'umap'\n",
" obsm: 'X_geneformer', 'X_pca', 'X_scGPT', 'X_scVI', 'X_uce', 'X_umap', 'latent_gene_encoding'\n",
" varm: 'PCs'\n",
" layers: 'counts'\n",
" obsp: 'connectivities', 'distances'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running using CT key: supercluster\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Computing neighbors: 0%| | 0/3 [00:00<?, ?it/s]\u001b[A\n",
"Computing neighbors: 33%|██████████████████████████████████████████████████████████████████ | 1/3 [00:02<00:04, 2.06s/it]\u001b[A\n",
"Computing neighbors: 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:03<00:01, 1.54s/it]\u001b[A\n",
"Computing neighbors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.48s/it]\u001b[A\n",
"Embeddings: 0%|\u001b[32m \u001b[0m| 0/3 [00:00<?, ?it/s]\u001b[0m\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:40<06:01, 40.19s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:40<06:01, 40.19s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:50<03:02, 22.82s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:50<03:02, 22.82s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [01:30<03:33, 30.47s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [01:30<03:33, 30.47s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [01:30<03:02, 30.47s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [01:55<01:41, 20.24s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [01:55<01:41, 20.24s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [01:55<01:20, 20.24s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [02:54<01:13, 24.39s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [02:54<01:13, 24.39s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [02:54<00:37, 18.54s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [02:54<00:37, 18.54s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 33%|\u001b[32m████████████████████████████████████████████████████████████████████▋ \u001b[0m| 1/3 [02:55<05:51, 175.81s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:37, 17.52s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:37, 17.52s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:23<01:27, 10.89s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:23<01:27, 10.89s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:41<01:37, 13.93s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:41<01:37, 13.93s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[34m█████████████████████████████████████████████████████████████████████▏ \u001b[0m| 4/10 [00:41<01:23, 13.93s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████▌ \u001b[0m| 5/10 [00:46<00:36, 7.38s/it, Batch correction: silhouette_batch]\u001b[0m\u001b[A\n",
"Metrics: 50%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 5/10 [00:46<00:36, 7.38s/it, Batch correction: ilisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 60%|\u001b[34m█████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 6/10 [00:46<00:29, 7.38s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \u001b[0m| 7/10 [01:42<00:50, 16.73s/it, Batch correction: kbet_per_label]\u001b[0m\u001b[A\n",
"Metrics: 70%|\u001b[34m███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \u001b[0m| 7/10 [01:42<00:50, 16.73s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \u001b[0m| 8/10 [01:42<00:25, 12.75s/it, Batch correction: graph_connectivity]\u001b[0m\u001b[A\n",
"Metrics: 80%|\u001b[34m████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ \u001b[0m| 8/10 [01:42<00:25, 12.75s/it, Batch correction: pcr_comparison]\u001b[0m\u001b[A\n",
"Embeddings: 67%|\u001b[32m█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \u001b[0m| 2/3 [04:39<02:13, 133.22s/it]\u001b[0m\u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s]\u001b[0m\u001b[A\n",
" \u001b[A\n",
"Metrics: 0%|\u001b[34m \u001b[0m| 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m█████████████████▍ \u001b[0m| 1/10 [00:17<02:38, 17.56s/it, Bio conservation: isolated_labels]\u001b[0m\u001b[A\n",
"Metrics: 10%|\u001b[34m████████████████ \u001b[0m| 1/10 [00:17<02:38, 17.56s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m████████████████████████████████ \u001b[0m| 2/10 [00:24<01:31, 11.40s/it, Bio conservation: nmi_ari_cluster_labels_kmeans]\u001b[0m\u001b[A\n",
"Metrics: 20%|\u001b[34m██████████████████████████████████▌ \u001b[0m| 2/10 [00:24<01:31, 11.40s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m███████████████████████████████████████████████████▉ \u001b[0m| 3/10 [00:42<01:39, 14.16s/it, Bio conservation: silhouette_label]\u001b[0m\u001b[A\n",
"Metrics: 30%|\u001b[34m██████████████████████████████████████████████████████ \u001b[0m| 3/10 [00:42<01:39, 14.16s/it, Bio conservation: clisi_knn]\u001b[0m\u001b[A\n",
"Metrics: 40%|\u001b[3
================================================
SYMBOL INDEX (56 symbols across 10 files)
================================================
FILE: data_proc/data_utils.py
function data_to_torch_X (line 35) | def data_to_torch_X(X):
class SincleCellDataset (line 42) | class SincleCellDataset(data.Dataset):
method __init__ (line 43) | def __init__(self,
method __getitem__ (line 80) | def __getitem__(self, idx):
method __len__ (line 99) | def __len__(self) -> int:
method get_dim (line 102) | def get_dim(self) -> Dict[str, int]:
function data_to_torch_X (line 106) | def data_to_torch_X(X):
function anndata_to_sc_dataset (line 114) | def anndata_to_sc_dataset(adata:sc.AnnData,
function adata_path_to_prot_chrom_starts (line 155) | def adata_path_to_prot_chrom_starts(adata, dataset_species, spec_pe_gene...
function process_raw_anndata (line 173) | def process_raw_anndata(row, h5_folder_path, npz_folder_path, scp, skip,
function get_species_to_pe (line 241) | def get_species_to_pe(EMBEDDING_DIR):
function get_spec_chrom_csv (line 271) | def get_spec_chrom_csv(path="/dfs/project/cross-species/yanay/code/all_t...
FILE: data_proc/download_proc_czi_cxg.py
function data_to_torch_X (line 29) | def data_to_torch_X(X):
function istarmap (line 47) | def istarmap(self, func, iterable, chunksize=1):
function process_row (line 90) | def process_row(row, num_genes, num_cells, paths, all_species, covar_col...
FILE: data_proc/gene_embeddings.py
function load_gene_embeddings_adata (line 30) | def load_gene_embeddings_adata(adata: AnnData, species: list, embedding_...
FILE: data_proc/generate_reduced_chrom_files.py
function padding_tensor (line 53) | def padding_tensor(sequences):
FILE: data_proc/preproc_many_dataset.py
function data_to_torch_X (line 31) | def data_to_torch_X(X):
class SincleCellDataset (line 38) | class SincleCellDataset(data.Dataset):
method __init__ (line 39) | def __init__(self,
method __getitem__ (line 76) | def __getitem__(self, idx):
method __len__ (line 95) | def __len__(self) -> int:
method get_dim (line 98) | def get_dim(self) -> Dict[str, int]:
function data_to_torch_X (line 102) | def data_to_torch_X(X):
function anndata_to_sc_dataset (line 110) | def anndata_to_sc_dataset(adata:sc.AnnData,
function proc (line 151) | def proc(args):
FILE: eval_data.py
class MultiDatasetSentences (line 17) | class MultiDatasetSentences(data.Dataset):
method __init__ (line 18) | def __init__(self, sorted_dataset_names, shapes_dict, args,
method __getitem__ (line 50) | def __getitem__(self, idx):
method __len__ (line 73) | def __len__(self) -> int:
method get_dim (line 76) | def get_dim(self) -> Dict[str, int]:
class MultiDatasetSentenceCollator (line 80) | class MultiDatasetSentenceCollator(object):
method __init__ (line 81) | def __init__(self, args):
method __call__ (line 85) | def __call__(self, batch):
function sample_cell_sentences (line 108) | def sample_cell_sentences(counts, batch_weights, dataset, args,
FILE: eval_single_anndata.py
function main (line 81) | def main(args, accelerator):
FILE: evaluate.py
class AnndataProcessor (line 33) | class AnndataProcessor:
method __init__ (line 34) | def __init__(self, args, accelerator):
method check_paths (line 64) | def check_paths(self):
method preprocess_anndata (line 91) | def preprocess_anndata(self):
method save_shapes_dict (line 108) | def save_shapes_dict(self, name, num_cells, num_genes, shapes_dict_path):
method generate_idxs (line 114) | def generate_idxs(self):
method run_evaluation (line 141) | def run_evaluation(self):
function get_ESM2_embeddings (line 149) | def get_ESM2_embeddings(args):
function padding_tensor (line 163) | def padding_tensor(sequences):
function run_eval (line 183) | def run_eval(adata, name, pe_idx_path, chroms_path, starts_path, shapes_...
FILE: model.py
function full_block (line 18) | def full_block(in_features, out_features, p_drop=0.1):
class PositionalEncoding (line 27) | class PositionalEncoding(nn.Module):
method __init__ (line 29) | def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = ...
method forward (line 41) | def forward(self, x: Tensor) -> Tensor:
class TransformerModel (line 50) | class TransformerModel(nn.Module):
method __init__ (line 52) | def __init__(self, token_dim: int, d_model: int, nhead: int, d_hid: int,
method forward (line 92) | def forward(self, src: Tensor, mask: Tensor):
method predict (line 110) | def predict(self, cell_embedding, gene_embeddings):
FILE: utils.py
function get_shapes_dict (line 16) | def get_shapes_dict(dataset_path):
function figshare_download (line 72) | def figshare_download(url, save_path):
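The scIB benchmark traces in the notebook output above (from examples/Benchmark Embeddings with scIB.ipynb — Computing neighbors → Embeddings → Metrics, split into "Bio conservation" and "Batch correction") match the scib-metrics Benchmarker workflow, run over the embeddings stored in `obsm` (`X_uce`, `X_scVI`, `X_scGPT`, `X_geneformer`) with `supercluster` as the cell-type key. A minimal sketch of that setup, assuming `scib_metrics.benchmark.Benchmarker` is the tool in use; the input path and the batch key (`experiment`) are assumptions, not taken from the notebook:

```
import anndata as ad
from scib_metrics.benchmark import Benchmarker

# Hypothetical path: any AnnData with the pre-computed embeddings in .obsm will do.
adata = ad.read_h5ad("brain_100k_with_embeddings.h5ad")

bm = Benchmarker(
    adata,
    batch_key="experiment",        # assumed batch covariate
    label_key="supercluster",      # matches "Running using CT key: supercluster"
    embedding_obsm_keys=["X_uce", "X_scVI", "X_scGPT", "X_geneformer"],
    n_jobs=4,
)
bm.benchmark()                     # computes the bio-conservation and batch-correction metrics
results = bm.get_results(min_max_scale=False)
print(results)                     # one row of metric scores per embedding
```

`bm.plot_results_table()` renders the same scores as a formatted comparison table if a visual summary is preferred.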