Safetensors
soumyatghosh committed on
Commit
4527b5f
·
verified ·
1 Parent(s): f901faa

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .DS_Store +0 -0
  2. .gitattributes +3 -0
  3. README.md +314 -0
  4. configs/README.md +3 -0
  5. data/sample_data.h5ad +3 -0
  6. data/sample_data_metadata.json +43 -0
  7. poetry.lock +0 -0
  8. pyproject.toml +178 -0
  9. scripts/preprocess_sample_data.sh +37 -0
  10. scripts/tokenize_sample_data.sh +39 -0
  11. teddy/.DS_Store +0 -0
  12. teddy/__init__.py +0 -0
  13. teddy/data_processing/__init__.py +0 -0
  14. teddy/data_processing/preprocessing/README.md +55 -0
  15. teddy/data_processing/preprocessing/__init__.py +0 -0
  16. teddy/data_processing/preprocessing/preprocess.py +516 -0
  17. teddy/data_processing/tokenization/README.md +58 -0
  18. teddy/data_processing/tokenization/__init__.py +0 -0
  19. teddy/data_processing/tokenization/tokenization.py +419 -0
  20. teddy/data_processing/utils/__init__.py +0 -0
  21. teddy/data_processing/utils/bio_annotations/__init__.py +0 -0
  22. teddy/data_processing/utils/bio_annotations/calculate_biostats.py +99 -0
  23. teddy/data_processing/utils/bio_annotations/data/all_filtered.json +1227 -0
  24. teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_cell_mapping.json +862 -0
  25. teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_disease_mapping.json +125 -0
  26. teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_sex_mapping.json +5 -0
  27. teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_tissue_mapping.json +415 -0
  28. teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/all_filtered_cell_probs.json +15 -0
  29. teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/all_filtered_disease_probs.json +13 -0
  30. teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/all_filtered_sex_probs.json +5 -0
  31. teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/all_filtered_tissue_probs.json +20 -0
  32. teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/cell_probs_for_classification.json +15 -0
  33. teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/disease_probs_for_classification.json +13 -0
  34. teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/sex_probs_for_classification.json +5 -0
  35. teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/tissue_probs_for_classification.json +20 -0
  36. teddy/data_processing/utils/gene_mapping/__init__.py +0 -0
  37. teddy/data_processing/utils/gene_mapping/data/2407_ensembl_processed.txt +0 -0
  38. teddy/data_processing/utils/gene_mapping/data/2407_hgnc_mapping.any2any.txt +3 -0
  39. teddy/data_processing/utils/gene_mapping/data/2407_mouse_gene_mapping.txt +0 -0
  40. teddy/data_processing/utils/gene_mapping/data/human_mapping.txt +3 -0
  41. teddy/data_processing/utils/gene_mapping/data/mouse_to_human_orthologs.one2one.txt +0 -0
  42. teddy/data_processing/utils/gene_mapping/gene_mapper.py +629 -0
  43. teddy/data_processing/utils/medians/data/teddy_gene_medians.json +0 -0
  44. teddy/models/.DS_Store +0 -0
  45. teddy/models/__init__.py +0 -0
  46. teddy/models/classification_heads.py +285 -0
  47. teddy/models/model_directory.py +53 -0
  48. teddy/models/teddy_g/.DS_Store +0 -0
  49. teddy/models/teddy_g/160M/added_tokens.json +7 -0
  50. teddy/models/teddy_g/160M/config.json +26 -0
.DS_Store ADDED
Binary file (6.15 kB). View file
 
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ data/sample_data.h5ad filter=lfs diff=lfs merge=lfs -text
37
+ teddy/data_processing/utils/gene_mapping/data/2407_hgnc_mapping.any2any.txt filter=lfs diff=lfs merge=lfs -text
38
+ teddy/data_processing/utils/gene_mapping/data/human_mapping.txt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,314 @@
1
+ # TEDDY: A Family of Foundation Models for Single-Cell Biology
2
+
3
+ This repository provides open-source code and configurations supporting the **TEDDY** project, as described in:
4
+
5
+ > **[TEDDY: A FAMILY OF FOUNDATION MODELS FOR UNDERSTANDING SINGLE CELL BIOLOGY](https://arxiv.org/abs/2503.03485)**
6
+
7
+ TEDDY leverages large-scale single-cell RNA sequencing (scRNA-seq) data (~116 million cells) to train transformer-based models. These models capture disease-related signals and generalize to diverse downstream tasks, including cross-donor and cross-disease classification.
8
+
9
+ ---
10
+
11
+ ## Table of Contents
12
+
13
+ 1. [Introduction](#introduction)
14
+ 2. [Project Goals & Paper Summary](#project-goals--paper-summary)
15
+ 3. [Pipeline Overview](#pipeline-overview)
16
+ 4. [Installation](#installation--setup)
17
+ 5. [Detailed Steps](#detailed-steps)
18
+ 1. [Preprocessing & Tokenization](#1-preprocessing--tokenization)
19
+ 2. [Loading TEDDY Models](#2-loading-teddy-models)
20
+ 6. [Running sample scripts on sample data](#running-sample-scripts-on-sample-data)
21
+ 7. [Running unit tests with pytest](#running-unit-tests-with-pytest)
22
+ 8. [Reference](#reference)
23
+
24
+ ---
25
+
26
+ ## Introduction
27
+
28
+ Single-cell RNA sequencing data can span hundreds of millions of cells, each expressing thousands of genes. **TEDDY** (*T*ransformer for *E*nabling *D*rug *D*iscovery) adapts masked language modeling and ontology classification to gene expression. By scaling both **data volume** and **model capacity** (up to ~400M parameters), TEDDY learns robust biological features that generalize to **unseen diseases**, **unseen donors**, and more.
29
+
30
+ ---
31
+
32
+ ## Project Goals & Paper Summary
33
+
34
+ Refer to the [paper](https://arxiv.org/abs/2503.03485) for full technical details. Key highlights:
35
+
36
+ - **Data**: 116 million single cells, spanning multiple tissues, diseases, and human/mouse species.
37
+ - **Models**:
38
+ - **TEDDY-G** (rank-based encoding)
39
+ - **TEDDY-X** (binned encoding)
40
+ - Sizes range from 10M to 400M parameters.
41
+ - **Annotation Supervision**: Additional labels (disease, tissue, cell type, etc.) further refine model representations.
42
+ - **Benchmarks**: “Held-out donors” and “held-out diseases” classification tasks showed significant gains over alternative foundation models.
43
+
44
+ (Note: This release only includes the most performant models: TEDDY-G 70M, TEDDY-G 160M, and TEDDY-G 400M)
45
+ ---
46
+
47
+ ## Pipeline Overview
48
+
49
+ The **TEDDY** pipeline involves three steps:
50
+
51
+ 1. **Preprocessing**
52
+ - Load `.h5ad` files, remove low-quality cells, normalize each cell's total expression counts to 10,000,
53
+ and scale each gene by its median value.
54
+ - Outputs a “processed” `.h5ad` file.
55
+
56
+ 2. **Tokenization**
57
+ - Converts each cell’s expression profile into integer tokens or rank-based embeddings.
58
+ - Can embed metadata tokens that can be used as ontologies in the model (e.g., `<disease>`, `<tissue_type>`, `<sex>`, `<cell_type>`) if needed.
59
+
60
+ 3. **Model Inference and Training**
61
+ - Uses the tokenized dataset to generate embeddings for cells and genes.
62
+ - Uses the tokenized dataset for masked language modeling plus ontology classification.
63
+ - Model config examples live in dedicated config files for relevant architectures.
64
+
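The normalization described in the preprocessing step can be illustrated with a small NumPy sketch. It is not taken from `preprocess.py`; it assumes cells are rows and genes are columns, and the function name `normalize_counts` and the toy median values are hypothetical.

```python
import numpy as np


def normalize_counts(counts, gene_medians, target_total=10_000):
    """Scale each cell (row) to `target_total` counts, then divide each gene (column) by its median."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    per_cell = counts / totals * target_total
    medians = np.asarray(gene_medians, dtype=float)
    return per_cell / medians


# Toy example: two cells, three genes (illustrative values only).
counts = np.array([[1, 3, 0], [2, 2, 4]])
medians = np.array([1000.0, 2500.0, 5000.0])
normalized = normalize_counts(counts, medians)
print(normalized)
```

After this kind of scaling, a value of 1.0 means a gene sits at its reference median for a cell of standard depth, which makes values comparable across genes.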
65
+ ---
66
+
67
+ ## Installation & Setup
68
+
69
+ **Building Your Environment**
70
+
71
+ ### 1. Clone the Repository
72
+ First, clone the repository to your local machine:
73
+ ```bash
74
+ git clone XXX  # update with the final public link
75
+ cd teddy-models
76
+ ```
77
+
78
+ ### 2. Environment Setup
79
+
80
+ - Fine-tuning and pretraining of these models were conducted on GPUs, so ensure your instance is properly configured before working with large datasets.
81
+
82
+ - Ensure you have ***Python 3.11.10*** installed. You can use `pyenv` to manage Python versions:
83
+ ```bash
84
+ pyenv install 3.11.10
85
+ pyenv local 3.11.10
86
+ ```
87
+ - More details on how to use `pyenv`: [pyenv documentation](https://github.com/pyenv/pyenv)
88
+
89
+
90
+ - If you don’t already have ***Poetry*** installed, you can install it using the following command:
91
+ ```bash
92
+ curl -sSL https://install.python-poetry.org | python3 -
93
+ export PATH="/PATH/TO/YOUR/USER/.local/bin:$PATH"
94
+ ```
95
+ - Check which Python interpreter is active (it should resolve to 3.11.10):
96
+ ```bash
97
+ pyenv which python
98
+ ```
99
+ - Change to correct version by running:
100
+ ```bash
101
+ poetry env use /PATH/TO/YOUR/USER/.pyenv/versions/3.11.10/bin/python
102
+ ```
103
+ - Run the following command to build the project and install its dependencies:
104
+ ```bash
105
+ poetry build
106
+ poetry install
107
+ ```
108
+ - Once the setup is complete, you can use the package.
109
+
110
+ ---
111
+
112
+ ## Detailed Steps
113
+
114
+ There are three ways to run **Preprocessing** and **Tokenization**:
115
+ 1. **Directly in Python** (importing the scripts)
116
+ 2. **Command-Line Arguments** (using flags)
117
+ 3. **JSON Config Files** (loading a `.json` with your parameters)
118
+
119
+ ### 1. Preprocessing & Tokenization
120
+ #### Directly in Python
121
+
122
+ Detailed [README.md for Preprocessing](teddy/data_processing/preprocessing/README.md) and [README.md for Tokenization](teddy/data_processing/tokenization/README.md) can be found in the related module folders.
123
+
124
+ **Preprocessing example**:
125
+
126
+ ```python
127
+ from teddy.data_processing.preprocessing.preprocess import preprocess
128
+
129
+ preprocessing_config = {
130
+ "min_gene_counts": 225,
131
+ "remove_assays": ["10x5' v1", "10x3' v1"],
132
+ "max_mitochondrial_prop": 10,
133
+ "remove_cell_types": [],
134
+ "hvg_method": None,
135
+ "normalized_total": 10000,
136
+ "median_dict": "teddy/data_processing/utils/medians/data/teddy_gene_medians.json",
137
+ "log1p": False,
138
+ "compute_medians": False,
139
+ "median_column": "index",
140
+ "reference_id_only": False,
141
+ "load_dir": "<PATH_TO_RAW_DATA_PARENT>",
142
+ "save_dir": "<PATH_TO_PROCESSED_DATA_PARENT>",
143
+ }
144
+
145
+ preprocess(
146
+ data_path="data/RAW_SAMPLES/my_data.h5ad",
147
+ metadata_path="data/RAW_SAMPLES/my_data_metadata.json",
148
+ hyperparameters=preprocessing_config
149
+ )
150
+ ```
151
+
152
+ These preprocessing arguments were used to prepare the corpus for pretraining the TEDDY models.
153
+
154
+ **Tokenization example**:
155
+
156
+ ```python
157
+ from teddy.data_processing.tokenization.tokenization import tokenize
158
+
159
+ tokenizer_config = {
160
+ "tokenizer_name_or_path": "teddy/models/teddy_g/400M",
161
+ "gene_id_column": "index",
162
+ "bio_annotations": True,
163
+ "disease_mapping": "teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_disease_mapping.json",
164
+ "tissue_mapping": "teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_tissue_mapping.json",
165
+ "cell_mapping": "teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_cell_mapping.json",
166
+ "sex_mapping": "teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_sex_mapping.json",
167
+ "max_shard_samples": 500,
168
+ "max_seq_len": 2048,
169
+ "pad_length": 2048,
170
+ "add_cls": False,
171
+ "bins": 0,
172
+ "continuous_rank": True,
173
+ "truncation_method": "max",
174
+ "add_disease_annotation": False,
175
+ "include_zero_genes": False,
176
+ "load_dir": "<PATH_TO_PROCESSED_DATA_PARENT>",
177
+ "save_dir": "<PATH_TO_TOKENIZED_DATA>"
178
+ }
179
+
180
+ tokenize(
181
+ data_path="outputs/preprocessed/my_data_preprocessed.h5ad",
182
+ metadata_path="outputs/preprocessed/my_data_preprocessed_metadata.json",
183
+ hyperparameters=tokenizer_config
184
+ )
185
+ ```
186
+
187
+ The above tokenization arguments were used for the TEDDY models.
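To make the rank-based settings more concrete, here is a minimal sketch of how a cell's expression vector could be turned into a fixed-length token sequence. This is an illustration under stated assumptions, not the logic in `tokenization.py`: it assumes that `"truncation_method": "max"` keeps the most highly expressed genes, and the helper `rank_tokenize`, the `pad_id` value, and the toy token ids are hypothetical.

```python
import numpy as np


def rank_tokenize(expression, gene_token_ids, max_seq_len=2048, pad_length=2048,
                  pad_id=0, include_zero_genes=False):
    """Order gene tokens by descending expression, truncate, and pad to a fixed length."""
    expression = np.asarray(expression, dtype=float)
    gene_token_ids = np.asarray(gene_token_ids)
    # Stable sort so that ties keep their original gene order.
    order = np.argsort(-expression, kind="stable")
    if not include_zero_genes:
        order = order[expression[order] > 0]
    order = order[:max_seq_len]
    tokens = gene_token_ids[order].tolist()
    tokens += [pad_id] * (pad_length - len(tokens))
    return tokens


# Toy example: four genes with hypothetical token ids 11..14.
tokens = rank_tokenize([0.0, 5.0, 2.0, 9.0], [11, 12, 13, 14], max_seq_len=3, pad_length=4)
print(tokens)
```

In this sketch the unexpressed gene is dropped before truncation, so padding only fills positions that no expressed gene occupies.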
188
+
189
+ #### By Saving a `config.json` and Running It with Bash
190
+
191
+ Example **preprocess_config.json**:
192
+
193
+ ```json
194
+ {
195
+ "min_gene_counts": null,
196
+ "remove_assays": [],
197
+ "max_mitochondrial_prop": null,
198
+ "remove_cell_types": [],
199
+ "hvg_method": null,
200
+ "normalized_total": 10000,
201
+ "median_dict": "teddy/data_processing/utils/medians/data/teddy_gene_medians.json",
202
+ "log1p": false,
203
+ "compute_medians": false,
204
+ "median_column": "index",
205
+ "reference_id_only": false,
206
+ "load_dir": "<PATH_TO_RAW_DATA_PARENT>",
207
+ "save_dir": "<PATH_TO_PROCESSED_DATA_PARENT>"
208
+ }
209
+ ```
210
+ Run:
211
+ ```bash
212
+ python teddy/data_processing/preprocessing/preprocess.py \
213
+ --data_path data/RAW_SAMPLES/my_data.h5ad \
214
+ --metadata_path data/RAW_SAMPLES/my_data_metadata.json \
215
+ --config_path preprocess_config.json
216
+ ```
217
+ (The same approach works for tokenization: save a `tokenize_config.json` and pass `--config_path tokenize_config.json`.)
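For readers curious how a config-driven entry point of this kind typically works, the sketch below parses three flags and loads the JSON hyperparameters. It assumes the flag names `--data_path`, `--metadata_path`, and `--config_path` used in `scripts/preprocess_sample_data.sh`; the functions `parse_cli` and `load_hyperparameters` are hypothetical and do not come from the repository.

```python
import argparse
import json
import os
import tempfile


def parse_cli(argv=None):
    """Parse the three paths a config-driven run needs."""
    parser = argparse.ArgumentParser(description="Hypothetical config-driven entry point")
    parser.add_argument("--data_path", required=True)
    parser.add_argument("--metadata_path", required=True)
    parser.add_argument("--config_path", required=True)
    return parser.parse_args(argv)


def load_hyperparameters(config_path):
    """Read the JSON config; JSON null/false become Python None/False."""
    with open(config_path) as handle:
        return json.load(handle)


# Demonstration with a temporary config file.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "preprocess_config.json")
    with open(config_path, "w") as handle:
        json.dump({"normalized_total": 10000, "log1p": False, "hvg_method": None}, handle)
    args = parse_cli([
        "--data_path", "my_data.h5ad",
        "--metadata_path", "my_data_metadata.json",
        "--config_path", config_path,
    ])
    hyperparameters = load_hyperparameters(args.config_path)

print(hyperparameters)
```

The loaded dictionary has the same shape as the `preprocessing_config` dictionary in the Python example, so it could be passed on as the `hyperparameters` argument.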
218
+
219
+ #### By Creating a `.sh` File and Executing It (With Poetry)
220
+
221
+ You can find an example in **scripts/preprocess_sample_data.sh**:
222
+ ```bash
223
+ #!/bin/bash -l
224
+
225
+ # (Optional) Activate your Poetry environment
226
+ poetry shell
227
+
228
+ # 1) Generate a JSON config file on the fly
229
+ cat <<EOF > configs/my_preprocess_config.json
230
+ {
231
+ "load_dir": "data",
232
+ "save_dir": "data/processed",
233
+
234
+ "min_gene_counts": null,
235
+ "remove_assays": [],
236
+ "max_mitochondrial_prop": null,
237
+ "remove_cell_types": [],
238
+ "hvg_method": null,
239
+ "normalized_total": null,
240
+
241
+ "median_dict": "teddy/data_processing/utils/medians/data/teddy_gene_medians.json",
242
+ "log1p": false,
243
+ "compute_medians": false,
244
+ "median_column": "index",
245
+
246
+ "reference_id_only": false
247
+ }
248
+ EOF
249
+
250
+ # 2) Call preprocess.py, explicitly passing data_path, metadata_path, and config_path
251
+ python teddy/data_processing/preprocessing/preprocess.py \
252
+ --data_path data/sample_data.h5ad \
253
+ --metadata_path data/sample_data_metadata.json \
254
+ --config_path configs/my_preprocess_config.json
255
+ ```
256
+ Then do:
257
+ ```bash
258
+ chmod +x scripts/preprocess_sample_data.sh
259
+ ./scripts/preprocess_sample_data.sh
260
+ ```
261
+ ---
262
+ You can override any parameter by specifying command-line arguments, editing the `.json`, or updating the Python dictionary.
263
+
264
+ (The same approach applies to tokenization; see the example in **scripts/tokenize_sample_data.sh**.)
265
+
266
+ ### 2. Loading TEDDY Models
267
+
268
+ If you want to load a trained TEDDY model in your Python code, you can do so with the following snippet:
269
+
270
+ ```python
271
+ from teddy.models.model_directory import get_architecture, model_dict
272
+
273
+ model_name_or_path = 'teddy/models/teddy_g/400M' # or local path to model files
274
+ arch = get_architecture(model_name_or_path)
275
+ config_cls = model_dict[arch]["config_cls"]
276
+ model_cls = model_dict[arch]["model_cls"]
277
+
278
+ # Load the configuration and model
279
+ config = config_cls.from_pretrained(model_name_or_path)
280
+ model = model_cls.from_pretrained(model_name_or_path, config=config)
281
+ # model is now ready for inference or further fine-tuning
282
+ ```
283
+ You can then perform inference, fine-tuning, or evaluation with the model object as needed.
284
+
285
+ ## Running sample scripts on sample data
286
+
287
+ The `scripts` directory contains sample scripts that preprocess and tokenize the sample data in the `data` directory. To use your own data, replace the files in the `data` directory and update the file paths in the scripts as needed.
288
+
289
+ To run the scripts included, run the following commands from the root of the `teddy-models` repository.
290
+
291
+ ```bash
292
+ chmod +x scripts/*
293
+ ./scripts/preprocess_sample_data.sh
294
+ ./scripts/tokenize_sample_data.sh
295
+ ```
296
+
297
+ ## Running unit tests with pytest
298
+
299
+ To run the unit tests in the repository, run `poetry run pytest`. All tests should pass; runtime warnings are expected with the simulated test data.
300
+
301
+ ## Reference
302
+
303
+ Reference to cite when you use TEDDY:
304
+
305
+ ```bibtex
306
+ @misc{chevalier2025teddyfamilyfoundationmodels,
307
+ title={TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology},
308
+ author={Alexis Chevalier and Soumya Ghosh and Urvi Awasthi and James Watkins and Julia Bieniewska and Nichita Mitrea and Olga Kotova and Kirill Shkura and Andrew Noble and Michael Steinbaugh and Julien Delile and Christoph Meier and Leonid Zhukov and Iya Khalil and Srayanta Mukherjee and Judith Mueller},
309
+ year={2025},
310
+ eprint={2503.03485},
311
+ archivePrefix={arXiv},
312
+ primaryClass={cs.LG},
313
+ url={https://arxiv.org/abs/2503.03485}
+ }
314
+ ```
configs/README.md ADDED
@@ -0,0 +1,3 @@
1
+ # Configuration files
2
+
3
+ This project includes scripts that generate and utilize configuration files. These configuration files are essential for the proper functioning of the scripts and are automatically saved in this designated directory.
data/sample_data.h5ad ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:15c61493446545284e86f6dfd276a6ba6825ca15e4a1bf2333be3149de6bc330
3
+ size 36585033
data/sample_data_metadata.json ADDED
@@ -0,0 +1,43 @@
1
+ {
2
+ "donor_id": [
3
+ "Subject2",
4
+ "Subject3",
5
+ "Subject1",
6
+ "CTRL-2",
7
+ "CTRL-4"
8
+ ],
9
+ "cell_count": 3538,
10
+ "organism": [
11
+ "Homo sapiens"
12
+ ],
13
+ "sampling_parameters": [
14
+ {
15
+ "train_prop": 1.0,
16
+ "test_prop": 0.0,
17
+ "test_donors_prop": 0.0,
18
+ "train_donors": [
19
+ "Subject4",
20
+ "Subject5",
21
+ "Subject6",
22
+ "Subject7",
23
+ "Subject8",
24
+ "CTRL-1",
25
+ "CTRL-3",
26
+ "CTRL-5",
27
+ "CTRL-6",
28
+ "CTRL-7",
29
+ "CTRL-8"
30
+ ],
31
+ "test_donors": [
32
+ "Subject1",
33
+ "Subject2",
34
+ "Subject3",
35
+ "CTRL-2",
36
+ "CTRL-4"
37
+ ],
38
+ "include_nonhuman": false,
39
+ "load_dir": "public/data/cellxgene_single_datasets/alzheimers/transcriptomiscs_tangle_neurons/original/inhibitory/",
40
+ "save_dir": "public/data/cellxgene_single_datasets/paper_evaluations/alzheimers/transcriptomiscs_tangle_neurons/inhibitory/sampled/split1/"
41
+ }
42
+ ]
43
+ }
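A quick sanity check on a metadata file of this shape is to confirm that the train and test donor lists do not overlap and to see which split the donors present in the file belong to. The sketch below does this on a trimmed copy of the values above; `summarize_split` is a hypothetical helper and is not part of the repository.

```python
def summarize_split(metadata):
    """Report donor overlap between splits and where the file's own donors fall."""
    params = metadata["sampling_parameters"][0]
    train = set(params["train_donors"])
    test = set(params["test_donors"])
    present = set(metadata["donor_id"])
    return {
        "overlap": sorted(train & test),
        "present_in_train": sorted(present & train),
        "present_in_test": sorted(present & test),
    }


# Trimmed copy of the sample metadata (donor fields only).
metadata = {
    "donor_id": ["Subject2", "Subject3", "Subject1", "CTRL-2", "CTRL-4"],
    "sampling_parameters": [
        {
            "train_donors": ["Subject4", "Subject5", "Subject6", "Subject7", "Subject8",
                             "CTRL-1", "CTRL-3", "CTRL-5", "CTRL-6", "CTRL-7", "CTRL-8"],
            "test_donors": ["Subject1", "Subject2", "Subject3", "CTRL-2", "CTRL-4"],
        }
    ],
}

summary = summarize_split(metadata)
print(summary)
```

For this sample, the five donors listed under `donor_id` are exactly the held-out test donors, and no donor appears in both splits.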
poetry.lock ADDED
The diff for this file is too large to render. See raw diff
 
pyproject.toml ADDED
@@ -0,0 +1,178 @@
1
+ [tool.poetry]
2
+ name = "teddy"
3
+ version = "0.1.0"
4
+ description = "A module for fine-tuning and preprocessing foundational models."
5
+ authors = ["Olga Kotova <[email protected]>"]
6
+ license = "MIT"
7
+ readme = "README.md"
8
+
9
+ [tool.poetry.dependencies]
10
+ python = "3.11.10"
11
+ accelerate = "0.30.1"
12
+ aiohttp = "3.9.5"
13
+ aiosignal = "1.3.1"
14
+ alembic = "1.13.2"
15
+ aniso8601 = "9.0.1"
16
+ anndata = "0.10.7"
17
+ attrs = "23.2.0"
18
+ azure-common = "1.1.28"
19
+ azure-core = "1.30.1"
20
+ azure-mgmt-core = "1.4.0"
21
+ azure-mgmt-storage = "21.1.0"
22
+ azure-storage-blob = "12.20.0"
23
+ beautifulsoup4 = "4.12.3"
24
+ blinker = "1.8.2"
25
+ boto3 = "1.34.112"
26
+ botocore = "1.34.112"
27
+ cachetools = "5.3.3"
28
+ certifi = "2024.7.4"
29
+ cffi = "1.16.0"
30
+ charset-normalizer = "3.3.2"
31
+ click = "8.1.7"
32
+ cloudpickle = "3.0.0"
33
+ contourpy = "1.2.1"
34
+ cryptography = "44.0.0"
35
+ cycler = "0.12.1"
36
+ datasets = "2.19.1"
37
+ deprecated = "1.2.14"
38
+ dill = "0.3.8"
39
+ docker = "7.1.0"
40
+ docker-pycreds = "0.4.0"
41
+ fabric = "3.2.2"
42
+ filelock = "3.14.0"
43
+ flask = "3.0.3"
44
+ fonttools = "4.51.0"
45
+ frozenlist = "1.4.1"
46
+ fsspec = "2024.3.1"
47
+ gdown = "5.2.0"
48
+ gitdb = "4.0.11"
49
+ gitpython = "3.1.43"
50
+ graphene = "3.3"
51
+ graphql-core = "3.2.3"
52
+ graphql-relay = "3.2.0"
53
+ greenlet = "3.0.3"
54
+ gunicorn = "22.0.0"
55
+ h5py = "3.11.0"
56
+ huggingface-hub = "0.23.1"
57
+ hyperopt = "0.1.2"
58
+ idna = "3.7"
59
+ igraph = "0.11.5"
60
+ isodate = "0.6.1"
61
+ itsdangerous = "2.2.0"
62
+ jinja2 = "3.1.4"
63
+ jmespath = "1.0.1"
64
+ joblib = "1.4.2"
65
+ kiwisolver = "1.4.5"
66
+ legacy-api-wrap = "1.4"
67
+ leidenalg = "0.10.2"
68
+ llvmlite = "0.42.0"
69
+ mako = "1.3.5"
70
+ markdown = "3.6"
71
+ markupsafe = "2.1.5"
72
+ matplotlib = "3.9.0"
73
+ mlflow = "2.16.0"
74
+ mpmath = "1.3.0"
75
+ multidict = "6.0.5"
76
+ multiprocess = "0.70.16"
77
+ natsort = "8.4.0"
78
+ networkx = "3.3"
79
+ numba = "0.59.1"
80
+ numpy = "1.26.4"
81
+ opentelemetry-api = "1.25.0"
82
+ opentelemetry-sdk = "1.25.0"
83
+ opentelemetry-semantic-conventions = "0.46b0"
84
+ pandas = "2.2.2"
85
+ patsy = "0.5.6"
86
+ pillow = "10.3.0"
87
+ protobuf = "4.25.3"
88
+ psutil = "5.9.8"
89
+ pyarrow = "15.0.2"
90
+ pycparser = "2.22"
91
+ pydot = "2.0.0"
92
+ pymongo = "4.7.2"
93
+ pynndescent = "0.5.12"
94
+ pyparsing = "3.1.2"
95
+ pysocks = "1.7.1"
96
+ python-box = "7.1.1"
97
+ python-dateutil = "2.9.0.post0"
98
+ pytz = "2024.1"
99
+ pyyaml = "6.0.1"
100
+ regex = "2024.5.15"
101
+ requests = "2.32.2"
102
+ s3transfer = "0.10.1"
103
+ safetensors = "0.4.3"
104
+ scanpy = "1.10.1"
105
+ scib = "1.1.5"
106
+ scikit-learn = "1.5.0"
107
+ scikit-misc = "0.3.1"
108
+ scipy = "1.13.0"
109
+ scvi = "0.6.8"
110
+ seaborn = "0.13.2"
111
+ sentry-sdk = "2.8.0"
112
+ session-info = "1.0.0"
113
+ setproctitle = "1.3.3"
114
+ smmap = "5.0.1"
115
+ soupsieve = "2.5"
116
+ sqlalchemy = "2.0.31"
117
+ sqlparse = "0.5.0"
118
+ statsmodels = "0.14.2"
119
+ sympy = "1.12"
120
+ texttable = "1.7.0"
121
+ threadpoolctl = "3.5.0"
122
+ tokenizers = "0.19.1"
123
+ torch = "^2.3.0 || >=2.0.1"
124
+ torchtext = "^0.18.0 || >=0.15.2"
125
+ torchvision = "^0.18.0 || >=0.15.2"
126
+ tqdm = "4.66.4"
127
+ transformers = "4.41.0"
128
+ tzdata = "2024.1"
129
+ umap-learn = "0.5.6"
130
+ urllib3 = "2.2.2"
131
+ wandb = "0.17.0"
132
+ werkzeug = "3.0.6"
133
+ wrapt = "1.16.0"
134
+ xxhash = "3.4.1"
135
+ yarl = "1.9.4"
136
+ jupyter = "^1.1.1"
137
+ ipykernel = "^6.29.5"
138
+ tensorboard = "^2.19.0"
139
+ pydantic = "^2.10.6"
140
+
141
+ [tool.poetry.group.dev.dependencies]
142
+ pytest = "^7.0"
143
+ black = "^24.3"
144
+ isort = "^5.0"
145
+ ruff = "^0.0.286"
146
+ pre-commit = "^4.0.1"
147
+
148
+ [build-system]
149
+ requires = ["poetry-core>=1.0.0"]
150
+ build-backend = "poetry.core.masonry.api"
151
+
152
+ [tool.black]
153
+ skip-string-normalization = true
154
+ line-length = 120
155
+
156
+ [tool.ruff]
157
+ # Same as Black.
158
+ line-length = 120
159
+ exclude = ["jupyter_notebook_config.py"]
160
+ select = [
161
+ "E", # pycodestyle errors (settings from FastAPI, thanks, @tiangolo!)
162
+ "W", # pycodestyle warnings
163
+ "F", # pyflakes
164
+ "I", # isort
165
+ "C", # flake8-comprehensions
166
+ "B", # flake8-bugbear
167
+ ]
168
+ ignore = [
169
+ "E501", # line too long, handled by black
170
+ "C901", # too complex
171
+ ]
172
+
173
+ [tool.ruff.isort]
174
+ order-by-type = true
175
+ relative-imports-order = "closest-to-furthest"
176
+ extra-standard-library = ["typing"]
177
+ section-order = ["future", "standard-library", "third-party", "first-party", "local-folder"]
178
+ known-first-party = []
scripts/preprocess_sample_data.sh ADDED
@@ -0,0 +1,37 @@
1
+ #!/bin/bash -l
2
+
3
+ # (Optional) Activate your Poetry environment
4
+ poetry shell
5
+
6
+ # Generate a timestamp string (e.g., 20230404123056)
7
+ TS=$(date '+%Y%m%d%H%M%S')
8
+
9
+ CONFIG_FILE="configs/preprocessing_config_${TS}.json"
10
+
11
+ # 1) Generate a JSON config file on the fly
12
+ cat <<EOF > "$CONFIG_FILE"
13
+ {
14
+ "load_dir": "data",
15
+ "save_dir": "data/processed",
16
+
17
+ "min_gene_counts": null,
18
+ "remove_assays": [],
19
+ "max_mitochondrial_prop": null,
20
+ "remove_cell_types": [],
21
+ "hvg_method": null,
22
+ "normalized_total": null,
23
+
24
+ "median_dict": "teddy/data_processing/utils/medians/data/teddy_gene_medians.json",
25
+ "log1p": false,
26
+ "compute_medians": false,
27
+ "median_column": "index",
28
+
29
+ "reference_id_only": false
30
+ }
31
+ EOF
32
+
33
+ # 2) Call preprocess.py, explicitly passing data_path, metadata_path, and config_path
34
+ python teddy/data_processing/preprocessing/preprocess.py \
35
+ --data_path data/sample_data.h5ad \
36
+ --metadata_path data/sample_data_metadata.json \
37
+ --config_path "$CONFIG_FILE"
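The script above writes a timestamped config file from the shell. For readers who prefer to drive the pipeline from Python, a rough equivalent is sketched below. `write_timestamped_config` is a hypothetical helper, not part of the repository, and the example writes into a temporary directory rather than `configs/`.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def write_timestamped_config(config, directory="configs", prefix="preprocessing_config", now=None):
    """Write `config` as JSON to <directory>/<prefix>_<YYYYmmddHHMMSS>.json and return the path."""
    now = now or datetime.now()
    path = Path(directory) / f"{prefix}_{now.strftime('%Y%m%d%H%M%S')}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


# Demonstration with a fixed timestamp so the file name is predictable.
with tempfile.TemporaryDirectory() as tmp:
    config_file = write_timestamped_config(
        {"load_dir": "data", "save_dir": "data/processed", "log1p": False},
        directory=Path(tmp) / "configs",
        now=datetime(2023, 4, 4, 12, 30, 56),
    )
    config_name = config_file.name
    reloaded = json.loads(config_file.read_text())

print(config_name)
```

Timestamping each config keeps a record of the exact arguments used for every run instead of overwriting a single file.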
scripts/tokenize_sample_data.sh ADDED
@@ -0,0 +1,39 @@
1
+ #!/bin/bash -l
2
+
3
+ # Activate the Poetry environment (adjust this if needed)
4
+ poetry shell
5
+
6
+ # Generate a timestamp string (e.g., 20230404123056)
7
+ TS=$(date '+%Y%m%d%H%M%S')
8
+
9
+ CONFIG_FILE="configs/tokenization_config_${TS}.json"
10
+
11
+ # Create the config file containing your tokenization arguments
12
+ cat <<EOF > "$CONFIG_FILE"
13
+ {
14
+ "tokenizer_name_or_path": "teddy/models/teddy_g/400M",
15
+ "gene_id_column": "index",
16
+ "bio_annotations": true,
17
+ "disease_mapping": "teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_disease_mapping.json",
18
+ "tissue_mapping": "teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_tissue_mapping.json",
19
+ "cell_mapping": "teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_cell_mapping.json",
20
+ "sex_mapping": "teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_sex_mapping.json",
21
+ "max_shard_samples": 500,
22
+ "max_seq_len": 2048,
23
+ "pad_length": 2048,
24
+ "add_cls": false,
25
+ "bins": 0,
26
+ "continuous_rank": true,
27
+ "add_disease_annotation": false,
28
+ "include_zero_genes": false,
29
+ "load_dir": "data/processed",
30
+ "save_dir": "data/tokenized"
31
+ }
32
+ EOF
33
+
34
+ # Execute the tokenization.py script with three arguments:
35
+ # --data_path, --metadata_path, and --config_path
36
+ python teddy/data_processing/tokenization/tokenization.py \
37
+ --data_path data/processed/sample_data.h5ad \
38
+ --metadata_path data/processed/sample_data_metadata.json \
39
+ --config_path "$CONFIG_FILE"
teddy/.DS_Store ADDED
Binary file (6.15 kB). View file
 
teddy/__init__.py ADDED
File without changes
teddy/data_processing/__init__.py ADDED
File without changes
teddy/data_processing/preprocessing/README.md ADDED
@@ -0,0 +1,55 @@
1
+ # PreprocessReadMe.md
2
+
3
+ The `preprocess.py` script is designed to preprocess gene expression data for use in our models. It takes in `data.raw.X` or `data.X` data, applies various preprocessing techniques, and prepares it for training or inference.
4
+
5
+ # General Workflow
6
+ The script follows these main steps:
7
+ 0. **Load Data and Metadata**: The script starts by loading the gene expression data from an AnnData file and metadata from a JSON file.
8
+ 1. **Set Raw Layer**: It checks whether `data.raw.X` is set in the AnnData object. If not, it sets it from the integer counts in `data.X`.
9
+ 2. **Initialize Processed Layer**: It initializes `data.layers['processed']` in the AnnData object, which is the layer that preprocessing modifies.
10
+ 3. **Filter Genes by Reference ID**: It filters genes based on reference IDs if specified in the hyperparameters.
11
+ 4. **Remove Assays**: It removes specified assays from the data.
12
+ 5. **Filter Cells by Gene Counts**: It filters out cells with gene counts below a specified threshold.
13
+ 6. **Filter Cells by Mitochondrial Fraction**: It removes cells with a high mitochondrial gene fraction.
14
+ 7. **Filter Highly Variable Genes**: It filters genes to retain only highly variable ones using specified methods.
15
+ 8. **Normalize Data**: It normalizes each cell (row) so that its total expression sums to the target value given by `normalized_total`.
16
+ 9. **Scale Columns by Median**: It scales columns based on median values from a specified dictionary.
17
+ 10. **Log Transform**: It applies a log1p transformation, log(1 + x), to the data if `log1p` is enabled.
18
+ 11. **Compute Medians**: It computes and saves medians of the processed data if specified.
19
+ 12. **Update Metadata**: It updates the metadata with cell counts and processing arguments.
20
+ 13. **Save and Cleanup**: It saves the processed data and metadata to disk and performs garbage collection.
21
+
22
+
23
+ # Preprocessing Arguments
24
+ The script uses several preprocessing arguments to control its behavior. Here is an explanation of each argument and the steps they influence:
25
+
26
+ - `reference_id_only`
27
+ - Description: Specifies whether to filter genes by reference ID.
28
+ - Impact: If enabled, the script filters genes based on reference IDs.
29
+ - `remove_assays`
30
+ - Description: List of assays to remove from the data.
31
+ - Impact: The script removes specified assays from the data.
32
+ - `min_gene_counts`
33
+ - Description: Minimum gene counts required for cells to be retained.
34
+ - Impact: The script filters out cells with gene counts below this threshold.
35
+ - `max_mitochondrial_prop`
36
+ - Description: Maximum mitochondrial gene fraction allowed for cells.
37
+ - Impact: The script removes cells with a mitochondrial gene fraction above this threshold.
38
+ - `hvg_method`
39
+ - Description: Method to use for filtering highly variable genes.
40
+ - Impact: The script filters genes to retain only highly variable ones using the specified method.
41
+ - `normalized_total`
42
+ - Description: Value to normalize the total gene expression to.
43
+ - Impact: The script rescales each cell (row) so that its total expression sums to this value.
44
+ - `median_dict`
45
+ - Description: Path to a JSON file containing median values for scaling columns.
46
+ - Impact: The script scales columns based on median values from the specified dictionary.
47
+ - `median_column`
48
+ - Description: Column name to use for looking up median values.
49
+ - Impact: The script uses this column to look up median values for scaling.
50
+ - `log1p`
51
+ - Description: Indicates whether to apply a log1p transformation, log(1 + x), to the data.
52
+ - Impact: If enabled, the script applies the log1p transformation to the data.
53
+ - `compute_medians`
54
+ - Description: Indicates whether to compute and save medians of the processed data.
55
+ - Impact: If enabled, the script computes and saves medians of the processed data.
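As an illustration of how `min_gene_counts` and `max_mitochondrial_prop` could act on a count matrix, here is a minimal NumPy sketch. It is not the code in `preprocess.py` and rests on several assumptions: "gene counts" is read as the number of genes detected per cell, the mitochondrial threshold is treated as a percentage, and mitochondrial genes are identified by an `MT-` prefix. The function `filter_cells` is hypothetical.

```python
import numpy as np


def filter_cells(counts, gene_names, min_gene_counts=None, max_mitochondrial_pct=None):
    """Return a boolean mask of cells (rows) that pass both optional filters."""
    counts = np.asarray(counts, dtype=float)
    keep = np.ones(counts.shape[0], dtype=bool)
    if min_gene_counts is not None:
        # Assumption: count the genes with non-zero expression in each cell.
        detected = (counts > 0).sum(axis=1)
        keep &= detected >= min_gene_counts
    if max_mitochondrial_pct is not None:
        # Assumption: mitochondrial genes carry an "MT-" prefix.
        is_mito = np.array([name.upper().startswith("MT-") for name in gene_names])
        totals = np.maximum(counts.sum(axis=1), 1.0)  # guard against empty cells
        mito_pct = 100.0 * counts[:, is_mito].sum(axis=1) / totals
        keep &= mito_pct <= max_mitochondrial_pct
    return keep


# Toy example: three cells, three genes.
gene_names = ["MT-CO1", "ACTB", "GAPDH"]
counts = np.array([
    [1, 10, 9],  # 5% mitochondrial, 3 genes detected
    [8, 1, 1],   # 80% mitochondrial
    [0, 3, 0],   # only 1 gene detected
])
keep = filter_cells(counts, gene_names, min_gene_counts=2, max_mitochondrial_pct=10)
print(keep)
```

Passing `None` for either threshold skips that filter, which mirrors the `null` values in the sample JSON configs.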
teddy/data_processing/preprocessing/__init__.py ADDED
File without changes
teddy/data_processing/preprocessing/preprocess.py ADDED
@@ -0,0 +1,516 @@
1
+ """
2
+ Module: preprocess.py
3
+
4
+ This module provides a preprocessing pipeline for single-cell RNA sequencing (scRNA-seq) data
5
+ stored in AnnData format. It includes functions for loading data, filtering cells and genes,
6
+ normalizing and scaling data, and saving processed results. The pipeline is designed to be
7
+ configurable via hyperparameters and supports various preprocessing steps such as mitochondrial
8
+ gene filtering, highly variable gene selection, and log transformation.
9
+
10
+ Main Features:
11
+ - Load and preprocess scRNA-seq data in AnnData format.
12
+ - Filter cells and genes based on various criteria.
13
+ - Normalize, scale, and log-transform data.
14
+ - Save processed data and metadata to disk.
15
+ - Configurable via JSON-based hyperparameters.
16
+
17
+ Dependencies:
18
+ - anndata, numpy, pandas, scanpy, scipy, sklearn
19
+
20
+ Usage:
21
+ - Run this script as a standalone program with a configuration file specifying the hyperparameters.
22
+ - Import the `preprocess` function and call it with the data path, metadata path, and hyperparameters.
23
+ """
24
+
25
+ import gc
26
+ import json
27
+ import os
28
+ import warnings
29
+ from argparse import ArgumentParser
30
+ from typing import Sequence, Optional, Union
31
+ from pathlib import Path
32
+
33
+ import anndata as ad
34
+ import numpy as np
35
+ import pandas as pd
36
+ import scanpy as sc
37
+ from anndata import ImplicitModificationWarning
38
+ import scipy.sparse as sp
39
+ from scipy.sparse import csr_matrix, issparse
40
+ from sklearn.utils import sparsefuncs, sparsefuncs_fast
41
+
42
+ from teddy.data_processing.utils.gene_mapping.gene_mapper import (
43
+ map_mouse_human,
44
+ map_mouse_human2,
45
+ )
46
+
47
+ # --- 1. Reference list of the 37 human mitochondrial genes (Ensembl IDs) -----
48
+ _HUMAN_MITO_ENSEMBL = {
49
+ "ENSG00000211459", "ENSG00000210082", # rRNAs
50
+ # tRNAs (22)
51
+ "ENSG00000210049", "ENSG00000210077", "ENSG00000209082",
52
+ "ENSG00000210100", "ENSG00000210107", "ENSG00000210112",
53
+ "ENSG00000210119", "ENSG00000210122", "ENSG00000210116",
54
+ "ENSG00000210117", "ENSG00000210118", "ENSG00000210124",
55
+ "ENSG00000210126", "ENSG00000210134", "ENSG00000210135",
56
+ "ENSG00000210142", "ENSG00000210144", "ENSG00000210148",
57
+ "ENSG00000210150", "ENSG00000210155", "ENSG00000210196",
58
+ "ENSG00000210151",
59
+ # protein-coding (13)
60
+ "ENSG00000198888", "ENSG00000198763", "ENSG00000198840",
61
+ "ENSG00000198886", "ENSG00000212907", "ENSG00000198786",
62
+ "ENSG00000198695", "ENSG00000198804", "ENSG00000198712",
63
+ "ENSG00000198938", "ENSG00000198899", "ENSG00000228253",
64
+ "ENSG00000198727",
65
+ }
66
+
67
+ _HUMAN_MITO_SYMBOLS = {
68
+ "MT-RNR1", "MT-RNR2", "MT-TF", "MT-TV", "MT-TL1", "MT-TI", "MT-TQ",
69
+ "MT-TM", "MT-TW", "MT-TA", "MT-TN", "MT-TC", "MT-TY", "MT-TD", "MT-TK",
70
+ "MT-TG", "MT-TR", "MT-TH", "MT-TS2", "MT-TL2", "MT-TT", "MT-TE", "MT-TP",
71
+ "MT-TS1", "MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4", "MT-ND4L", "MT-ND5",
72
+ "MT-ND6", "MT-CO1", "MT-CO2", "MT-CO3", "MT-ATP6", "MT-ATP8", "MT-CYB",
73
+ }
74
+
75
+
76
+ def load_data_and_metadata(data_path: str, metadata_path: str):
77
+ """
78
+ Load an AnnData h5ad file (data) and a JSON file (metadata).
79
+ """
80
+ data = ad.read_h5ad(data_path)
81
+ with open(metadata_path, "r") as f:
82
+ metadata = json.load(f)
83
+ return data, metadata
84
+
85
+
86
+ def set_raw_if_necessary(data: ad.AnnData):
87
+ """
88
+ If data.raw is None, check whether the first 64 rows of the 'counts' layer (or, failing that, data.X) are integer-valued.
89
+ If so, populate data.raw from that matrix and return data. Otherwise return None (skip).
90
+ """
91
+ if data.raw is not None:
92
+ return data # Already has raw
93
+ # If there is a 'counts' layer
94
+ if 'counts' in data.layers:
95
+ X = data.layers['counts']
96
+ # convert only 64 rows instead of converting the whole thing
97
+ if isinstance(X, np.ndarray):
98
+ X_sample = X[:64]
99
+ elif issparse(X):
100
+ X_sample = X[:64].toarray()
101
+ # Check first 64 rows for integrality
102
+ if np.all(np.equal(np.mod(X_sample, 1), 0)):
103
+ data.raw = ad.AnnData(X = data.layers['counts'], var = data.var.copy())
104
+ return data
105
+ # If above steps fail, check that data.X has raw counts already
106
+ X = data.X
107
+ # convert only 64 rows instead of converting the whole thing
108
+ if isinstance(X, np.ndarray):
109
+ X_sample = X[:64]
110
+ elif issparse(X):
111
+ X_sample = X[:64].toarray()
112
+ # Check first 64 rows for integrality
113
+ if np.all(np.equal(np.mod(X_sample, 1), 0)):
114
+ data.raw = data
115
+ return data
116
+ else:
117
+ print("No integer-valued matrix found")
118
+ return None
119
+
120
+
121
+
122
+
123
+ def initialize_processed_layer(data: ad.AnnData):
124
+ """
125
+ If 'processed' layer is missing, copy it from data.raw.X (cast to float32).
126
+ """
127
+ if "processed" not in data.layers:
128
+ data.layers["processed"] = data.raw.X.astype("float32")
129
+ return data
130
+
131
+
132
+ # Replacing inline code with a small helper:
133
+ # (we simply inline the code from the original snippet)
134
+ # You can also fully factor it out for clarity:
135
+ # ---------------------------------------------------
136
+ # Actually let's define that properly here to keep it consistent:
137
+ def filter_reference_id(data: ad.AnnData, hyperparameters: dict):
138
+ human_map = pd.read_csv("teddy/data_processing/utils/gene_mapping/data/human_mapping.txt", sep="\t")
139
+ mouse_map = pd.read_csv("teddy/data_processing/utils/gene_mapping/data/2407_mouse_gene_mapping.txt", sep="\t")
140
+ orthologs = pd.read_csv(
141
+ "teddy/data_processing/utils/gene_mapping/data/mouse_to_human_orthologs.one2one.txt", sep="\t"
142
+ )
143
+
144
+ if hyperparameters.get("mouse_nonorthologs", False):
145
+ reference_id = map_mouse_human2(
146
+ data_frame=data.var,
147
+ query_column=None,
148
+ human_map_db=human_map,
149
+ mouse_map_db=mouse_map,
150
+ orthology_db=orthologs,
151
+ )["reference_id"]
152
+ else:
153
+ reference_id = map_mouse_human(
154
+ data_frame=data.var,
155
+ query_column=None,
156
+ human_map_db=human_map,
157
+ mouse_map_db=mouse_map,
158
+ orthology_db=orthologs,
159
+ )["reference_id"]
160
+
161
+ valid_mask = reference_id != ""
162
+ data = data[:, valid_mask].copy()
163
+ reference_id = reference_id[valid_mask].reset_index(drop=True)
164
+
165
+ if not isinstance(data.layers["processed"], np.ndarray):
166
+ corrected = data.layers["processed"].toarray()
167
+ else:
168
+ corrected = data.layers["processed"]
169
+
170
+ unique_ids = reference_id.unique()
171
+ vars_to_keep = []
172
+ for rid in unique_ids:
173
+ repeated_idx = np.where(reference_id == rid)[0]
174
+ vars_to_keep.append(repeated_idx[0])
175
+ if len(repeated_idx) > 1:
176
+ corrected[:, repeated_idx[0]] = corrected[:, repeated_idx].max(axis=1)
177
+
178
+ vars_to_keep = sorted(vars_to_keep)
179
+ corrected = corrected[:, vars_to_keep]
180
+ data = data[:, vars_to_keep]
181
+
182
+ with warnings.catch_warnings():
183
+ warnings.filterwarnings("ignore", category=ImplicitModificationWarning)
184
+ data.layers["processed"] = csr_matrix(corrected)
185
+ data.var["reference_id"] = list(reference_id[vars_to_keep])
186
+
187
+ gc.collect()
188
+ return data
189
+
190
+
191
+ # End of inline helper
192
+ # ---------------------------------------------------
193
+
194
+
195
+ def remove_assays(data: ad.AnnData, assays_to_remove: list):
196
+ """
197
+ Removes observations whose 'assay' value is in assays_to_remove. Assumes 'assay' is present in data.obs.
198
+ """
199
+ data = data[~data.obs.assay.isin(assays_to_remove)].copy()
200
+ gc.collect()
201
+ return data
202
+
203
+
204
+ def filter_cells_by_gene_counts(data: ad.AnnData, min_count: int):
205
+ """
206
+ Removes cells (observations) whose total gene counts < min_count.
207
+ """
208
+ mask = sc.pp.filter_cells(data.layers["processed"], min_counts=min_count)[0]
209
+ data = data[np.where(mask)].copy()
210
+ del mask
211
+ gc.collect()
212
+ return data
213
+
214
+
215
+ def filter_cells_by_mitochondrial_fraction(data: ad.AnnData, max_mito_prop: float):
216
+ """
217
+ Remove low-quality cells whose mitochondrial read fraction exceeds *max_mito_prop*.
218
+ DO NOT RUN THIS IN ANY PREPROCESSING PIPELINE UNTIL YOU HAVE SET RAW COUNTS
219
+ Parameters
220
+ ----------
221
+ data
222
+ `AnnData` object containing counts. Works with dense or sparse matrices.
223
+ max_mito_prop
224
+ Threshold above which cells are discarded.
225
+ Returns
226
+ -------
227
+ AnnData
228
+ A **copy** of `data` with poor-quality cells removed and two new
229
+ columns added to ``.obs``:
230
+ - **mito_prop** – per-cell mitochondrial fraction
231
+ - **poor_quality_mito** – boolean flag marking dropped cells
232
+ """
233
+ # We can safely assume that counts live in data.X because we set those
234
+ # prior to running this step in the preprocess function.
235
+ counts = data.X
236
+ var_index = data.var_names
237
+ if var_index[0].startswith("ENSG"):
238
+ ref = _HUMAN_MITO_ENSEMBL
239
+ else:
240
+ ref = _HUMAN_MITO_SYMBOLS
241
+ mito_idx = np.flatnonzero(var_index.isin(ref))
242
+ if mito_idx.size == 0:
243
+ print("No mitochondrial genes found, returning data")
244
+ return data
245
+ if sp.issparse(counts):
246
+ total = counts.sum(axis=1).A1
247
+ mito = counts[:, mito_idx].sum(axis=1).A1
248
+ else:
249
+ total = counts.sum(axis=1)
250
+ mito = counts[:, mito_idx].sum(axis=1)
251
+ mito_prop = mito / np.maximum(total, 1)
252
+ data.obs["mito_prop"] = mito_prop
253
+ data.obs["poor_quality_mito"] = mito_prop > max_mito_prop
254
+ filtered = data[~data.obs["poor_quality_mito"]].copy()
255
+ gc.collect()
256
+ return filtered
257
+
258
+
259
+ def filter_highly_variable_genes(data: ad.AnnData, method: str):
260
+ """
261
+ Filter genes to those that are highly variable using scanpy.
262
+ method must be "seurat_v3" or "cell_ranger".
263
+ """
264
+ if "highly_variable" in data.var:
265
+ data = data[:, data.var["highly_variable"]]
266
+ else:
267
+ sc.pp.highly_variable_genes(data, flavor=method, n_top_genes=10000, subset=True)
268
+ gc.collect()
269
+ return data
270
+
271
+
272
+ def normalize_data_inplace(matrix_csr: csr_matrix, norm_value: float):
273
+ """
274
+ In-place row normalization + scale. matrix_csr must be a CSR matrix.
275
+ """
276
+ # In-place row normalize (L1)
277
+ sparsefuncs_fast.inplace_csr_row_normalize_l1(matrix_csr)
278
+ # Multiply each row by norm_value
279
+ scale_factors = np.array([norm_value] * matrix_csr.shape[0])
280
+ sparsefuncs.inplace_row_scale(matrix_csr, scale_factors)
281
+ gc.collect()
282
+
283
+
284
+ def scale_columns_by_median_dict(layer: csr_matrix, data: ad.AnnData, median_dict_path: str, median_column: str):
285
+ """
286
+ Read a JSON median_dict, scale columns by 1/median. The lookup key is either
287
+ data.var.index or data.var[median_column].
288
+ """
289
+ with open(median_dict_path) as f:
290
+ median_dict = json.load(f)
291
+
292
+ if median_column == "index":
293
+ median_var = data.var.index
294
+ else:
295
+ median_var = data.var[median_column]
296
+
297
+ factors = []
298
+ for g in median_var:
299
+ if g in median_dict:
300
+ factors.append(1.0 / median_dict[g])
301
+ else:
302
+ factors.append(1.0)
303
+ factors = np.array(factors)
304
+
305
+ # Apply in-place column scale
306
+ sparsefuncs.inplace_csr_column_scale(layer, factors)
307
+
308
+
309
+ def log_transform_layer(data: ad.AnnData, layer_name: str = "processed"):
310
+ """
311
+ Apply sc.pp.log1p in place to data.layers[layer_name].
312
+ """
313
+ sc.pp.log1p(data, layer=layer_name, copy=False)
314
+
315
+
316
+ def compute_and_save_medians(data: ad.AnnData, data_path: str, hyperparameters: dict):
317
+ """
318
+ Convert zeros to NaN, compute column medians ignoring NaN, and save results as JSON.
319
+ """
320
+ with warnings.catch_warnings():
321
+ warnings.filterwarnings("ignore", r"All-NaN (slice|axis) encountered")
322
+
323
+ mat = data.layers["processed"].toarray()
324
+ mat[mat == 0] = np.nan
325
+ medians = np.nanmedian(mat, axis=0)
326
+
327
+ if hyperparameters["median_column"] == "index":
328
+ median_var = data.var.index.copy()
329
+ if not isinstance(median_var, pd.Series):
330
+ median_var = pd.Series(median_var)
331
+ else:
332
+ median_var = data.var[hyperparameters["median_column"]].copy()
333
+
334
+ valid_idxs = np.where(~np.isnan(medians))[0]
335
+ median_values = {median_var.iloc[k]: medians[k].item() for k in valid_idxs}
336
+
337
+ save_path = data_path.replace(hyperparameters["load_dir"], hyperparameters["save_dir"])
338
+ save_path = save_path.replace(".h5ad", "_medians.json")
339
+ with open(save_path, "w") as f:
340
+ json.dump(median_values, f, indent=4)
341
+
342
+
343
+ def update_metadata(metadata: dict, data: ad.AnnData, hyperparameters: dict):
344
+ """
345
+ Update metadata with cell_count and track processing arguments.
346
+ """
347
+ metadata["cell_count"] = data.n_obs
348
+ if "processing_args" in metadata:
349
+ metadata["processing_args"] = [metadata["processing_args"]] + [hyperparameters]
350
+ else:
351
+ # original fallback
352
+ metadata["processing_args"] = [hyperparameters]
353
+ return metadata
354
+
355
+
356
+ def save_and_cleanup(data: ad.AnnData, metadata: dict, data_path: str, metadata_path: str, hyperparameters: dict):
357
+ """
358
+ Write processed data and metadata to disk, then GC cleanup.
359
+ """
360
+ load_dir = hyperparameters["load_dir"]
361
+ save_dir = hyperparameters["save_dir"]
362
+ data_filename = os.path.basename(data_path) # e.g. "sample_data.h5ad"
363
+ metadata_filename = os.path.basename(metadata_path) # e.g. "sample_data_metadata.json"
364
+
365
+ save_processed_path = os.path.join(save_dir, data_filename)
366
+ save_metadata_path = os.path.join(save_dir, metadata_filename)
367
+
368
+ # Make sure the directories exist
369
+ os.makedirs(os.path.dirname(save_processed_path), exist_ok=True)
370
+ os.makedirs(os.path.dirname(save_metadata_path), exist_ok=True)
371
+
372
+ if data.n_obs == 0:
373
+ return None, None
374
+
375
+ # Ensure relevant layers are sparse matrices
376
+ if not isinstance(data.raw.X, csr_matrix):
377
+ data.raw.X = csr_matrix(data.raw.X)
378
+ if not isinstance(data.X, csr_matrix):
379
+ data.X = csr_matrix(data.X)
380
+ if "processed" in data.layers and not isinstance(data.layers["processed"], csr_matrix):
381
+ data.layers["processed"] = csr_matrix(data.layers["processed"])
382
+
383
+ try:
384
+ data.write_h5ad(save_processed_path, compression="gzip")
385
+ except Exception:
386
+ # Rare bug with categorical indexes
387
+ if data.obs.index.name in data.obs.columns:
388
+ del data.obs[data.obs.index.name]
389
+ data.write_h5ad(save_processed_path, compression="gzip")
390
+
391
+ del data
392
+ gc.collect()
393
+
394
+ with open(save_metadata_path, "w") as f:
395
+ json.dump(metadata, f, indent=4)
396
+
397
+ return True, True
398
+
399
+
400
+ def preprocess(data_path: str, metadata_path: str, hyperparameters: dict):
401
+ """
402
+ Original pipeline steps:
403
+ 1. Load data & metadata
404
+ 2. Ensure data.raw if counts are integer
405
+ 3. Initialize 'processed' layer
406
+ 4. Filter genes by reference_id
407
+ 5. Remove assays
408
+ 6. Filter cells (min gene counts)
409
+ 7. Filter cells (max mito fraction)
410
+ 8. HVG filtering
411
+ 9. Normalize total
412
+ 10. Median-based column scaling
413
+ 11. Log transform
414
+ 12. Compute medians (optional)
415
+ 13. Update metadata and save
416
+ """
417
+ # 1. Load
418
+ data, metadata = load_data_and_metadata(data_path, metadata_path)
419
+
420
+ # 2. Ensure data.raw if needed
421
+ data = set_raw_if_necessary(data)
422
+ if data is None:
423
+ return None, None
424
+
425
+ # 3. Initialize 'processed'
426
+ data = initialize_processed_layer(data)
427
+ # Perturbseq fine-tuning pipeline
428
+
429
+ # 4. Possible map/reference_id
430
+ if hyperparameters["reference_id_only"]:
431
+ data = filter_reference_id(data, hyperparameters)
432
+
433
+ # 5. Remove assays
434
+ if "assay" in data.obs and hyperparameters["remove_assays"]:
435
+ data = remove_assays(data, hyperparameters["remove_assays"])
436
+
437
+ # 6. Filter cells by min gene counts
438
+ if hyperparameters["min_gene_counts"]:
439
+ data = filter_cells_by_gene_counts(data, hyperparameters["min_gene_counts"])
440
+
441
+ # 7. Filter cells by mitochondrial fraction
442
+ if hyperparameters["max_mitochondrial_prop"]:
443
+ # The "original" version *always* used feature_name, so fallback=False
444
+ data = filter_cells_by_mitochondrial_fraction(
445
+ data, hyperparameters["max_mitochondrial_prop"])
446
+
447
+ # 8. HVG filtering
448
+ if hyperparameters["hvg_method"] in ["seurat_v3", "cell_ranger"]:
449
+ data = filter_highly_variable_genes(data, hyperparameters["hvg_method"])
450
+
451
+ # 9. Normalize total (row L1 + scale)
452
+ if hyperparameters["normalized_total"]:
453
+ if not isinstance(data.layers["processed"], csr_matrix):
454
+ data.layers["processed"] = csr_matrix(data.layers["processed"])
455
+ normalize_data_inplace(data.layers["processed"], hyperparameters["normalized_total"])
456
+
457
+ # 10. Scale columns using median_dict
458
+ if hyperparameters["median_dict"]:
459
+ scale_columns_by_median_dict(
460
+ data.layers["processed"], data, hyperparameters["median_dict"], hyperparameters["median_column"]
461
+ )
462
+
463
+ # 11. Log1p transform
464
+ if hyperparameters["log1p"]:
465
+ log_transform_layer(data, "processed")
466
+
467
+ # 12. Possibly compute medians
468
+ if hyperparameters["compute_medians"]:
469
+ compute_and_save_medians(data, data_path, hyperparameters)
470
+
471
+ # 13. Update metadata, save & cleanup
472
+ metadata = update_metadata(metadata, data, hyperparameters)
473
+ return save_and_cleanup(data, metadata, data_path, metadata_path, hyperparameters)
474
+
475
+
476
+ ###############################################################################
477
+ # Main block
478
+ ###############################################################################
479
+ if __name__ == "__main__":
480
+ parser = ArgumentParser(description="Preprocess scRNA-seq data stored in AnnData format.")
481
+ parser.add_argument(
482
+ "--data_path",
483
+ type=str,
484
+ required=True,
485
+ help="Path to the input .h5ad file."
486
+ )
487
+ parser.add_argument(
488
+ "--metadata_path",
489
+ type=str,
490
+ required=True,
491
+ help="Path to the input metadata JSON file."
492
+ )
493
+ parser.add_argument(
494
+ "--config_path",
495
+ type=str,
496
+ required=True,
497
+ help="Path to the JSON configuration file containing hyperparameters."
498
+ )
499
+
500
+ args = parser.parse_args()
501
+
502
+ # Load hyperparameters from JSON
503
+ with open(args.config_path, "r") as f:
504
+ hyperparameters = json.load(f)
505
+
506
+ # Call the pipeline
507
+ success, _ = preprocess(
508
+ data_path=args.data_path,
509
+ metadata_path=args.metadata_path,
510
+ hyperparameters=hyperparameters
511
+ )
512
+
513
+ if success:
514
+ print("Preprocessing completed successfully.")
515
+ else:
516
+ print("Preprocessing produced no output (no integer count matrix found or 0 cells); no file saved.")
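To make steps 9 to 11 of the pipeline concrete, here is a small dense NumPy sketch of the same arithmetic: L1 row normalization scaled to a target total, per-gene scaling by 1/median, then `log1p`. It is not the module's code, which operates in place on CSR matrices via scikit-learn helpers. The toy matrix, the `toy_medians` values, and the function name `normalize_scale_log` are made up for illustration.

```python
import numpy as np

def normalize_scale_log(counts, target_total, medians):
    """Dense sketch: row-normalize to target_total, divide columns by medians, log1p."""
    counts = np.asarray(counts, dtype=np.float64)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Guard against empty cells: rows that sum to zero stay all-zero.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    normalized = counts / safe_sums * target_total
    # Scale each gene (column) by 1 / median.
    scaled = normalized / np.asarray(medians, dtype=np.float64)
    return np.log1p(scaled)

# Two cells (rows) by three genes (columns); values are arbitrary.
toy_counts = np.array([[1, 1, 2],
                       [0, 5, 5]])
toy_medians = np.array([1.0, 2.0, 4.0])  # hypothetical per-gene medians
result = normalize_scale_log(toy_counts, target_total=100, medians=toy_medians)
```

In the first toy cell the counts sum to 4, so normalizing to 100 gives 25, 25, 50; dividing by the medians gives 25, 12.5, 12.5 before the log transform.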
teddy/data_processing/tokenization/README.md ADDED
@@ -0,0 +1,58 @@
1
+ The `tokenize_for_model.py` script is designed to tokenize gene expression data for use in our models. It takes in preprocessed data, applies various tokenization techniques, and prepares it for training or inference.
2
+
3
+ # General Workflow
4
+ The script follows these main steps:
5
+ 0. **Load Tokenization Arguments**: The script starts by loading the tokenization arguments from a configuration file or dictionary.
6
+ 1. **Load Gene Tokenizer**: It loads a pre-trained gene tokenizer based on the provided tokenization arguments.
7
+ 2. **Load AnnData**: The script reads the gene expression data from an AnnData file.
8
+ 3. **Check Genes in Tokenizer**: It verifies that the genes in the dataset are present in the tokenizer's vocabulary.
9
+ 4. **Build Token Array**: The script constructs a token array for the genes in the dataset.
10
+ 5. **Convert Processed Layer to Dense**: It converts the processed layer of the AnnData object to a dense matrix.
11
+ 6. **Tokenize in Batches**: The script processes the data in batches, applying tokenization and optional binning or ranking.
12
+ 7. **Save Tokenized Data**: Finally, the script saves the tokenized data to disk.
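Steps 4 to 6 can be pictured with a small NumPy sketch: each cell keeps its `k` most highly expressed genes, ordered from highest to lowest expression, and the gene indices are mapped to token IDs. This is an illustration, not the script's code (which uses `torch.topk`). The token IDs, the CLS ID, and the function name below are invented for the example.

```python
import numpy as np

def topk_tokens(expr, token_array, max_seq_len, add_cls=False, cls_token_id=0):
    """Return (gene_ids, values) with the top-k genes per cell, highest first."""
    # Reserve one position for the CLS token when it is requested.
    k = max_seq_len - 1 if add_cls else max_seq_len
    k = min(k, expr.shape[1])
    # Stable argsort on negated values gives descending order per row.
    order = np.argsort(-expr, axis=1, kind="stable")[:, :k]
    values = np.take_along_axis(expr, order, axis=1)
    gene_ids = token_array[order]
    if add_cls:
        n_cells = expr.shape[0]
        cls_col = np.full((n_cells, 1), cls_token_id, dtype=gene_ids.dtype)
        gene_ids = np.concatenate([cls_col, gene_ids], axis=1)
        # The CLS position gets a placeholder value of 1, as in the script.
        ones_col = np.ones((n_cells, 1), dtype=values.dtype)
        values = np.concatenate([ones_col, values], axis=1)
    return gene_ids, values

# Hypothetical token IDs for four genes, and two cells of expression values.
token_array = np.array([10, 11, 12, 13])
expr = np.array([[0.0, 3.0, 1.0, 2.0],
                 [5.0, 0.0, 0.5, 4.0]])
gene_ids, values = topk_tokens(expr, token_array, max_seq_len=3, add_cls=True, cls_token_id=1)
```

With `max_seq_len=3` and a CLS token, each cell keeps two genes: the first cell keeps the genes with values 3.0 and 2.0, the second the genes with values 5.0 and 4.0.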
13
+
14
+ # Tokenization Arguments
15
+ The script uses several tokenization arguments to control its behavior. Here is an explanation of each argument and the steps they influence:
16
+
17
+ - `max_seq_len`
18
+ - Description: Specifies the maximum sequence length for the tokenized data.
19
+ - Impact: Determines the number of genes to include in each tokenized sequence (cell). If `add_cls` is enabled, the number of gene tokens is reduced by one to make room for the CLS token.
20
+ - `add_cls`
21
+ - Description: Indicates whether to prepend a CLS token to each sequence.
22
+ - Impact: If enabled, a CLS token is added to the beginning of each sequence, and the sequence length is adjusted accordingly.
23
+ - `cls_token_id`
24
+ - Description: The token ID to use for the CLS token.
25
+ - Impact: If `add_cls` is enabled, this token ID is used for the CLS token.
26
+ - `random_genes`
27
+ - Description: Specifies whether to select a random subset of genes before applying top-k selection.
28
+ - Impact: If enabled, a random subset of genes is selected for each batch, and then the top-k values are determined from this subset.
29
+ - `include_zero_genes`
30
+ - Description: Indicates whether to include zero-expression genes in the tokenized data.
31
+ - Impact: If enabled, zero-expression genes are included in the tokenized sequences. Otherwise, they are filtered out.
32
+ - `bins`
33
+ - Description: Specifies the number of bins to use for binning expression values.
34
+ - Impact: If set, the script bins the expression values into the specified number of bins. This argument is only relevant for TEDDY-X.
35
+ - `continuous_rank`
36
+ - Description: Indicates whether to rank expression values continuously.
37
+ - Impact: If enabled, the script replaces the sorted expression values with evenly spaced ranks in the range [-1, 1], with the most highly expressed gene at 1. This argument is only relevant for TEDDY-X.
38
+ - `gene_seed`
39
+ - Description: A random seed for reproducibility.
40
+ - Impact: If set, the script uses this seed to ensure reproducible random operations.
41
+ - `gene_id_column`
42
+ - Description: The column name in the AnnData object that contains gene IDs.
43
+ - Impact: The script uses this column to match genes in the dataset against the tokenizer's vocabulary.
44
+ - `label_column`
45
+ - Description: The column name in the AnnData object that contains classification labels.
46
+ - Impact: If set, the script adds these labels to the tokenized data.
47
+ - `bio_annotations`
48
+ - Description: Indicates whether to add biological annotations to the tokenized data.
49
+ - Impact: If enabled, the script adds annotations such as disease, tissue, cell type, and sex to the tokenized data.
50
+ - `disease_mapping`, `tissue_mapping`, `cell_mapping`, `sex_mapping`
51
+ - Description: File paths to JSON files containing mappings for biological annotations.
52
+ - Impact: The script uses these mappings to convert biological annotations to token IDs.
53
+ - `add_disease_annotation`
54
+ - Description: Indicates whether to override labels with disease annotations.
55
+ - Impact: If enabled, the script overrides the labels with disease annotations.
56
+ - `max_shard_samples`
57
+ - Description: The maximum number of samples per shard when saving the tokenized data.
58
+ - Impact: The script splits the tokenized data into shards with the specified maximum number of samples.
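For `continuous_rank`, the idea can be sketched as follows, assuming the values for a cell are already sorted from highest to lowest expression (as they are after top-k selection): the actual values are replaced by evenly spaced ranks running from 1 down to -1. The function name is hypothetical, and NumPy stands in here for the script's PyTorch implementation.

```python
import numpy as np

def rank_continuous(sorted_vals):
    """Map a descending-sorted sequence to evenly spaced ranks from 1 down to -1."""
    n = len(sorted_vals)
    if n == 0:
        return np.array([], dtype=np.float64)
    # linspace runs from -1 to 1; reversing puts the top-expressed gene at 1.
    return np.linspace(-1.0, 1.0, num=n)[::-1]

# Five expression values for one cell, already sorted from high to low.
ranks = rank_continuous([9.0, 4.0, 2.5, 1.0, 0.5])
```

Only the ordering of the input matters here: the magnitudes of the expression values do not affect the ranks.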
teddy/data_processing/tokenization/__init__.py ADDED
File without changes
teddy/data_processing/tokenization/tokenization.py ADDED
@@ -0,0 +1,419 @@
1
+ """
2
+ Module: tokenization.py
3
+
4
+ This module provides a tokenization pipeline for preprocessed single-cell RNA sequencing (scRNA-seq) data.
5
+ It converts gene expression data stored in AnnData format into tokenized sequences that can
6
+ be used for downstream machine learning tasks, such as masked language modeling or classification.
7
+
8
+ Main Features:
9
+ - Tokenizes gene expression data into integer tokens using a custom GeneTokenizer.
10
+ - Supports additional biological annotations (e.g., disease, tissue, cell type, sex).
11
+ - Handles both top-k and random gene selection for tokenization.
12
+ - Configurable via JSON-based hyperparameters or TokenizationArgs objects.
13
+ - Saves tokenized data in Hugging Face Dataset format for efficient processing.
14
+
15
+ Dependencies:
16
+ - anndata, numpy, torch, datasets, tqdm
17
+
18
+ Usage:
19
+ - Run this script as a standalone program with a configuration file specifying the hyperparameters.
20
+ - Import the `tokenize` function and call it with the data path, metadata path, and tokenization arguments.
21
+ """
22
+
23
+ import gc
24
+ import os
25
+ import json
26
+ import random
27
+ import shutil
28
+ from argparse import ArgumentParser
29
+ from typing import Union
30
+
31
+ import anndata as ad
32
+ import numpy as np
33
+ import torch
34
+ from datasets import Dataset, load_from_disk
35
+ from tqdm import tqdm
36
+
37
+ from teddy.tokenizer.gene_tokenizer import GeneTokenizer
38
+ from teddy.tokenizer.tokenization_args import TokenizationArgs
39
+
40
+ ###############################################################################
41
+ # Updated Functions
42
+ ###############################################################################
43
+
44
+
45
+ def _bin_values(vals_list, tokenization_args, no_sorting=False):
46
+ """
47
+ Bins expression values into specified bins, assigning bin 0 to non-expressed genes
48
+ when `include_zero_genes` is True.
49
+
50
+ no_sorting=False => "positional chunk" approach for topk-sorted arrays - provided data is expected to be sorted through topk (input expression values).
51
+ no_sorting=True => simple bucketize approach ignoring the topk order - provided data is not sorted (labels).
52
+ """
53
+ binned_vals = []
54
+ for vals in vals_list:
55
+ if isinstance(vals, np.ndarray):
56
+ vals = torch.tensor(vals)
57
+
58
+ vals_to_bin = vals
59
+
60
+ # Original binning approach
61
+ if not no_sorting:
62
+ # "positional chunk" approach from the original code
63
+ num_repetitions = max(1, len(vals_to_bin) // tokenization_args.bins)
64
+ bin_pattern = torch.arange(0, tokenization_args.bins).unsqueeze(1).repeat(1, num_repetitions).flatten()
65
+
66
+ # slice or pad to match the length of vals_to_bin
67
+ if len(bin_pattern) > len(vals_to_bin):
68
+ bin_pattern = bin_pattern[-len(vals_to_bin) :]
69
+ else:
70
+ extra = len(vals_to_bin) - len(bin_pattern)
71
+ if extra > 0:
72
+ bin_pattern = torch.cat([torch.zeros(extra), bin_pattern])
73
+ bin_pattern = bin_pattern.flip(0)
74
+
75
+ binned_vals.append(bin_pattern)
76
+ else:
77
+ if len(vals_to_bin) > 0:
78
+ bin_edges = torch.linspace(vals_to_bin.min(), vals_to_bin.max(), steps=tokenization_args.bins + 1)
79
+ binned_non_zero_vals = torch.bucketize(vals_to_bin, bin_edges)
80
+ binned_non_zero_vals = torch.clamp(binned_non_zero_vals, min=1)
81
+ binned_tensor = binned_non_zero_vals.float()
82
+ binned_vals.append(binned_tensor)
83
+ else:
84
+ binned_tensor = torch.zeros_like(vals_to_bin, dtype=torch.float)
85
+ binned_vals.append(binned_tensor)
86
+ return binned_vals
87
+
88
+
89
+ def _rank_continuous(vals, tokenization_args):
90
+ """
91
+ Ranks gene expression values in the range [-1, 1].
92
+ """
93
+ if isinstance(vals, np.ndarray):
94
+ vals = torch.tensor(vals)
95
+
96
+ if len(vals) > 0:
97
+ ranked_vals = torch.linspace(-1, 1, steps=len(vals)).flip(0)
98
+ else:
99
+ ranked_vals = vals
100
+ return ranked_vals
101
+
102
+
103
+ def _prepare_tokenizer_args(tokenization_args: Union[dict, TokenizationArgs]):
104
+ """
105
+ Prepares and validates tokenization arguments, ensuring reproducibility
106
+ by setting random seeds if specified.
107
+ """
108
+ if isinstance(tokenization_args, dict):
109
+ load_dir = tokenization_args["load_dir"]
110
+ save_dir = tokenization_args["save_dir"]
111
+ token_args_obj = TokenizationArgs(**tokenization_args)
112
+ else:
113
+ # It's already TokenizationArgs
114
+ load_dir = tokenization_args.load_dir
115
+ save_dir = tokenization_args.save_dir
116
+ token_args_obj = tokenization_args
117
+
118
+ # If a random seed is specified, set it for reproducibility
119
+ if token_args_obj.gene_seed is not None:
120
+ random.seed(token_args_obj.gene_seed)
121
+ np.random.seed(token_args_obj.gene_seed)
122
+ torch.manual_seed(token_args_obj.gene_seed)
123
+ if torch.cuda.is_available():
124
+ torch.cuda.manual_seed_all(token_args_obj.gene_seed)
125
+
126
+ return token_args_obj, load_dir, save_dir
127
+
128
+
129
+ def _check_genes_in_tokenizer(data: ad.AnnData, gene_id_column: str, tokenizer: GeneTokenizer):
130
+ """
131
+ Checks if the genes in the dataset are present in the tokenizer's vocabulary.
132
+ """
133
+ if gene_id_column == "index":
134
+ gene_index = data.var.index
135
+ else:
136
+ gene_index = data.var[gene_id_column]
137
+
138
+ # Check membership in vocab
139
+ gene_in_vocab = np.where([g in tokenizer.vocab for g in gene_index])[0]
140
+ coding_genes = gene_index[gene_in_vocab]
141
+ ratio = len(gene_in_vocab) / len(data.var)
142
+ if ratio < 0.1:
143
+ raise OSError(
144
+ f"Only {ratio:.2%} of gene IDs found in tokenizer vocab. " "Check gene_id_column or vocab mismatch."
145
+ )
146
+ return gene_in_vocab, coding_genes, ratio
147
+
148
+
149
+ def _build_batch_tensors(X_batch: torch.Tensor, token_array: torch.Tensor, token_args, data=None, obs_indices=None):
150
+ """
151
+ Build topk or random subsets for each row in X_batch (batch_size x num_genes).
152
+ Return gene_ids, top_vals, labels_list (currently always None), and a trailing None placeholder.
153
+ """
154
+ batch_size = X_batch.shape[0]
155
+ seq_tokens = token_args.max_seq_len - 1 if token_args.add_cls else token_args.max_seq_len
156
+
157
+ # If random_genes => pick random subset then topk that subset
158
+ if token_args.random_genes:
159
+ random_indices = torch.stack([torch.randperm(X_batch.shape[1])[:seq_tokens] for _ in range(batch_size)])
160
+ random_vals = torch.gather(X_batch, 1, random_indices)
161
+ top_vals, rel_indices = torch.topk(
162
+ random_vals, k=min(seq_tokens, random_vals.shape[1]), largest=True, sorted=True
163
+ )
164
+ # Convert rel_indices => absolute indices
165
+ top_indices = torch.gather(random_indices, 1, rel_indices)
166
+ else:
167
+ # normal topk
168
+ top_vals, top_indices = torch.topk(X_batch, k=min(seq_tokens, X_batch.shape[1]), largest=True, sorted=True)
169
+
170
+ gene_ids = token_array[top_indices]
171
+
172
+ # If add_cls => prepend a CLS token
173
+ if token_args.add_cls:
174
+ cls_col = torch.tensor(token_args.cls_token_id).repeat(batch_size, 1)
175
+ gene_ids = torch.cat([cls_col, gene_ids], dim=1)
176
+ ones_col = torch.ones(batch_size, 1, dtype=top_vals.dtype)
177
+ top_vals = torch.cat([ones_col, top_vals], dim=1)
178
+
179
+ labels_list = None
180
+
181
+ return gene_ids, top_vals, labels_list, None
182
+
183
+
+ ###############################################################################
+ # Main tokenize function
+ ###############################################################################
+ def tokenize(data_path: str, metadata_path: str, tokenization_args: Union[dict, TokenizationArgs]):
+     """
+     Tokenizes gene expression data stored in AnnData format.
+
+     Args:
+         data_path (str): Path to the AnnData file containing preprocessed gene expression data.
+         metadata_path (str): Path to the metadata file in JSON format.
+         tokenization_args (Union[dict, TokenizationArgs]): Configuration for tokenization.
+     """
+
+     token_args, load_dir, save_dir = _prepare_tokenizer_args(tokenization_args)
+
+     # 1) Load GeneTokenizer
+     tokenizer = GeneTokenizer.from_pretrained(token_args.tokenizer_name_or_path)
+     if token_args.cls_token_id is None:
+         token_args.cls_token_id = tokenizer.cls_token_id
+
+     # 2) Load AnnData
+     data = ad.read_h5ad(data_path)
+
+     if "processed" not in data.layers:
+         raise ValueError(f"Missing 'processed' layer in {data_path}")
+
+     # 3) Genes in vocab
+     gene_in_vocab, coding_genes, ratio = _check_genes_in_tokenizer(data, token_args.gene_id_column, tokenizer)
+     print(f"{ratio:.2%} of genes found in tokenizer vocab")
+
+     # 4) Build token array for these genes
+     token_array = torch.tensor(tokenizer.encode(coding_genes.tolist(), add_special_tokens=False))
+
+     # 5) Convert processed layer to dense
+     X_matrix = data.layers["processed"].toarray()
+
+     # 6) Prepare final dictionary => HF Dataset
+     all_data = {"gene_ids": [], "values": []}
+
+     BATCH_SIZE = 512
+     n_obs = data.shape[0]
+
+     for start_idx in tqdm(range(0, n_obs, BATCH_SIZE), desc="Tokenizing in batches"):
+         end_idx = min(start_idx + BATCH_SIZE, n_obs)
+         obs_indices = np.arange(start_idx, end_idx)
+
+         X_batch = torch.tensor(X_matrix[obs_indices, :][:, gene_in_vocab], dtype=torch.float)
+         gene_ids_batch, vals_batch, labels_batch, decoder_vals_batch = _build_batch_tensors(
+             X_batch,
+             token_array,
+             token_args,
+             data=None,
+             obs_indices=None,
+         )
+
+         final_gene_list = []
+         final_vals_list = []
+         final_labels_list = []
+         if "decoder_values" in data.layers:
+             final_decoder_vals_list = []
+
+         # Optionally drop zero-valued genes from each row
+         for row_idx in range(len(gene_ids_batch)):
+             g_row = gene_ids_batch[row_idx]
+             v_row = vals_batch[row_idx]
+
+             if labels_batch is not None:
+                 lb_row = labels_batch[row_idx]
+             else:
+                 lb_row = None
+
+             if decoder_vals_batch is not None:
+                 dec_v_row = decoder_vals_batch[row_idx]
+             else:
+                 dec_v_row = None
+
+             if not token_args.include_zero_genes:
+                 nonzero_mask = v_row != 0
+                 g_row = g_row[nonzero_mask]
+                 v_row = v_row[nonzero_mask]
+                 if lb_row is not None:
+                     lb_row = lb_row[nonzero_mask]
+                 if dec_v_row is not None:
+                     dec_v_row = dec_v_row[nonzero_mask]
+
+             final_gene_list.append(g_row)
+             final_vals_list.append(v_row)
+             final_labels_list.append(lb_row)
+             if "decoder_values" in data.layers:
+                 final_decoder_vals_list.append(dec_v_row)
+
+         # Binning and continuous ranking are mutually exclusive
+         if token_args.bins and token_args.continuous_rank:
+             raise ValueError("Should not use bins and continuous_rank simultaneously.")
+
+         if token_args.bins:
+             # no_sorting=True would only apply when binning labels; values are binned with sorting
+             final_vals_list = _bin_values(final_vals_list, token_args, no_sorting=False)
+
+         elif token_args.continuous_rank:
+             for i, vals in enumerate(final_vals_list):
+                 final_vals_list[i] = _rank_continuous(vals, token_args)
+
+         # Add to all_data
+         for row_idx in range(len(final_gene_list)):
+             all_data["gene_ids"].append(final_gene_list[row_idx].tolist())
+             all_data["values"].append(final_vals_list[row_idx].tolist())
+
+     if token_args.label_column:
+         all_data["labels"] = data.obs[token_args.label_column].cat.codes.values.tolist()
+
+     # bio_annotations
+     if token_args.bio_annotations:
+         with open(token_args.disease_mapping) as f:
+             disease_mapping = json.load(f)
+         with open(token_args.tissue_mapping) as f:
+             tissue_mapping = json.load(f)
+         with open(token_args.cell_mapping) as f:
+             cell_mapping = json.load(f)
+         with open(token_args.sex_mapping) as f:
+             sex_mapping = json.load(f)
+
+         # Fill in defaults for annotation columns missing from the dataset
+         if "disease" not in data.obs.columns:
+             data.obs["disease"] = "normal"
+         if "tissue" not in data.obs.columns:
+             data.obs["tissue"] = "cultured cell"
+         if "sex" not in data.obs.columns:
+             data.obs["sex"] = "unknown"
+         if "cell_type" not in data.obs.columns:
+             data.obs["cell_type"] = "unknown"
+
+         mapped_diseases = [disease_mapping[k] for k in data.obs["disease"].tolist()]
+         mapped_tissues = [tissue_mapping[k] for k in data.obs["tissue"].tolist()]
+         mapped_cell_types = [cell_mapping[k] for k in data.obs["cell_type"].tolist()]
+         mapped_sexes = [sex_mapping[k] for k in data.obs["sex"].tolist()]
+
+         all_data["disease"] = tokenizer.encode(mapped_diseases, add_special_tokens=False)
+         all_data["tissue"] = tokenizer.encode(mapped_tissues, add_special_tokens=False)
+         all_data["cell_type"] = tokenizer.encode(mapped_cell_types, add_special_tokens=False)
+         all_data["sex"] = tokenizer.encode(mapped_sexes, add_special_tokens=False)
+
+     if token_args.add_disease_annotation:
+         # We override "labels" with "disease" tokens
+         all_data["labels"] = all_data["disease"]
+
+     del data
+     gc.collect()
+
+     dataset = Dataset.from_dict(all_data)
+     num_samples = len(dataset)
+     if token_args.max_shard_samples and num_samples > 0:
+         shard_size = min(token_args.max_shard_samples, num_samples)
+         # Ceiling division so that no shard holds more than max_shard_samples rows
+         num_shards = -(-num_samples // shard_size)
+     else:
+         num_shards = 1
+
+     # Compute the path of data_path relative to load_dir
+     relative_data_path = os.path.relpath(data_path, load_dir)
+     relative_metadata_path = os.path.relpath(metadata_path, load_dir)
+
+     # Strip the ".h5ad" extension from the relative data path
+     no_extension_data_path = os.path.splitext(relative_data_path)[0]
+
+     # Reconstruct the final paths under save_dir
+     save_tokenized_data_path = os.path.join(save_dir, no_extension_data_path)
+     save_metadata_path = os.path.join(save_dir, relative_metadata_path)
+
+     dataset.save_to_disk(save_tokenized_data_path, num_shards=num_shards)
+     shutil.copy(metadata_path, save_metadata_path)
+
+
+ ###############################################################################
+ # A simple shard function
+ ###############################################################################
+ def shard_hf_dataset(data_path: str, metadata_path: str, tokenization_args: Union[dict, TokenizationArgs]):
+     """
+     Shards a Hugging Face Dataset into smaller chunks for efficient storage and processing.
+     """
+     if isinstance(tokenization_args, dict):
+         load_dir = tokenization_args["load_dir"]
+         save_dir = tokenization_args["save_dir"]
+         token_args_obj = TokenizationArgs(**tokenization_args)
+     else:
+         load_dir = tokenization_args.load_dir
+         save_dir = tokenization_args.save_dir
+         token_args_obj = tokenization_args
+
+     all_data = load_from_disk(data_path)
+     num_samples = len(all_data)
+     if token_args_obj.max_shard_samples and num_samples > 0:
+         shard_size = min(token_args_obj.max_shard_samples, num_samples)
+         # Ceiling division so that no shard holds more than max_shard_samples rows
+         num_shards = -(-num_samples // shard_size)
+     else:
+         num_shards = 1
+
+     save_tokenized_data_path = data_path.replace(load_dir, save_dir)
+     save_metadata_path = metadata_path.replace(load_dir, save_dir)
+     all_data.save_to_disk(save_tokenized_data_path, num_shards=num_shards)
+     shutil.copy(metadata_path, save_metadata_path)
+
+ ###############################################################################
+ # Main block
+ ###############################################################################
+ if __name__ == "__main__":
+     parser = ArgumentParser(description="Tokenize an AnnData file for downstream ML tasks.")
+     parser.add_argument(
+         "--data_path",
+         type=str,
+         required=True,
+         help="Path to the .h5ad file containing the preprocessed scRNA-seq data."
+     )
+     parser.add_argument(
+         "--metadata_path",
+         type=str,
+         required=True,
+         help="Path to the JSON file containing metadata."
+     )
+     parser.add_argument(
+         "--config_path",
+         type=str,
+         required=True,
+         help="Path to the JSON file specifying tokenization hyperparameters."
+     )
+
+     args = parser.parse_args()
+
+     # Load tokenization arguments from JSON
+     with open(args.config_path, "r") as f:
+         tokenization_args = json.load(f)
+
+     # Call the tokenize function
+     tokenize(
+         data_path=args.data_path,
+         metadata_path=args.metadata_path,
+         tokenization_args=tokenization_args
+     )
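The per-cell selection performed by `_build_batch_tensors`, together with the zero filtering in `tokenize`, can be sketched in plain Python without torch. This is only an illustration under stated assumptions: `topk_tokens` is a hypothetical helper that is not part of the repository, the token IDs and the CLS id are made up, and passing a CLS id stands in for the `add_cls` flag.

```python
from typing import List, Optional, Tuple


def topk_tokens(
    expression: List[float],
    token_ids: List[int],
    max_seq_len: int,
    cls_token_id: Optional[int] = None,
    include_zero_genes: bool = False,
) -> Tuple[List[int], List[float]]:
    """Pure-Python illustration of the per-cell selection (hypothetical helper)."""
    # Reserve one position for CLS when it is requested.
    seq_tokens = max_seq_len - 1 if cls_token_id is not None else max_seq_len
    # Rank gene positions by expression, highest first, and keep the top ones.
    order = sorted(range(len(expression)), key=lambda i: expression[i], reverse=True)
    order = order[:seq_tokens]
    genes = [token_ids[i] for i in order]
    values = [expression[i] for i in order]
    # Prepend CLS with a placeholder value of 1, mirroring the ones column.
    if cls_token_id is not None:
        genes = [cls_token_id] + genes
        values = [1.0] + values
    # Drop zero-valued entries unless they are explicitly kept.
    if not include_zero_genes:
        kept = [(g, v) for g, v in zip(genes, values) if v != 0]
        genes = [g for g, _ in kept]
        values = [v for _, v in kept]
    return genes, values


genes, values = topk_tokens(
    expression=[0.0, 2.5, 0.0, 1.0, 4.0],
    token_ids=[10, 11, 12, 13, 14],  # made-up token IDs
    max_seq_len=4,
    cls_token_id=1,  # made-up CLS id
)
print(genes, values)  # [1, 14, 11, 13] [1.0, 4.0, 2.5, 1.0]
```

Because the CLS placeholder value is 1, it always survives the zero filter, which matches the order of operations in the code above (CLS is prepended before zeros are removed).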
teddy/data_processing/utils/__init__.py ADDED
File without changes
teddy/data_processing/utils/bio_annotations/__init__.py ADDED
File without changes
teddy/data_processing/utils/bio_annotations/calculate_biostats.py ADDED
@@ -0,0 +1,99 @@
+ """
+ Module: calculate_biostats.py
+
+ This module calculates and aggregates biological statistics from single-cell RNA sequencing (scRNA-seq) data
+ stored in AnnData format. It generates per-category statistics (e.g., disease, cell type, tissue, sex)
+ and computes the median expression values for genes across datasets. The results are saved as JSON and CSV files
+ for downstream analysis.
+
+ Main Features:
+ - Computes the median expression values for genes in the "processed" layer of AnnData files.
+ - Generates category-wise statistics (e.g., counts of diseases, cell types, tissues, and sexes).
+ - Aggregates statistics across multiple training datasets.
+ - Outputs results in JSON and CSV formats for easy integration with other tools.
+
+ Dependencies:
+ - anndata: For handling AnnData files.
+ - numpy: For numerical operations, including median calculations.
+ - pandas: For creating and exporting tabular data.
+ - tqdm: For progress visualization during processing.
+ - glob: For recursive file searching.
+
+ Usage:
+ - Run this script as a standalone program with the following arguments:
+     - `--load_dir`: Directory containing the training `.h5ad` files.
+     - `--stats_dict_name`: Path to save the aggregated statistics JSON file.
+ """
+
+ import json
+ import os
+ from argparse import ArgumentParser
+ from glob import glob
+
+ import anndata as ad
+ import numpy as np
+ import pandas as pd
+ from datasets.utils.logging import disable_progress_bar
+ from tqdm import tqdm
+
+
+ def make_median_list(file, out_file):
+     """
+     Writes the per-gene median of non-zero "processed" values in `file` to `out_file` as JSON.
+     Genes with no non-zero values are omitted.
+     """
+     data = ad.read_h5ad(file)
+
+     # set up gene ids
+     gene_index = data.var.index
+     all_X = data.layers["processed"].toarray()
+     # Treat zeros as missing so the median is taken over expressed cells only
+     all_X[all_X == 0] = np.nan
+     median = np.nanmedian(all_X, axis=0)  # shape: (n_genes,)
+     num_median = np.where(~np.isnan(median))[0]
+     median_dict = {gene_index[k]: median[k].item() for k in num_median}
+
+     with open(out_file, "w") as f:
+         json.dump(median_dict, f, indent=4)
+
+
+ if __name__ == "__main__":
+     parser = ArgumentParser()
+     parser.add_argument("--load_dir", default="")
+     parser.add_argument("--stats_dict_name", default="")
+     args = parser.parse_args()
+     disable_progress_bar()
+
+     # Per-file category counts, written next to each training file
+     all_train = list(glob(args.load_dir + "/**/train_*.h5ad", recursive=True))
+     print("Generating individual stats")
+     for train in tqdm(all_train):
+         # Only obs is read, so open the file in read-only backed mode
+         data = ad.read_h5ad(train, backed="r")
+         stats = {}
+         for cat in ["disease", "cell_type", "tissue", "sex"]:
+             stats[cat] = data.obs[cat].value_counts().to_dict()
+
+         with open(os.path.join(os.path.dirname(train), "bio_stats.json"), "w") as f:
+             json.dump(stats, f, indent=4)
+
+     print("Collecting stats")
+     summary_dict = {}
+     summary_dict["disease"] = {}
+     summary_dict["cell_type"] = {}
+     summary_dict["tissue"] = {}
+     summary_dict["sex"] = {}
+     for train in tqdm(all_train):
+         with open(os.path.join(os.path.dirname(train), "bio_stats.json")) as f:
+             stats = json.load(f)
+         for cat in ["disease", "cell_type", "tissue", "sex"]:
+             for k in stats[cat].keys():
+                 if k not in summary_dict[cat].keys():
+                     summary_dict[cat][k] = stats[cat][k]
+                 else:
+                     summary_dict[cat][k] += stats[cat][k]
+
+     # os.makedirs("") raises, so only create the directory when the path has one
+     stats_dir = os.path.dirname(args.stats_dict_name)
+     if stats_dir:
+         os.makedirs(stats_dir, exist_ok=True)
+     with open(args.stats_dict_name, "w") as f:
+         json.dump(summary_dict, f, indent=4)
+
+     # with open(args.stats_dict_name) as f:
+     #     summary_dict = json.load(f)
+
+     for cat in ["disease", "cell_type", "tissue", "sex"]:
+         df = pd.DataFrame.from_dict(summary_dict[cat], orient="index", columns=["Counts"])
+         df.to_csv(args.stats_dict_name.replace(".json", f"_{cat}.csv"))
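The "Collecting stats" step above sums per-file category counts into one summary dictionary. A minimal sketch of the same aggregation using `collections.Counter` follows. The helper name `merge_stats` and the sample counts are hypothetical, and unlike the script this sketch tolerates a category missing from a file by treating it as empty.

```python
from collections import Counter
from typing import Dict, Iterable

CATEGORIES = ("disease", "cell_type", "tissue", "sex")


def merge_stats(per_file_stats: Iterable[Dict[str, Dict[str, int]]]) -> Dict[str, Dict[str, int]]:
    """Sum per-file category counts into a single summary (hypothetical helper)."""
    totals = {cat: Counter() for cat in CATEGORIES}
    for stats in per_file_stats:
        for cat in CATEGORIES:
            # Counter.update adds counts for existing keys and creates new ones.
            totals[cat].update(stats.get(cat, {}))
    return {cat: dict(counter) for cat, counter in totals.items()}


# Made-up counts standing in for two bio_stats.json files.
per_file = [
    {"disease": {"normal": 3, "COVID-19": 2}, "sex": {"female": 5}},
    {"disease": {"normal": 2}, "sex": {"male": 2}},
]
merged = merge_stats(per_file)
print(merged["disease"])  # {'normal': 5, 'COVID-19': 2}
```

The result has the same shape as the `summary_dict` written to `--stats_dict_name`: one dictionary of label counts per category.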
teddy/data_processing/utils/bio_annotations/data/all_filtered.json ADDED
@@ -0,0 +1,1227 @@
1
+ {
2
+ "disease": {
3
+ "frontotemporal dementia": 143044,
4
+ "normal": 97973249,
5
+ "dementia": 2118803,
6
+ "autosomal dominant polycystic kidney disease": 415071,
7
+ "diabetic kidney disease": 360692,
8
+ "lung adenocarcinoma": 1229577,
9
+ "small cell lung carcinoma": 194368,
10
+ "COVID-19": 1704704,
11
+ "Alzheimer disease": 128732,
12
+ "renal cell carcinoma": 142070,
13
+ "dilated cardiomyopathy": 588582,
14
+ "Crohn disease": 263111,
15
+ "gastric intestinal metaplasia": 33283,
16
+ "gastritis": 160485,
17
+ "Barrett esophagus": 14172,
18
+ "myocardial infarction": 241855,
19
+ "epidermolysis bullosa": 7618,
20
+ "B-cell acute lymphoblastic leukemia": 129641,
21
+ "anencephaly": 2358,
22
+ "metastatic melanoma": 218670,
23
+ "premalignant hematological system disease": 55556,
24
+ "triple-negative breast carcinoma": 13700,
25
+ "luminal B breast carcinoma": 10467,
26
+ "malignant ovarian serous tumor": 649870,
27
+ "basal cell carcinoma": 62950,
28
+ "nonpapillary renal cell carcinoma": 147521,
29
+ "benign prostatic hyperplasia": 41258,
30
+ "luminal A breast carcinoma": 12402,
31
+ "systemic lupus erythematosus": 715392,
32
+ "Lewy body dementia": 26103,
33
+ "Parkinson disease": 130493,
34
+ "periodontitis": 32082,
35
+ "arrhythmogenic right ventricular cardiomyopathy": 52749,
36
+ "non-compaction cardiomyopathy": 9733,
37
+ "pulmonary emphysema": 21052,
38
+ "juvenile dermatomyositis": 92130,
39
+ "blastoma": 25472,
40
+ "hydrosalpinx": 1048,
41
+ "chronic kidney disease": 225534,
42
+ "type 1 diabetes mellitus": 90792,
43
+ "acute kidney failure": 128160,
44
+ "listeriosis": 52720,
45
+ "Plasmodium malariae malaria": 16864,
46
+ "pilocytic astrocytoma": 35774,
47
+ "glioblastoma": 1145013,
48
+ "Wilms tumor": 3474,
49
+ "primary sclerosing cholangitis": 27661,
50
+ "amyotrophic lateral sclerosis": 61201,
51
+ "aspiration pneumonia": 13718,
52
+ "malignant pancreatic neoplasm": 4296,
53
+ "amyotrophic lateral sclerosis 26 with or without frontotemporal dementia": 44240,
54
+ "acute myeloid leukemia": 31466,
55
+ "breast carcinoma": 71768,
56
+ "temporal lobe epilepsy": 2094,
57
+ "type 2 diabetes mellitus": 143044,
58
+ "non-small cell lung carcinoma": 524920,
59
+ "Crohn ileitis": 22076,
60
+ "respiratory failure": 4176,
61
+ "long COVID-19": 258,
62
+ "tubular adenoma": 719,
63
+ "tubulovillous adenoma": 402,
64
+ "colorectal cancer": 1155,
65
+ "multiple sclerosis": 10249,
66
+ "clear cell renal carcinoma": 96277,
67
+ "colon sessile serrated adenoma/polyp": 748,
68
+ "hyperplastic polyp": 101,
69
+ "epilepsy": 92965,
70
+ "brain neoplasm": 78,
71
+ "hydrocephalus": 6,
72
+ "plasma cell myeloma": 3264,
73
+ "lymphadenitis": 1373,
74
+ "primary biliary cholangitis": 31864,
75
+ "chromophobe renal cell carcinoma": 501,
76
+ "kidney oncocytoma": 45668,
77
+ "neuroendocrine carcinoma": 558,
78
+ "adenocarcinoma": 1033,
79
+ "influenza": 8650,
80
+ "endocrine pancreas disorder": 28000,
81
+ "age related macular degeneration 7": 1052,
82
+ "basal laminar drusen": 891,
83
+ "opiate dependence": 50745,
84
+ "digestive system disorder": 15495,
85
+ "breast cancer": 4376,
86
+ "cardiomyopathy": 2570,
87
+ "respiratory system disorder": 82770,
88
+ "pulmonary fibrosis": 215970,
89
+ "chronic obstructive pulmonary disease": 58075,
90
+ "interstitial lung disease": 57615,
91
+ "squamous cell lung carcinoma": 20585,
92
+ "cystic fibrosis": 17940,
93
+ "lung large cell carcinoma": 17825,
94
+ "chronic rhinitis": 16790,
95
+ "lymphangioleiomyomatosis": 11385,
96
+ "pleomorphic carcinoma": 11155,
97
+ "pulmonary sarcoidosis": 4600,
98
+ "hypersensitivity pneumonitis": 3910,
99
+ "non-specific interstitial pneumonia": 230
100
+ },
101
+ "cell_type": {
102
+ "oligodendrocyte": 4057585,
103
+ "neuron": 6962719,
104
+ "astrocyte": 1585731,
105
+ "oligodendrocyte precursor cell": 742101,
106
+ "microglial cell": 2726973,
107
+ "endothelial cell": 2112592,
108
+ "extravillous trophoblast": 83805,
109
+ "placental villous trophoblast": 57075,
110
+ "syncytiotrophoblast cell": 14970,
111
+ "skin fibroblast": 123651,
112
+ "T cell": 742776,
113
+ "enterocyte": 46834,
114
+ "endothelial cell of lymphatic vessel": 128117,
115
+ "fibroblast": 1845139,
116
+ "blood vessel endothelial cell": 169588,
117
+ "B cell": 1179496,
118
+ "enteroendocrine cell": 11373,
119
+ "macrophage": 847475,
120
+ "dendritic cell": 79130,
121
+ "vascular leptomeningeal cell": 74718,
122
+ "retina horizontal cell": 21504,
123
+ "natural killer cell": 672198,
124
+ "large pre-B-II cell": 15414,
125
+ "small pre-B-II cell": 13686,
126
+ "double negative thymocyte": 60459,
127
+ "pro-B cell": 47113,
128
+ "group 3 innate lymphoid cell": 10890,
129
+ "late pro-B cell": 10121,
130
+ "fraction A pre-pro B cell": 7828,
131
+ "B-2 B cell": 2072,
132
+ "unknown": 1083041,
133
+ "early lymphoid progenitor": 3226,
134
+ "double-positive, alpha-beta thymocyte": 70221,
135
+ "hematopoietic stem cell": 86169,
136
+ "naive thymus-derived CD4-positive, alpha-beta T cell": 676827,
137
+ "hematopoietic multipotent progenitor cell": 18797,
138
+ "B-1 B cell": 394,
139
+ "naive thymus-derived CD8-positive, alpha-beta T cell": 240314,
140
+ "megakaryocyte-erythroid progenitor cell": 25072,
141
+ "regulatory T cell": 176108,
142
+ "mature B cell": 15042,
143
+ "group 2 innate lymphoid cell": 287,
144
+ "innate lymphoid cell": 367929,
145
+ "immature B cell": 18765,
146
+ "common myeloid progenitor": 2420,
147
+ "CD8-alpha-alpha-positive, alpha-beta intraepithelial T cell": 6932,
148
+ "granulocyte monocyte progenitor cell": 5762,
149
+ "plasma cell": 298232,
150
+ "kidney proximal convoluted tubule epithelial cell": 870511,
151
+ "leukocyte": 139392,
152
+ "kidney loop of Henle thick ascending limb epithelial cell": 357616,
153
+ "kidney distal convoluted tubule epithelial cell": 96341,
154
+ "kidney interstitial fibroblast": 87152,
155
+ "blood vessel smooth muscle cell": 27558,
156
+ "kidney collecting duct principal cell": 218533,
157
+ "kidney collecting duct intercalated cell": 53571,
158
+ "podocyte": 22019,
159
+ "mesangial cell": 2969,
160
+ "kidney granular cell": 3024,
161
+ "macula densa epithelial cell": 1030,
162
+ "muscle cell": 35598,
163
+ "fibroblast of dermis": 54301,
164
+ "tendon cell": 28946,
165
+ "Schwann cell": 35927,
166
+ "chondrocyte": 304322,
167
+ "smooth muscle cell": 216928,
168
+ "endothelial cell of artery": 206630,
169
+ "reticulocyte": 676,
170
+ "vein endothelial cell": 261893,
171
+ "pericyte": 452287,
172
+ "peridermal cell": 834,
173
+ "basal cell": 438249,
174
+ "articular chondrocyte": 128,
175
+ "mesenchymal cell": 434310,
176
+ "connective tissue cell": 71,
177
+ "erythrocyte": 284387,
178
+ "hypertrophic chondrocyte": 146,
179
+ "megakaryocyte": 75963,
180
+ "muscle fibroblast": 34365,
181
+ "mature NK T cell": 138818,
182
+ "myeloid cell": 329228,
183
+ "kidney interstitial cell": 12086,
184
+ "epithelial cell of nephron": 9155,
185
+ "mesenchymal stem cell": 75348,
186
+ "epithelial cell of proximal tubule": 272380,
187
+ "kidney connecting tubule epithelial cell": 24270,
188
+ "epithelial cell of glomerular capsule": 940,
189
+ "nephron tubule epithelial cell": 750,
190
+ "kidney collecting duct cell": 670,
191
+ "stromal cell of ovary": 49866,
192
+ "granulosa cell": 32638,
193
+ "theca cell": 7196,
194
+ "epithelial cell": 699771,
195
+ "epithelial cell of alveolus of lung": 3875,
196
+ "goblet cell": 16960,
197
+ "ionocyte": 2292,
198
+ "hepatocyte": 423766,
199
+ "ciliated epithelial cell": 6479,
200
+ "neuroendocrine cell": 1145,
201
+ "club cell": 94729,
202
+ "brush cell": 1345,
203
+ "platelet": 38070,
204
+ "central nervous system macrophage": 170125,
205
+ "ependymal cell": 97775,
206
+ "vascular associated smooth muscle cell": 235809,
207
+ "mesothelial cell": 180169,
208
+ "neutrophil": 147519,
209
+ "monocyte": 378010,
210
+ "stromal cell": 1139185,
211
+ "cord blood hematopoietic stem cell": 120,
212
+ "mast cell": 140002,
213
+ "professional antigen presenting cell": 3636,
214
+ "erythroid lineage cell": 6784,
215
+ "primordial germ cell": 2661,
216
+ "alternatively activated macrophage": 13712,
217
+ "L2/3-6 intratelencephalic projecting glutamatergic neuron": 4474360,
218
+ "pvalb GABAergic cortical interneuron": 544593,
219
+ "chandelier pvalb GABAergic cortical interneuron": 64124,
220
+ "sst GABAergic cortical interneuron": 478132,
221
+ "Bergmann glial cell": 52764,
222
+ "glutamatergic neuron": 10341906,
223
+ "transit amplifying cell of colon": 1716,
224
+ "CD8-alpha-beta-positive, alpha-beta intraepithelial T cell": 9,
225
+ "intestinal crypt stem cell": 13171,
226
+ "intestinal tuft cell": 136,
227
+ "enteric smooth muscle cell": 12676,
228
+ "smooth muscle cell of large intestine": 1194,
229
+ "interstitial cell of Cajal": 3948,
230
+ "smooth muscle cell of small intestine": 255,
231
+ "cardiac valve cell": 89903,
232
+ "primitive red blood cell": 61602,
233
+ "neurectodermal cell": 15338,
234
+ "midbrain dopaminergic neuron": 223991,
235
+ "paraxial cell": 44784,
236
+ "mesodermal cell": 6565020,
237
+ "splanchnic mesodermal cell": 274057,
238
+ "neuroplacodal cell": 30685,
239
+ "premigratory neural crest cell": 133537,
240
+ "notochordal cell": 30276,
241
+ "hemangioblast": 7952,
242
+ "spinal cord interneuron": 69801,
243
+ "endodermal cell": 13635,
244
+ "surface ectodermal cell": 8005,
245
+ "gut endothelial cell": 9987,
246
+ "anterior visceral endoderm cell": 2078,
247
+ "activated CD4-negative, CD8-negative type I NK T cell": 2035,
248
+ "parietal epithelial cell": 6234,
249
+ "kidney loop of Henle epithelial cell": 2578,
250
+ "kidney loop of Henle thin descending limb epithelial cell": 42271,
251
+ "malignant cell": 1185459,
252
+ "exhausted T cell": 21551,
253
+ "CD4-positive helper T cell": 81404,
254
+ "CD8-positive, alpha-beta T cell": 820541,
255
+ "promonocyte": 25130,
256
+ "granulocyte": 52137,
257
+ "osteoclast": 4143,
258
+ "promyelocyte": 5841,
259
+ "Kupffer cell": 57611,
260
+ "pre-conventional dendritic cell": 340,
261
+ "myelocyte": 7951,
262
+ "plasmacytoid dendritic cell": 65151,
263
+ "common dendritic progenitor": 2427,
264
+ "mural cell": 382262,
265
+ "myofibroblast cell": 121210,
266
+ "glial cell": 85689,
267
+ "lymphocyte": 61696,
268
+ "retinal ganglion cell": 193525,
269
+ "lamp5 GABAergic cortical interneuron": 374201,
270
+ "luminal epithelial cell of mammary gland": 232901,
271
+ "endothelial cell of vascular tree": 296575,
272
+ "mammary gland epithelial cell": 73026,
273
+ "adipocyte of breast": 4072,
274
+ "IgA plasma cell": 33332,
275
+ "class switched memory B cell": 13393,
276
+ "naive B cell": 264747,
277
+ "IgG plasma cell": 7559,
278
+ "unswitched memory B cell": 8330,
279
+ "centrilobular region hepatocyte": 43795,
280
+ "periportal region hepatocyte": 64925,
281
+ "blood cell": 1735,
282
+ "tracheal epithelial cell": 91061,
283
+ "medium spiny neuron": 94404,
284
+ "inhibitory interneuron": 161476,
285
+ "cell": 2298,
286
+ "uterine smooth muscle cell": 10162,
287
+ "decidual natural killer cell, human": 9147,
288
+ "endothelial cell of uterus": 5504,
289
+ "trophoblast giant cell": 54,
290
+ "embryonic fibroblast": 207,
291
+ "cardiac endothelial cell": 47096,
292
+ "fibroblast of cardiac tissue": 328488,
293
+ "immature innate lymphoid cell": 52909,
294
+ "cardiac muscle myoblast": 20061,
295
+ "lymphoid lineage restricted progenitor cell": 17592,
296
+ "smooth muscle myoblast": 3485,
297
+ "neuronal receptor cell": 1736,
298
+ "fibroblast of lymphatic vessel": 425,
299
+ "flat midget bipolar cell": 452566,
300
+ "classical monocyte": 798528,
301
+ "conventional dendritic cell": 85061,
302
+ "CD14-positive monocyte": 492097,
303
+ "effector memory CD8-positive, alpha-beta T cell": 228419,
304
+ "CD14-positive, CD16-positive monocyte": 3762,
305
+ "central memory CD4-positive, alpha-beta T cell": 664927,
306
+ "CD56-positive, CD161-positive immature natural killer cell, human": 910,
307
+ "CD16-positive, CD56-dim natural killer cell, human": 242315,
308
+ "CD8-positive, alpha-beta cytotoxic T cell": 48837,
309
+ "supporting cell": 25177,
310
+ "interstitial cell of ovary": 15608,
311
+ "hematopoietic cell": 59088,
312
+ "neural cell": 6988055,
313
+ "germ cell": 5414,
314
+ "ovarian surface epithelial cell": 3128,
315
+ "L4/5 intratelencephalic projecting glutamatergic neuron of the primary motor cortex": 327769,
316
+ "L6 corticothalamic-projecting glutamatergic cortical neuron": 146758,
317
+ "vip GABAergic cortical interneuron": 584164,
318
+ "L6 intratelencephalic projecting glutamatergic neuron of the primary motor cortex": 81460,
319
+ "hippocampal neuron": 64655,
320
+ "L6b glutamatergic cortical neuron": 137215,
321
+ "L5/6 near-projecting glutamatergic neuron of the primary motor cortex": 38791,
322
+ "L5 extratelencephalic projecting glutamatergic cortical neuron": 47098,
323
+ "pyramidal neuron": 25116,
324
+ "sncg GABAergic cortical interneuron": 159382,
325
+ "corticothalamic-projecting glutamatergic cortical neuron": 160820,
326
+ "L2/3 intratelencephalic projecting glutamatergic neuron of the primary motor cortex": 12699,
327
+ "sst chodl GABAergic cortical interneuron": 2506,
328
+ "cortical interneuron": 17508,
329
+ "vascular leptomeningeal cell (Mmus)": 84,
330
+ "meis2 expressing cortical GABAergic cell": 72,
331
+ "Cajal-Retzius cell": 52222,
332
+ "fibroblast of lung": 41525,
333
+ "type I pneumocyte": 84264,
334
+ "type II pneumocyte": 317250,
335
+ "gut absorptive cell": 2017,
336
+ "progenitor cell": 102352,
337
+ "intestinal crypt stem cell of large intestine": 1020,
338
+ "transit amplifying cell of small intestine": 1050,
339
+ "intestinal crypt stem cell of small intestine": 864,
340
+ "secretory cell": 93111,
341
+ "intestine goblet cell": 5700,
342
+ "enterocyte of epithelium of large intestine": 21896,
343
+ "paneth cell of epithelium of small intestine": 61,
344
+ "intestinal enteroendocrine cell": 560,
345
+ "duodenum glandular cell": 2,
346
+ "large intestine goblet cell": 7884,
347
+ "T follicular helper cell": 25073,
348
+ "GABAergic neuron": 2473147,
349
+ "fibroblast of mammary gland": 2345758,
350
+ "perivascular cell": 202406,
351
+ "luminal adaptive secretory precursor cell of mammary gland": 370605,
352
+ "endothelial tip cell": 33694,
353
+ "CD8-positive, alpha-beta memory T cell": 127077,
354
+ "luminal hormone-sensing cell of mammary gland": 246454,
355
+ "myoepithelial cell of mammary gland": 64397,
356
+ "capillary endothelial cell": 221145,
357
+ "brain vascular cell": 128078,
358
+ "dopaminergic neuron": 45014,
359
+ "serotonergic neuron": 3977,
360
+ "cerebellar neuron": 259346,
361
+ "neural progenitor cell": 323569,
362
+ "CD4-positive, alpha-beta T cell": 1055045,
363
+ "glycinergic amacrine cell": 462394,
364
+ "starburst amacrine cell": 51890,
365
+ "retinal rod cell": 1321083,
366
+ "Mueller cell": 264909,
367
+ "rod bipolar cell": 157105,
368
+ "ON-bipolar cell": 52848,
369
+ "OFF-bipolar cell": 43664,
370
+ "retinal cone cell": 87116,
371
+ "amacrine cell": 290117,
372
+ "melanocyte": 22531,
373
+ "retinal pigment epithelial cell": 7979,
374
+ "adipocyte": 145654,
375
+ "fibro/adipogenic progenitor cell": 15532,
376
+ "neuron associated cell": 888,
377
+ "inhibitory motor neuron": 483,
378
+ "motor neuron": 484,
379
+ "precursor B cell": 56201,
380
+ "interneuron": 34115,
381
+ "fallopian tube secretory epithelial cell": 105472,
382
+ "suprabasal keratinocyte": 22807,
383
+ "basal cell of epidermis": 83046,
384
+ "proerythroblast": 12294,
385
+ "kidney loop of Henle ascending limb epithelial cell": 8025,
386
+ "collagen secreting cell": 12335,
387
+ "epithelial cell of proximal tubule segment 1": 5885,
388
+ "MHC-II-positive classical monocyte": 77,
389
+ "naive T cell": 16638,
390
+ "chondroblast": 25460,
391
+ "osteoblast": 13763,
392
+ "myoblast": 221886,
393
+ "skeletal muscle myoblast": 12575,
394
+ "Schwann cell precursor": 459201,
395
+ "keratinocyte": 831988,
396
+ "inflammatory macrophage": 11854,
397
+ "monocyte-derived dendritic cell": 30,
398
+ "Langerhans cell": 730,
399
+ "cytotoxic T cell": 4099,
400
+ "forebrain neuroblast": 19384,
401
+ "chandelier cell": 898,
402
+ "caudal ganglionic eminence derived GABAergic cortical interneuron": 45259,
403
+ "basal cell of prostate epithelium": 100016,
404
+ "epithelial cell of urethra": 10903,
405
+ "luminal cell of prostate epithelium": 60854,
406
+ "prostate gland microvascular endothelial cell": 8454,
407
+ "prostate stromal cell": 1040,
408
+ "smooth muscle cell of prostate": 12018,
409
+ "lymphocyte of B lineage": 604,
410
+ "smooth muscle cell of the pulmonary artery": 5782,
411
+ "acinar cell of salivary gland": 97013,
412
+ "memory B cell": 116175,
413
+ "adventitial cell": 35481,
414
+ "duct epithelial cell": 5535,
415
+ "endothelial cell of hepatic sinusoid": 55050,
416
+ "non-classical monocyte": 134769,
417
+ "plasmablast": 20107,
418
+ "glomerular endothelial cell": 31920,
419
+ "renal intercalated cell": 428,
420
+ "vasa recta ascending limb cell": 670,
421
+ "vasa recta descending limb cell": 318,
422
+ "kidney epithelial cell": 8247,
423
+ "renal beta-intercalated cell": 3596,
424
+ "renal alpha-intercalated cell": 5747,
425
+ "urothelial cell": 45,
426
+ "renal principal cell": 2709,
427
+ "cell of skeletal muscle": 986689,
428
+ "thymocyte": 25853,
429
+ "pro-T cell": 5208,
430
+ "hematopoietic precursor cell": 10899,
431
+ "stem cell": 26362,
432
+ "paneth cell": 1204,
433
+ "type L enteroendocrine cell": 538,
434
+ "type EC enteroendocrine cell": 1642,
435
+ "hepatic stellate cell": 6302,
436
+ "cholangiocyte": 4645,
437
+ "endothelial cell of periportal hepatic sinusoid": 2930,
438
+ "endothelial cell of pericentral hepatic sinusoid": 7562,
439
+ "alveolar macrophage": 428496,
440
+ "effector memory CD4-positive, alpha-beta T cell": 86897,
441
+ "myeloid leukocyte": 13316,
442
+ "CD1c-positive myeloid dendritic cell": 90776,
443
+ "myeloid dendritic cell, human": 204,
444
+ "stratified epithelial cell": 26520,
445
+ "epithelial cell of stratum germinativum of esophagus": 405,
446
+ "mononuclear phagocyte": 41624,
447
+ "mucus secreting cell": 1815,
448
+ "regular atrial cardiac myocyte": 137963,
449
+ "Tc1 cell": 4156,
450
+ "endothelial cell of placenta": 21094,
451
+ "Hofbauer cell": 40343,
452
+ "group 3 innate lymphoid cell, human": 187,
453
+ "kidney collecting duct epithelial cell": 707,
454
+ "fenestrated cell": 3592,
455
+ "early T lineage precursor": 672,
456
+ "CD4-positive, alpha-beta memory T cell": 17487,
457
+ "erythroid progenitor cell": 776619,
458
+ "central memory CD8-positive, alpha-beta T cell": 26584,
459
+ "gamma-delta T cell": 113818,
460
+ "early promyelocyte": 4281,
461
+ "CD16-negative, CD56-bright natural killer cell, human": 37098,
462
+ "megakaryocyte progenitor cell": 423,
463
+ "late promyelocyte": 1039,
464
+ "basophil mast progenitor cell": 275,
465
+ "CD4-positive, alpha-beta cytotoxic T cell": 18248,
466
+ "airway submucosal gland duct basal cell": 4957,
467
+ "serous secreting cell of bronchus submucosal gland": 16876,
468
+ "ciliated cell": 32399,
469
+ "lung secretory cell": 20594,
470
+ "myoepithelial cell": 2506,
471
+ "lung macrophage": 15050,
472
+ "mesenchymal stem cell of adipose tissue": 31777,
473
+ "regular ventricular cardiac myocyte": 235052,
474
+ "choroid plexus epithelial cell": 77830,
475
+ "aortic endothelial cell": 969,
476
+ "fibrocyte": 386,
477
+ "kidney loop of Henle thin ascending limb epithelial cell": 28383,
478
+ "kidney interstitial alternatively activated macrophage": 6719,
479
+ "renal interstitial pericyte": 11322,
480
+ "papillary tips cell": 714,
481
+ "fast muscle cell": 2253,
482
+ "skeletal muscle fiber": 1683,
483
+ "slow muscle cell": 1564,
484
+ "skeletal muscle satellite cell": 439526,
485
+ "retinal blood vessel endothelial cell": 980,
486
+ "non-myelinating Schwann cell": 587,
487
+ "lung perichondrial fibroblast": 1769,
488
+ "respiratory suprabasal cell": 4061,
489
+ "lung pericyte": 14470,
490
+ "memory T cell": 19260,
491
+ "leptomeningeal cell": 3653,
492
+ "Sertoli cell": 1349,
493
+ "macroglial cell": 38258,
494
+ "retinal bipolar neuron": 137284,
495
+ "cerebellar granule cell": 763044,
496
+ "intermediate monocyte": 6477,
497
+ "erythroblast": 78556,
498
+ "midzonal region hepatocyte": 7649,
499
+ "endothelial cell of venule": 17454,
500
+ "helper T cell": 25512,
501
+ "mucosal invariant T cell": 82156,
502
+ "T-helper 17 cell": 465,
503
+ "olfactory epithelial cell": 150663,
504
+ "auditory epithelial cell": 131901,
505
+ "endo-epithelial cell": 167871,
506
+ "epithelial cell of amnion": 80603,
507
+ "intermediate mesodermal cell": 88574,
508
+ "ectodermal cell": 50893,
509
+ "metanephric mesenchyme stem cell": 7479,
510
+ "ureteric bud cell": 5049,
511
+ "pituitary gland cell": 10257,
512
+ "pancreatic acinar cell": 45689,
513
+ "lens epithelial cell": 4935,
514
+ "epithelial cell of parathyroid gland": 540,
515
+ "epithelial cell of thymus": 651,
516
+ "intrahepatic cholangiocyte": 4572,
517
+ "epithelial cell of thyroid gland": 1057,
518
+ "peripheral nervous system neuron": 185140,
519
+ "neural crest cell": 1710,
520
+ "sensory neuron": 423,
521
+ "cerebral cortex endothelial cell": 42882,
522
+ "microvascular endothelial cell": 7776,
523
+ "brain pericyte": 5548,
524
+ "endocardial cell": 69205,
525
+ "adipocyte of epicardial fat of left ventricle": 732,
526
+ "CD14-low, CD16-positive monocyte": 82303,
527
+ "DN4 thymocyte": 5218,
528
+ "pancreatic stellate cell": 27724,
529
+ "pancreatic ductal cell": 163344,
530
+ "type B pancreatic cell": 186236,
531
+ "CD8-positive, alpha-beta memory T cell, CD45RO-positive": 15518,
532
+ "alpha-beta T cell": 28679,
533
+ "effector memory CD8-positive, alpha-beta T cell, terminally differentiated": 6391,
534
+ "brown preadipocyte": 37252,
535
+ "brown adipocyte": 73109,
536
+ "lung ciliated cell": 2749,
537
+ "effector CD8-positive, alpha-beta T cell": 120129,
538
+ "T-helper 22 cell": 29502,
539
+ "myeloid dendritic cell": 8631,
540
+ "dendritic cell, human": 5529,
541
+ "erythroid progenitor cell, mammalian": 922,
542
+ "ILC1, human": 825,
543
+ "CD34-positive, CD38-negative hematopoietic stem cell": 561,
544
+ "IgM plasma cell": 1012,
545
+ "T-helper 1 cell": 99,
546
+ "group 2 innate lymphoid cell, human": 66,
547
+ "myeloid lineage restricted progenitor cell": 2530,
548
+ "T-helper 2 cell": 35,
549
+ "astrocyte of the cerebral cortex": 277552,
550
+ "near-projecting glutamatergic cortical neuron": 112806,
551
+ "effector CD4-positive, alpha-beta T cell": 99747,
552
+ "type I NK T cell": 44099,
553
+ "CD141-positive myeloid dendritic cell": 4243,
554
+ "mature conventional dendritic cell": 228,
555
+ "melanocyte of skin": 2528,
556
+ "pancreatic A cell": 85447,
557
+ "pancreatic D cell": 33374,
558
+ "pancreatic PP cell": 7920,
559
+ "CD14-positive, CD16-negative classical monocyte": 42474,
560
+ "CD4-positive, CD25-positive, alpha-beta regulatory T cell": 1167,
561
+ "kidney connecting tubule principal cell": 317,
562
+ "epithelial cell of large intestine": 8616,
563
+ "Purkinje cell": 270713,
564
+ "granule cell": 63342,
565
+ "neuron associated cell (sensu Vertebrata)": 36057,
566
+ "stellate neuron": 29019,
567
+ "neuronal brush cell": 13107,
568
+ "myotube": 34568,
569
+ "muscle precursor cell": 65050,
570
+ "transitional stage B cell": 50088,
571
+ "immature neutrophil": 16,
572
+ "medial ganglionic eminence derived interneuron": 1431,
573
+ "caudal ganglionic eminence derived interneuron": 852,
574
+ "bronchus fibroblast of lung": 11918,
575
+ "pigmented epithelial cell": 21792,
576
+ "smooth muscle cell of sphincter of pupil": 1107,
577
+ "IgG plasmablast": 599,
578
+ "IgA plasmablast": 413,
579
+ "plasmatocyte": 1663,
580
+ "kidney cortex artery cell": 1199,
581
+ "kidney capillary endothelial cell": 168,
582
+ "kidney proximal straight tubule epithelial cell": 13294,
583
+ "cardiac muscle cell": 765261,
584
+ "mesothelial cell of epicardium": 2111,
585
+ "fetal cardiomyocyte": 128,
586
+ "cardiac mesenchymal cell": 4,
587
+ "pneumocyte": 3852,
588
+ "mononuclear cell": 922,
589
+ "tonsil germinal center B cell": 700,
590
+ "centroblast": 110,
591
+ "centrocyte": 56,
592
+ "macrophage dendritic cell progenitor": 218,
593
+ "immature NK T cell": 3517,
594
+ "neuroblast (sensu Vertebrata)": 2062822,
595
+ "alveolar type 2 fibroblast cell": 33366,
596
+ "tracheobronchial smooth muscle cell": 18892,
597
+ "lung goblet cell": 724,
598
+ "respiratory basal cell": 132261,
599
+ "brush cell of trachebronchial tree": 912,
600
+ "mesothelial fibroblast": 78,
601
+ "bladder urothelial cell": 7613,
602
+ "bladder cell": 5132,
603
+ "neoplastic cell": 25559,
604
+ "endothelial cell of coronary artery": 8166,
605
+ "cardiac neuron": 11728,
606
+ "OFF retinal ganglion cell": 896,
607
+ "ON retinal ganglion cell": 482,
608
+ "lung resident memory CD8-positive, alpha-beta T cell": 6137,
609
+ "lung resident memory CD4-positive, alpha-beta T cell": 2732,
610
+ "deuterosomal cell": 266,
611
+ "granulocytopoietic cell": 16454,
612
+ "basophil": 816,
613
+ "PP cell": 408,
614
+ "pancreatic epsilon cell": 95,
615
+ "fibroblast of connective tissue of prostate": 7852,
616
+ "double negative T regulatory cell": 265,
617
+ "progenitor cell of mammary luminal epithelium": 8436,
618
+ "lactocyte": 5949,
619
+ "vascular lymphangioblast": 4051,
620
+ "lung endothelial cell": 35063,
621
+ "respiratory goblet cell": 1842,
622
+ "cardiac pacemaker cell of sinoatrial node": 792,
623
+ "activated CD4-positive, alpha-beta T cell": 3104,
624
+ "differentiation-committed oligodendrocyte precursor": 3019,
625
+ "glycinergic neuron": 93491,
626
+ "keratinocyte stem cell": 5652,
627
+ "bronchial smooth muscle cell": 11900,
628
+ "epidermal cell": 2044,
629
+ "basal epithelial cell of tracheobronchial tree": 927,
630
+ "neural stem cell": 272,
631
+ "mature alpha-beta T cell": 54887,
632
+ "brush cell of epithelium proper of large intestine": 313,
633
+ "smooth muscle cell of trachea": 337,
634
+ "ciliated columnar cell of tracheobronchial tree": 71880,
635
+ "early pro-B cell": 32954,
636
+ "pulmonary interstitial fibroblast": 176,
637
+ "neuroepithelial stem cell": 163,
638
+ "lung neuroendocrine cell": 627,
639
+ "common lymphoid progenitor": 7871,
640
+ "plasmacytoid dendritic cell, human": 3061,
641
+ "activated CD4-positive, alpha-beta T cell, human": 1153,
642
+ "lateral mesodermal cell": 2491777,
643
+ "hypothalamus cell": 342662,
644
+ "primitive erythroid progenitor": 427024,
645
+ "retinal progenitor cell": 340536,
646
+ "spinal cord motor neuron": 152116,
647
+ "cranial motor neuron": 98269,
648
+ "enteric neuron": 37246,
649
+ "spiral ganglion neuron": 27856,
650
+ "cerebral cortex GABAergic interneuron": 565934,
651
+ "embryonic blood vessel endothelial progenitor cell": 10477,
652
+ "sympathetic neuron": 55650,
653
+ "olfactory receptor cell": 10237,
654
+ "extraembryonic cell": 717,
655
+ "fibroblast of breast": 5733,
656
+ "endothelial cell of umbilical vein": 56790,
657
+ "transit amplifying cell": 3564,
658
+ "M cell of gut": 52,
659
+ "hypendymal cell": 2379,
660
+ "oogonial cell": 4905,
661
+ "female germ cell": 1990,
662
+ "male germ cell": 1271,
663
+ "oocyte": 316,
664
+ "basket cell": 20982,
665
+ "epithelial cell of prostate": 15037,
666
+ "basal epithelial cell of prostatic duct": 12694,
667
+ "contractile cell": 726,
668
+ "mature T cell": 111587,
669
+ "eosinophil": 150,
670
+ "corneal epithelial cell": 15616,
671
+ "corneal endothelial cell": 4434,
672
+ "activated CD8-positive, alpha-beta T cell": 15428,
673
+ "follicular B cell": 3924,
674
+ "colon macrophage": 9,
675
+ "myelinating Schwann cell": 43810,
676
+ "cell in vitro": 23880,
677
+ "S cone cell": 813,
678
+ "lung interstitial macrophage": 152,
679
+ "Leydig cell": 549,
680
+ "L2/3 intratelencephalic projecting glutamatergic neuron": 92223,
681
+ "enterocyte of colon": 12280,
682
+ "mesenchymal lymphangioblast": 2134,
683
+ "colon epithelial cell": 3672,
684
+ "CD34-positive, CD56-positive, CD117-positive common innate lymphoid precursor, human": 1012,
685
+ "NKp44-positive group 3 innate lymphoid cell, human": 748,
686
+ "NKp44-negative group 3 innate lymphoid cell, human": 374,
687
+ "primary sensory neuron (sensu Teleostei)": 22,
688
+ "type N enteroendocrine cell": 22,
689
+ "progenitor cell of endocrine pancreas": 22,
690
+ "CD4-positive, alpha-beta thymocyte": 10413,
691
+ "fibroblast of connective tissue of nonglandular part of prostate": 3872,
692
+ "fibroblast of connective tissue of glandular part of prostate": 1609,
693
+ "CD8-positive, alpha-beta thymocyte": 4748,
694
+ "enucleate erythrocyte": 802,
695
+ "lung microvascular endothelial cell": 118,
696
+ "serous cell of epithelium of bronchus": 8,
697
+ "pulmonary ionocyte": 529,
698
+ "epithelial cell of pancreas": 1184,
699
+ "cultured cell": 183749,
700
+ "reticular cell": 1742,
701
+ "inflammatory cell": 1504,
702
+ "stem cell of epidermis": 1247,
703
+ "pigmented ciliary epithelial cell": 8987,
704
+ "non-pigmented ciliary epithelial cell": 2750,
705
+ "ciliary muscle cell": 7528,
706
+ "acinar cell": 326656,
707
+ "endocrine cell": 116526,
708
+ "non-terminally differentiated cell": 4,
709
+ "pre-natural killer cell": 4,
710
+ "midget ganglion cell of retina": 79066,
711
+ "GABAergic amacrine cell": 328894,
712
+ "diffuse bipolar 3b cell": 15386,
713
+ "diffuse bipolar 2 cell": 38767,
714
+ "ON parasol ganglion cell": 2100,
715
+ "diffuse bipolar 1 cell": 20617,
716
+ "invaginating midget bipolar cell": 22451,
717
+ "diffuse bipolar 3a cell": 17037,
718
+ "H2 horizontal cell": 6438,
719
+ "OFFx cell": 8003,
720
+ "H1 horizontal cell": 25682,
721
+ "diffuse bipolar 4 cell": 17039,
722
+ "diffuse bipolar 6 cell": 6525,
723
+ "OFF parasol ganglion cell": 324,
724
+ "hepatic pit cell": 6559,
725
+ "follicular dendritic cell": 2,
726
+ "mature gamma-delta T cell": 1406,
727
+ "thalamic excitatory neuron": 76744,
728
+ "small bistratified retinal ganglion cell": 2226,
729
+ "mature microglial cell": 11466,
730
+ "intestinal epithelial cell": 29221,
731
+ "epithelial cell of lung": 29856,
732
+ "CD38-negative naive B cell": 4952,
733
+ "urethra urothelial cell": 29251,
734
+ "seminal vesicle glandular cell": 5439,
735
+ "type I cell of adrenal cortex": 3043,
736
+ "germinal center B cell": 217,
737
+ "kidney cell": 4410,
738
+ "kidney loop of Henle medullary thick ascending limb epithelial cell": 4792,
739
+ "kidney loop of Henle cortical thick ascending limb epithelial cell": 2800,
740
+ "kidney cortex tubule cell": 984,
741
+ "kidney glomerular epithelial cell": 192,
742
+ "preadipocyte": 128328,
743
+ "type 6 cone bipolar cell (sensu Mus)": 24353,
744
+ "type 5a cone bipolar cell": 20325,
745
+ "type 7 cone bipolar cell (sensu Mus)": 16313,
746
+ "type 3b cone bipolar cell": 12627,
747
+ "type 3a cone bipolar cell": 11349,
748
+ "type 5b cone bipolar cell": 9221,
749
+ "type 5 cone bipolar cell (sensu Mus)": 11677,
750
+ "type 8 cone bipolar cell (sensu Mus)": 3790,
751
+ "type 9 cone bipolar cell (sensu Mus)": 3519,
752
+ "type 2 cone bipolar cell (sensu Mus)": 899,
753
+ "type 4 cone bipolar cell (sensu Mus)": 759,
754
+ "type 1 cone bipolar cell (sensu Mus)": 3221,
755
+ "cerebellar granule cell precursor": 18256,
756
+ "unipolar brush cell": 3600,
757
+ "glioblast": 2686,
758
+ "immature astrocyte": 4156,
759
+ "meningeal macrophage": 1348,
760
+ "noradrenergic cell": 6,
761
+ "multi-ciliated epithelial cell": 40320,
762
+ "pulmonary artery endothelial cell": 18256,
763
+ "cone retinal bipolar cell": 884,
764
+ "retinal astrocyte": 70,
765
+ "efferent neuron": 711,
766
+ "enterocyte of epithelium proper of ileum": 474,
767
+ "ileal goblet cell": 68,
768
+ "smooth muscle fiber of ileum": 52,
769
+ "enteroendocrine cell of small intestine": 53,
770
+ "aortic smooth muscle cell": 5680,
771
+ "mesothelial cell of visceral pleura": 336,
772
+ "ciliated cell of the bronchus": 19278,
773
+ "squamous epithelial cell": 2828,
774
+ "nasal mucosa goblet cell": 54749,
775
+ "memory regulatory T cell": 45,
776
+ "naive regulatory T cell": 1512,
777
+ "myeloid suppressor cell": 44926,
778
+ "adipose macrophage": 5054,
779
+ "absorptive cell": 110,
780
+ "intestinal crypt stem cell of colon": 53,
781
+ "mature astrocyte": 14537,
782
+ "hair follicular keratinocyte": 14465,
783
+ "sebum secreting cell": 369,
784
+ "granular cell of epidermis": 444,
785
+ "anterior lens cell": 4920,
786
+ "secondary lens fiber": 1493,
787
+ "lens fiber cell": 731,
788
+ "A2 amacrine cell": 759,
789
+ "sperm": 10,
790
+ "abnormal cell": 11002,
791
+ "myometrial cell": 144,
792
+ "epithelial cell of uterus": 87,
793
+ "prickle cell": 29060,
794
+ "Merkel cell": 964,
795
+ "cortical thymic epithelial cell": 8923,
796
+ "medullary thymic epithelial cell": 684,
797
+ "epicardial adipocyte": 3447,
798
+ "peritubular capillary endothelial cell": 3,
799
+ "conjunctival epithelial cell": 1933,
800
+ "glomerular capillary endothelial cell": 218,
801
+ "columnar/cuboidal epithelial cell": 4,
802
+ "kidney resident macrophage": 40,
803
+ "ON-blue cone bipolar cell": 1599,
804
+ "CD8-alpha alpha positive, gamma-delta intraepithelial T cell": 1480,
805
+ "NKp46-positive innate lymphoid cell, human": 12750,
806
+ "neutrophil progenitor cell": 16,
807
+ "skeletal muscle satellite stem cell": 1384,
808
+ "mucosal type mast cell": 294,
809
+ "metallothionein-positive alveolar macrophage": 46,
810
+ "cerebral cortex neuron": 6961,
811
+ "basal cell of epithelium of trachea": 95,
812
+ "tracheal goblet cell": 336,
813
+ "photoreceptor cell": 2970,
814
+ "cochlea auditory hair cell": 430,
815
+ "pinealocyte": 340,
816
+ "iris pigment epithelial cell": 310,
817
+ "radial glial cell": 1331,
818
+ "GABAergic interneuron": 4622,
819
+ "pancreatic endocrine cell": 11527,
820
+ "endothelial cell of sinusoid": 212,
821
+ "DN3 thymocyte": 3597,
822
+ "DN1 thymic pro-T cell": 1622,
823
+ "parasol ganglion cell of retina": 9659,
824
+ "epithelial cell of proximal tubule segment 3": 3962,
825
+ "valve interstitial cell": 174,
826
+ "valve endothelial cell": 150,
827
+ "myocyte of sinoatrial node": 63,
828
+ "colon goblet cell": 226,
829
+ "enteroendocrine cell of colon": 54,
830
+ "paneth cell of colon": 221,
831
+ "cholinergic neuron": 13860,
832
+ "L4/5 intratelencephalic projecting glutamatergic neuron": 6704,
833
+ "L6 intratelencephalic projecting glutamatergic neuron": 2232,
834
+ "L3 intratelencephalic projecting glutamatergic neuron": 862,
835
+ "tanycyte": 2560,
836
+ "IgG-negative class switched memory B cell": 3171,
837
+ "IgG memory B cell": 1557,
838
+ "indirect pathway medium spiny neuron": 5000,
839
+ "direct pathway medium spiny neuron": 4055,
840
+ "elicited macrophage": 68370,
841
+ "alveolar type 1 fibroblast cell": 35555,
842
+ "respiratory hillock cell": 7550,
843
+ "epithelial cell of lower respiratory tract": 10120,
844
+ "serous secreting cell": 2615,
845
+ "tracheobronchial serous cell": 3820,
846
+ "tracheobronchial goblet cell": 1650,
847
+ "bronchial goblet cell": 1473,
848
+ "epithelial fate stem cell": 200,
849
+ "lymphatic endothelial cell of medulla ceiling": 364,
850
+ "lymphatic endothelial cell of subcapsular sinus floor": 283,
851
+ "lymphatic endothelial cell of subcapsular sinus ceiling": 194,
852
+ "lymph node lymphatic vessel endothelial cell": 52,
853
+ "tissue-resident macrophage": 3616,
854
+ "glandular epithelial cell": 10369,
855
+ "L4 intratelencephalic projecting glutamatergic neuron": 3992,
856
+ "L5/6 near-projecting glutamatergic neuron": 838,
857
+ "forebrain radial glial cell": 17848,
858
+ "white adipocyte": 6056,
859
+ "precursor cell": 145,
860
+ "primary cultured cell": 29,
861
+ "liver dendritic cell": 642,
862
+ "giant bipolar cell": 6497,
863
+ "eurydendroid cell": 208,
864
+ "type A enteroendocrine cell": 841,
865
+ "type D enteroendocrine cell": 69,
866
+ "serous cell of epithelium of trachea": 20,
867
+ "T follicular regulatory cell": 32,
868
+ "enterocyte of epithelium of small intestine": 336,
869
+ "tuft cell of colon": 199,
870
+ "small intestine goblet cell": 149,
871
+ "epithelial cell of small intestine": 60,
872
+ "BEST4+ intestinal epithelial cell, human": 18,
873
+ "microfold cell of epithelium of small intestine": 11,
874
+ "foveolar cell of stomach": 98557,
875
+ "mucous neck cell": 30510,
876
+ "type G enteroendocrine cell": 9971,
877
+ "natural T-regulatory cell": 9175,
878
+ "peptic cell": 2821,
879
+ "P/D1 enteroendocrine cell": 2307,
880
+ "parietal cell": 686,
881
+ "eye photoreceptor cell": 318,
882
+ "keratocyte": 242,
883
+ "preosteoblast": 240,
884
+ "endosteal cell": 110,
885
+ "immature natural killer cell": 5,
886
+ "basal cell of epithelium of bronchus": 19220,
887
+ "brush cell of bronchus": 308,
888
+ "sensory neuron of dorsal root ganglion": 31650,
889
+ "parasympathetic neuron": 28182,
890
+ "immature T cell": 1318,
891
+ "epithelial cell of esophagus": 24915,
892
+ "glandular cell of esophagus": 990,
893
+ "perineuronal satellite cell": 7154,
894
+ "olfactory ensheathing cell": 1946
895
+ },
896
+ "tissue": {
897
+ "occipital cortex": 162963,
898
+ "trophoblast": 30396,
899
+ "dermis": 25135,
900
+ "skin of body": 166875,
901
+ "small intestine": 46378,
902
+ "dorsolateral prefrontal cortex": 3412403,
903
+ "fovea centralis": 487732,
904
+ "peripheral region of retina": 2732577,
905
+ "bone marrow": 488335,
906
+ "liver": 767833,
907
+ "thymus": 257017,
908
+ "kidney": 1923863,
909
+ "hindlimb": 124567,
910
+ "cerebral cortex": 3652821,
911
+ "lung": 3548542,
912
+ "lymph node": 688503,
913
+ "axilla": 14419,
914
+ "brain": 2497814,
915
+ "adrenal gland": 14065,
916
+ "bone spine": 3772,
917
+ "pleural effusion": 56289,
918
+ "cerebellum lobule": 52243,
919
+ "blood": 6877569,
920
+ "ovary": 80987,
921
+ "cerebellum": 2415752,
922
+ "thalamic complex": 808199,
923
+ "pleura": 577320,
924
+ "primary motor cortex": 1383100,
925
+ "myelencephalon": 182804,
926
+ "pons": 528502,
927
+ "midbrain": 1080027,
928
+ "cerebral nuclei": 704599,
929
+ "hypothalamus": 617214,
930
+ "cortex of kidney": 743525,
931
+ "entorhinal cortex": 271265,
932
+ "colonic epithelium": 12167,
933
+ "colon": 176975,
934
+ "ileum": 313492,
935
+ "hindgut": 3942,
936
+ "embryo": 26758932,
937
+ "renal medulla": 468214,
938
+ "heart left ventricle": 908218,
939
+ "heart right ventricle": 1008903,
940
+ "middle temporal gyrus": 3335710,
941
+ "lamina propria of mucosa of colon": 8626,
942
+ "sigmoid colon": 7269,
943
+ "retina": 653872,
944
+ "body of stomach": 225939,
945
+ "submucosal esophageal gland": 6716,
946
+ "cardia of stomach": 27350,
947
+ "lower esophagus": 30615,
948
+ "esophagogastric junction": 13434,
949
+ "duodenum": 12222,
950
+ "breast": 5177404,
951
+ "caudate lobe of liver": 222664,
952
+ "inguinal fat pad": 53002,
953
+ "epididymal fat pad": 161432,
954
+ "periovarian fat pad": 24905,
955
+ "hippocampal formation": 1261479,
956
+ "spinal cord": 90537,
957
+ "tracheal epithelial cell": 91061,
958
+ "cortical layer VI": 141377,
959
+ "corpus callosum": 138982,
960
+ "cortical layer V": 101073,
961
+ "cortical layer II/III": 88877,
962
+ "striatum": 1276789,
963
+ "pia mater": 25346,
964
+ "olfactory region": 7353,
965
+ "brain ventricle": 3306,
966
+ "decidua basalis": 32543,
967
+ "spleen": 401655,
968
+ "prefrontal cortex": 2048128,
969
+ "temporal lobe": 1019468,
970
+ "primary somatosensory cortex": 701183,
971
+ "parietal cortex": 63124,
972
+ "secondary visual cortex": 29245,
973
+ "primary auditory cortex": 199760,
974
+ "autopod skin": 13687,
975
+ "macula lutea": 358464,
976
+ "macula lutea proper": 105599,
977
+ "gonad": 172997,
978
+ "primary visual cortex": 1374285,
979
+ "anterolateral visual area": 10288,
980
+ "anterior cingulate cortex": 400317,
981
+ "visual cortex": 193642,
982
+ "subicular complex": 55770,
983
+ "secondary somatosensory cortex": 2116,
984
+ "posterior parietal association areas": 28569,
985
+ "temporal cortex": 59645,
986
+ "agranular insular cortex": 54701,
987
+ "retrosplenial region": 1604,
988
+ "gustatory cortex": 1580,
989
+ "lateral entorhinal cortex": 1572,
990
+ "medial entorhinal cortex": 1112,
991
+ "medial orbital frontal cortex": 110134,
992
+ "auditory cortex": 73036,
993
+ "claustrum of brain": 4,
994
+ "amygdala": 142045,
995
+ "lateral amygdaloid nucleus": 18643,
996
+ "medial amygdaloid nucleus": 17295,
997
+ "apex of heart": 83168,
998
+ "interventricular septum": 392607,
999
+ "right cardiac atrium": 114445,
1000
+ "left cardiac atrium": 380823,
1001
+ "diencephalon": 238693,
1002
+ "pigment epithelium of eye": 956,
1003
+ "subcutaneous abdominal adipose tissue": 41136,
1004
+ "visceral abdominal adipose tissue": 35796,
1005
+ "posterior hypothalamic region": 74913,
1006
+ "intestine": 25127,
1007
+ "adnexa of uterus": 280223,
1008
+ "nose skin": 71320,
1009
+ "Brodmann (1909) area 25": 36225,
1010
+ "ileal epithelium": 8118,
1011
+ "forelimb": 500298,
1012
+ "head of caudate nucleus": 28424,
1013
+ "skin of forehead": 15247,
1014
+ "ganglionic eminence": 8108,
1015
+ "respiratory airway": 467047,
1016
+ "transition zone of prostate": 88242,
1017
+ "peripheral zone of prostate": 54097,
1018
+ "parotid gland": 178118,
1019
+ "upper lobe of left lung": 27792,
1020
+ "substantia nigra pars compacta": 314973,
1021
+ "renal pelvis": 95,
1022
+ "kidney blood vessel": 59,
1023
+ "epithelium of esophagus": 39735,
1024
+ "sinoatrial node": 66843,
1025
+ "rectum": 1208,
1026
+ "decidua": 185295,
1027
+ "inferior temporal gyrus": 99318,
1028
+ "bronchus": 47507,
1029
+ "mesenteric fat pad": 5461,
1030
+ "atrioventricular node": 729,
1031
+ "aorta": 12331,
1032
+ "renal papilla": 168923,
1033
+ "tendon of semitendinosus": 10413,
1034
+ "adipose tissue": 80482,
1035
+ "exocrine pancreas": 20609,
1036
+ "myometrium": 3014,
1037
+ "cardiac atrium": 7745,
1038
+ "subcutaneous adipose tissue": 121403,
1039
+ "mammary gland": 28594,
1040
+ "skin of chest": 4486,
1041
+ "muscle of abdomen": 1610,
1042
+ "sublingual gland": 418,
1043
+ "muscle of pelvic diaphragm": 14648,
1044
+ "sclera": 372,
1045
+ "cardiac ventricle": 1141,
1046
+ "skin of abdomen": 7387,
1047
+ "rectus abdominis muscle": 300,
1048
+ "trachea": 51774,
1049
+ "prostate gland": 116873,
1050
+ "coronary artery": 278,
1051
+ "posterior part of tongue": 242,
1052
+ "cornea": 23298,
1053
+ "endometrium": 117357,
1054
+ "bladder organ": 11030,
1055
+ "anterior part of tongue": 4346,
1056
+ "vasculature": 8863,
1057
+ "uterus": 34,
1058
+ "large intestine": 64353,
1059
+ "lacrimal gland": 4,
1060
+ "inguinal lymph node": 20664,
1061
+ "gingiva": 86814,
1062
+ "inguinal part of abdomen": 8437,
1063
+ "neural tube": 10299,
1064
+ "glabella skin": 4007,
1065
+ "neocortex": 6313087,
1066
+ "white matter": 1111027,
1067
+ "hindbrain": 567702,
1068
+ "pallidum": 121800,
1069
+ "cortical subplate": 342549,
1070
+ "ventricular system of brain": 167874,
1071
+ "alveolus of lung": 39661,
1072
+ "thoracic lymph node": 46753,
1073
+ "jejunal epithelium": 32845,
1074
+ "lamina propria": 36972,
1075
+ "brown preadipocyte": 7005,
1076
+ "brain white matter": 20212,
1077
+ "brain gray matter": 6670,
1078
+ "caudal ganglionic eminence": 49470,
1079
+ "medial ganglionic eminence": 28614,
1080
+ "parietal lobe": 20622,
1081
+ "orbitofrontal cortex": 6972,
1082
+ "skin of temple": 61595,
1083
+ "skin of cheek": 47600,
1084
+ "islet of Langerhans": 439308,
1085
+ "cervical spinal cord white matter": 35591,
1086
+ "white matter of cerebellum": 21181,
1087
+ "Brodmann (1909) area 4": 14097,
1088
+ "iris": 48066,
1089
+ "ascitic fluid": 93627,
1090
+ "omentum": 152695,
1091
+ "peritoneum": 55902,
1092
+ "abdomen": 18036,
1093
+ "right ovary": 17910,
1094
+ "lung parenchyma": 645194,
1095
+ "tonsil": 25382,
1096
+ "superior frontal gyrus": 71605,
1097
+ "bladder lumen": 12866,
1098
+ "heart": 51664,
1099
+ "fallopian tube": 47676,
1100
+ "lower lobe of left lung": 63958,
1101
+ "urethra": 77362,
1102
+ "placenta": 230912,
1103
+ "omental fat pad": 82178,
1104
+ "skin of forearm": 32892,
1105
+ "skin of pes": 32646,
1106
+ "inguinal region skin": 10056,
1107
+ "gonadal fat pad": 6539,
1108
+ "pancreas": 1411636,
1109
+ "tongue": 112725,
1110
+ "diaphragm": 4850,
1111
+ "limb muscle": 111326,
1112
+ "brown adipose tissue": 3957,
1113
+ "endothelial cell": 56790,
1114
+ "cerebellar vermis": 236767,
1115
+ "testis": 1219,
1116
+ "gonad primordium": 399,
1117
+ "mesoderm": 139,
+ "submucosa of ileum": 9015,
+ "submucosa of ascending colon": 6989,
+ "superior parietal cortex": 42614,
+ "cerebrocerebellum": 48747,
+ "perirhinal cortex": 21809,
+ "inferior parietal cortex": 19112,
+ "nucleus accumbens": 35731,
+ "Brodmann (1909) area 19": 11814,
+ "ventral lateral nucleus of thalamus": 23976,
+ "medial dorsal nucleus of thalamus": 10144,
+ "superior temporal sulcus": 5717,
+ "lateral geniculate body": 13543,
+ "medulla oblongata": 312,
+ "right frontal lobe": 90870,
+ "right parietal lobe": 52245,
+ "meningeal dura mater": 35748,
+ "dura mater": 702,
+ "brain meninx": 282,
+ "subdural space": 246,
+ "mesenteric lymph node": 67885,
+ "frontal cortex": 127978,
+ "choroid plexus": 33310,
+ "anterior hypothalamic region": 24475,
+ "gut wall": 170929,
+ "epithelial cell of alveolus of lung": 14952,
+ "cultured cell": 71512,
+ "zone of skin": 12787,
+ "ciliary body": 19320,
+ "muscle organ": 2410668,
+ "Brodmann (1909) area 23": 1424,
+ "esophagus": 14368,
+ "pyloric antrum": 11248,
+ "umbilical cord blood": 11302,
+ "outer medulla of kidney": 21560,
+ "inner medulla of kidney": 6904,
+ "brainstem": 92366,
+ "basal forebrain": 7885,
+ "perifoveal part of retina": 4422,
+ "endocrine pancreas": 2088,
+ "mesenteric artery": 6624,
+ "fimbria of uterine tube": 24914,
+ "ampulla of uterine tube": 30895,
+ "isthmus of fallopian tube": 33922,
+ "skin of back": 4402,
+ "skin of breast": 2732,
+ "nasopharynx": 24100,
+ "lamina propria of large intestine": 29204,
+ "lamina propria of small intestine": 62327,
+ "angular gyrus": 114354,
+ "pubis": 2762,
+ "descending colon": 1215,
+ "ascending colon": 2756,
+ "hepatic cecum": 768,
+ "mammary gland epithelial cell": 61935,
+ "caecum": 36317,
+ "left colon": 24992,
+ "right colon": 18368,
+ "upper outer quadrant of breast": 34306,
+ "skin epidermis": 7006,
+ "scalp": 3029,
+ "skin of external ear": 4239,
+ "lens of camera-type eye": 11296,
+ "skin of scalp": 48525,
+ "skin of trunk": 22552,
+ "transverse colon": 537,
+ "hepatic flexure of colon": 23,
+ "eye trabecular meshwork": 10728,
+ "corneo-scleral junction": 7491,
+ "epithelial cell of lung": 35350,
+ "retinal neural layer": 4316,
+ "chorioretinal region": 2512,
+ "cervical lymph node": 4355,
+ "putamen": 125801,
+ "nose": 188460,
+ "nasal cavity": 574,
+ "peripheral lymph node": 5625,
+ "left ovary": 2162,
+ "parietal peritoneum": 260,
+ "urinary bladder": 78,
+ "adrenal tissue": 9884,
+ "perirenal fat": 5334,
+ "vein": 1512,
+ "insular cortex": 295812,
+ "cingulate cortex": 280656,
+ "respiratory basal cell": 12211,
+ "duodeno-jejunal junction": 17920,
+ "Brodmann (1909) area 46": 34492,
+ "barrel cortex": 9775,
+ "caecum epithelium": 220,
+ "preadipocyte": 38642,
+ "trophoblast cell": 5247,
+ "retrosplenial granular cortex": 67319,
+ "frontal lobe": 34397,
+ "lateral visual area": 18939,
+ "upper leg skin": 2711,
+ "caudate nucleus": 24433,
+ "lateral nuclear group of thalamus": 12825,
+ "jejunum": 461,
+ "eye": 3814,
+ "dorsal thalamus": 65204,
+ "ventral thalamus": 14796,
+ "venous blood": 25296,
+ "bronchial epithelial cell": 59516
+ },
+ "sex": {
+ "female": 46072046,
+ "male": 61255034,
+ "unknown": 4409822
+ }
+ }
teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_cell_mapping.json ADDED
@@ -0,0 +1,862 @@
+ {
+ "oligodendrocyte": "neural_cell",
+ "neuron": "neural_cell",
+ "astrocyte": "neural_cell",
+ "oligodendrocyte precursor cell": "neural_cell",
+ "microglial cell": "immune_cell",
+ "endothelial cell": "epithelial_cell",
+ "extravillous trophoblast": "embryonic_cell",
+ "placental villous trophoblast": "embryonic_cell",
+ "syncytiotrophoblast cell": "embryonic_cell",
+ "skin fibroblast": "connective_cell",
+ "T cell": "hematopoietic_cell",
+ "enterocyte": "ciliated_cell",
+ "endothelial cell of lymphatic vessel": "epithelial_cell",
+ "fibroblast": "connective_cell",
+ "blood vessel endothelial cell": "epithelial_cell",
+ "B cell": "hematopoietic_cell",
+ "enteroendocrine cell": "secretory_cell",
+ "macrophage": "immune_cell",
+ "dendritic cell": "immune_cell",
+ "vascular leptomeningeal cell": "connective_cell",
+ "retina horizontal cell": "neural_cell",
+ "natural killer cell": "hematopoietic_cell",
+ "large pre-B-II cell": "hematopoietic_cell",
+ "small pre-B-II cell": "hematopoietic_cell",
+ "double negative thymocyte": "hematopoietic_cell",
+ "pro-B cell": "precursor_cell",
+ "group 3 innate lymphoid cell": "hematopoietic_cell",
+ "late pro-B cell": "precursor_cell",
+ "fraction A pre-pro B cell": "hematopoietic_cell",
+ "B-2 B cell": "hematopoietic_cell",
+ "unknown": "unknown",
+ "early lymphoid progenitor": "precursor_cell",
+ "double-positive, alpha-beta thymocyte": "hematopoietic_cell",
+ "hematopoietic stem cell": "hematopoietic_cell",
+ "naive thymus-derived CD4-positive, alpha-beta T cell": "hematopoietic_cell",
+ "hematopoietic multipotent progenitor cell": "precursor_cell",
+ "B-1 B cell": "hematopoietic_cell",
+ "naive thymus-derived CD8-positive, alpha-beta T cell": "hematopoietic_cell",
+ "megakaryocyte-erythroid progenitor cell": "precursor_cell",
+ "regulatory T cell": "hematopoietic_cell",
+ "mature B cell": "hematopoietic_cell",
+ "group 2 innate lymphoid cell": "hematopoietic_cell",
+ "innate lymphoid cell": "hematopoietic_cell",
+ "immature B cell": "hematopoietic_cell",
+ "common myeloid progenitor": "precursor_cell",
+ "CD8-alpha-alpha-positive, alpha-beta intraepithelial T cell": "hematopoietic_cell",
+ "granulocyte monocyte progenitor cell": "precursor_cell",
+ "plasma cell": "hematopoietic_cell",
+ "kidney proximal convoluted tubule epithelial cell": "ciliated_cell",
+ "leukocyte": "hematopoietic_cell",
+ "kidney loop of Henle thick ascending limb epithelial cell": "epithelial_cell",
+ "kidney distal convoluted tubule epithelial cell": "epithelial_cell",
+ "kidney interstitial fibroblast": "connective_cell",
+ "blood vessel smooth muscle cell": "contractile_cell",
+ "kidney collecting duct principal cell": "epithelial_cell",
+ "kidney collecting duct intercalated cell": "epithelial_cell",
+ "podocyte": "epithelial_cell",
+ "mesangial cell": "connective_cell",
+ "kidney granular cell": "contractile_cell",
+ "macula densa epithelial cell": "epithelial_cell",
+ "muscle cell": "contractile_cell",
+ "fibroblast of dermis": "connective_cell",
+ "tendon cell": "connective_cell",
+ "Schwann cell": "neural_cell",
+ "chondrocyte": "connective_cell",
+ "smooth muscle cell": "contractile_cell",
+ "endothelial cell of artery": "epithelial_cell",
+ "reticulocyte": "hematopoietic_cell",
+ "vein endothelial cell": "epithelial_cell",
+ "pericyte": "perivascular_cell",
+ "peridermal cell": "epithelial_cell",
+ "basal cell": "epithelial_cell",
+ "articular chondrocyte": "connective_cell",
+ "mesenchymal cell": "connective_cell",
+ "connective tissue cell": "connective_cell",
+ "erythrocyte": "hematopoietic_cell",
+ "hypertrophic chondrocyte": "connective_cell",
+ "megakaryocyte": "hematopoietic_cell",
+ "muscle fibroblast": "skeletal_muscle",
+ "mature NK T cell": "hematopoietic_cell",
+ "myeloid cell": "immune_cell",
+ "kidney interstitial cell": "connective_cell",
+ "epithelial cell of nephron": "epithelial_cell",
+ "mesenchymal stem cell": "connective_cell",
+ "epithelial cell of proximal tubule": "ciliated_cell",
+ "kidney connecting tubule epithelial cell": "epithelial_cell",
+ "epithelial cell of glomerular capsule": "epithelial_cell",
+ "nephron tubule epithelial cell": "epithelial_cell",
+ "kidney collecting duct cell": "epithelial_cell",
+ "stromal cell of ovary": "connective_cell",
+ "granulosa cell": "epithelial_cell",
+ "theca cell": "connective_cell",
+ "epithelial cell": "epithelial_cell",
+ "epithelial cell of alveolus of lung": "epithelial_cell",
+ "goblet cell": "epithelial_cell",
+ "ionocyte": "epithelial_cell",
+ "hepatocyte": "epithelial_cell",
+ "ciliated epithelial cell": "ciliated_cell",
+ "neuroendocrine cell": "secretory_cell",
+ "club cell": "precursor_cell",
+ "brush cell": "epithelial_cell",
+ "platelet": "hematopoietic_cell",
+ "central nervous system macrophage": "immune_cell",
+ "ependymal cell": "ciliated_cell",
+ "vascular associated smooth muscle cell": "contractile_cell",
+ "mesothelial cell": "epithelial_cell",
+ "neutrophil": "immune_cell",
+ "monocyte": "precursor_cell",
+ "stromal cell": "connective_cell",
+ "cord blood hematopoietic stem cell": "hematopoietic_cell",
+ "mast cell": "hematopoietic_cell",
+ "professional antigen presenting cell": "hematopoietic_cell",
+ "erythroid lineage cell": "hematopoietic_cell",
+ "primordial germ cell": "unknown",
+ "alternatively activated macrophage": "immune_cell",
+ "L2/3-6 intratelencephalic projecting glutamatergic neuron": "neural_cell",
+ "pvalb GABAergic cortical interneuron": "neural_cell",
+ "chandelier pvalb GABAergic cortical interneuron": "neural_cell",
+ "sst GABAergic cortical interneuron": "neural_cell",
+ "Bergmann glial cell": "neural_cell",
+ "glutamatergic neuron": "neural_cell",
+ "transit amplifying cell of colon": "epithelial_cell",
+ "CD8-alpha-beta-positive, alpha-beta intraepithelial T cell": "hematopoietic_cell",
+ "intestinal crypt stem cell": "epithelial_cell",
+ "intestinal tuft cell": "epithelial_cell",
+ "enteric smooth muscle cell": "contractile_cell",
+ "smooth muscle cell of large intestine": "contractile_cell",
+ "interstitial cell of Cajal": "epithelial_cell",
+ "smooth muscle cell of small intestine": "contractile_cell",
+ "cardiac valve cell": "embryonic_cell",
+ "primitive red blood cell": "hematopoietic_cell",
+ "neurectodermal cell": "embryonic_cell",
+ "midbrain dopaminergic neuron": "neural_cell",
+ "paraxial cell": "embryonic_cell",
+ "mesodermal cell": "embryonic_cell",
+ "splanchnic mesodermal cell": "embryonic_cell",
+ "neuroplacodal cell": "embryonic_cell",
+ "premigratory neural crest cell": "embryonic_cell",
+ "notochordal cell": "epithelial_cell",
+ "hemangioblast": "embryonic_cell",
+ "spinal cord interneuron": "neural_cell",
+ "endodermal cell": "unknown",
+ "surface ectodermal cell": "embryonic_cell",
+ "gut endothelial cell": "epithelial_cell",
+ "anterior visceral endoderm cell": "embryonic_cell",
+ "activated CD4-negative, CD8-negative type I NK T cell": "hematopoietic_cell",
+ "parietal epithelial cell": "epithelial_cell",
+ "kidney loop of Henle epithelial cell": "epithelial_cell",
+ "kidney loop of Henle thin descending limb epithelial cell": "epithelial_cell",
+ "malignant cell": "unknown",
+ "exhausted T cell": "hematopoietic_cell",
+ "CD4-positive helper T cell": "hematopoietic_cell",
+ "CD8-positive, alpha-beta T cell": "hematopoietic_cell",
+ "promonocyte": "precursor_cell",
+ "granulocyte": "hematopoietic_cell",
+ "osteoclast": "unknown",
+ "promyelocyte": "precursor_cell",
+ "Kupffer cell": "immune_cell",
+ "pre-conventional dendritic cell": "immune_cell",
+ "myelocyte": "precursor_cell",
+ "plasmacytoid dendritic cell": "immune_cell",
+ "common dendritic progenitor": "precursor_cell",
+ "mural cell": "perivascular_cell",
+ "myofibroblast cell": "connective_cell",
+ "glial cell": "neural_cell",
+ "lymphocyte": "hematopoietic_cell",
+ "retinal ganglion cell": "neural_cell",
+ "lamp5 GABAergic cortical interneuron": "neural_cell",
+ "luminal epithelial cell of mammary gland": "epithelial_cell",
+ "endothelial cell of vascular tree": "epithelial_cell",
+ "mammary gland epithelial cell": "epithelial_cell",
+ "adipocyte of breast": "connective_cell",
+ "IgA plasma cell": "hematopoietic_cell",
+ "class switched memory B cell": "hematopoietic_cell",
+ "naive B cell": "hematopoietic_cell",
+ "IgG plasma cell": "hematopoietic_cell",
+ "unswitched memory B cell": "hematopoietic_cell",
+ "centrilobular region hepatocyte": "epithelial_cell",
+ "periportal region hepatocyte": "epithelial_cell",
+ "blood cell": "hematopoietic_cell",
+ "tracheal epithelial cell": "epithelial_cell",
+ "medium spiny neuron": "neural_cell",
+ "inhibitory interneuron": "neural_cell",
+ "cell": "unknown",
+ "uterine smooth muscle cell": "contractile_cell",
+ "decidual natural killer cell, human": "connective_cell",
+ "endothelial cell of uterus": "epithelial_cell",
+ "trophoblast giant cell": "embryonic_cell",
+ "embryonic fibroblast": "connective_cell",
+ "cardiac endothelial cell": "epithelial_cell",
+ "fibroblast of cardiac tissue": "connective_cell",
+ "immature innate lymphoid cell": "hematopoietic_cell",
+ "cardiac muscle myoblast": "precursor_cell",
+ "lymphoid lineage restricted progenitor cell": "precursor_cell",
+ "smooth muscle myoblast": "precursor_cell",
+ "neuronal receptor cell": "neural_cell",
+ "fibroblast of lymphatic vessel": "connective_cell",
+ "flat midget bipolar cell": "neural_cell",
+ "classical monocyte": "precursor_cell",
+ "conventional dendritic cell": "immune_cell",
+ "CD14-positive monocyte": "precursor_cell",
+ "effector memory CD8-positive, alpha-beta T cell": "hematopoietic_cell",
+ "CD14-positive, CD16-positive monocyte": "precursor_cell",
+ "central memory CD4-positive, alpha-beta T cell": "hematopoietic_cell",
+ "CD56-positive, CD161-positive immature natural killer cell, human": "hematopoietic_cell",
+ "CD16-positive, CD56-dim natural killer cell, human": "hematopoietic_cell",
+ "CD8-positive, alpha-beta cytotoxic T cell": "hematopoietic_cell",
+ "supporting cell": "unknown",
+ "interstitial cell of ovary": "connective_cell",
+ "hematopoietic cell": "hematopoietic_cell",
+ "neural cell": "neural_cell",
+ "germ cell": "unknown",
+ "ovarian surface epithelial cell": "epithelial_cell",
+ "L4/5 intratelencephalic projecting glutamatergic neuron of the primary motor cortex": "neural_cell",
+ "L6 corticothalamic-projecting glutamatergic cortical neuron": "neural_cell",
+ "vip GABAergic cortical interneuron": "neural_cell",
+ "L6 intratelencephalic projecting glutamatergic neuron of the primary motor cortex": "neural_cell",
+ "hippocampal neuron": "neural_cell",
+ "L6b glutamatergic cortical neuron": "neural_cell",
+ "L5/6 near-projecting glutamatergic neuron of the primary motor cortex": "neural_cell",
+ "L5 extratelencephalic projecting glutamatergic cortical neuron": "neural_cell",
+ "pyramidal neuron": "neural_cell",
+ "sncg GABAergic cortical interneuron": "neural_cell",
+ "corticothalamic-projecting glutamatergic cortical neuron": "neural_cell",
+ "L2/3 intratelencephalic projecting glutamatergic neuron of the primary motor cortex": "neural_cell",
+ "sst chodl GABAergic cortical interneuron": "neural_cell",
+ "cortical interneuron": "neural_cell",
+ "vascular leptomeningeal cell (Mmus)": "connective_cell",
+ "meis2 expressing cortical GABAergic cell": "secretory_cell",
+ "Cajal-Retzius cell": "neural_cell",
+ "fibroblast of lung": "connective_cell",
+ "type I pneumocyte": "epithelial_cell",
+ "type II pneumocyte": "epithelial_cell",
+ "gut absorptive cell": "epithelial_cell",
+ "progenitor cell": "precursor_cell",
+ "intestinal crypt stem cell of large intestine": "precursor_cell",
+ "transit amplifying cell of small intestine": "epithelial_cell",
+ "intestinal crypt stem cell of small intestine": "precursor_cell",
+ "secretory cell": "secretory_cell",
+ "intestine goblet cell": "epithelial_cell",
+ "enterocyte of epithelium of large intestine": "ciliated_cell",
+ "paneth cell of epithelium of small intestine": "secretory_cell",
+ "intestinal enteroendocrine cell": "secretory_cell",
+ "duodenum glandular cell": "secretory_cell",
+ "large intestine goblet cell": "epithelial_cell",
+ "T follicular helper cell": "hematopoietic_cell",
+ "GABAergic neuron": "neural_cell",
+ "fibroblast of mammary gland": "connective_cell",
+ "perivascular cell": "epithelial_cell",
+ "luminal adaptive secretory precursor cell of mammary gland": "epithelial_cell",
+ "endothelial tip cell": "epithelial_cell",
+ "CD8-positive, alpha-beta memory T cell": "hematopoietic_cell",
+ "luminal hormone-sensing cell of mammary gland": "epithelial_cell",
+ "myoepithelial cell of mammary gland": "contractile_cell",
+ "capillary endothelial cell": "epithelial_cell",
+ "brain vascular cell": "neural_cell",
+ "dopaminergic neuron": "neural_cell",
+ "serotonergic neuron": "neural_cell",
+ "cerebellar neuron": "neural_cell",
+ "neural progenitor cell": "neural_cell",
+ "CD4-positive, alpha-beta T cell": "hematopoietic_cell",
+ "glycinergic amacrine cell": "neural_cell",
+ "starburst amacrine cell": "neural_cell",
+ "retinal rod cell": "neural_cell",
+ "Mueller cell": "neural_cell",
+ "rod bipolar cell": "neural_cell",
+ "ON-bipolar cell": "neural_cell",
+ "OFF-bipolar cell": "neural_cell",
+ "retinal cone cell": "neural_cell",
+ "amacrine cell": "neural_cell",
+ "melanocyte": "secretory_cell",
+ "retinal pigment epithelial cell": "epithelial_cell",
+ "adipocyte": "embryonic_cell",
+ "fibro/adipogenic progenitor cell": "precursor_cell",
+ "neuron associated cell": "neural_cell",
+ "inhibitory motor neuron": "neural_cell",
+ "motor neuron": "neural_cell",
+ "precursor B cell": "hematopoietic_cell",
+ "interneuron": "neural_cell",
+ "fallopian tube secretory epithelial cell": "epithelial_cell",
+ "suprabasal keratinocyte": "epithelial_cell",
+ "basal cell of epidermis": "epithelial_cell",
+ "proerythroblast": "hematopoietic_cell",
+ "kidney loop of Henle ascending limb epithelial cell": "epithelial_cell",
+ "collagen secreting cell": "connective_cell",
+ "epithelial cell of proximal tubule segment 1": "ciliated_cell",
+ "MHC-II-positive classical monocyte": "precursor_cell",
+ "naive T cell": "hematopoietic_cell",
+ "chondroblast": "connective_cell",
+ "osteoblast": "connective_cell",
+ "myoblast": "precursor_cell",
+ "skeletal muscle myoblast": "skeletal_muscle",
+ "Schwann cell precursor": "neural_cell",
+ "keratinocyte": "epithelial_cell",
+ "inflammatory macrophage": "immune_cell",
+ "monocyte-derived dendritic cell": "immune_cell",
+ "Langerhans cell": "immune_cell",
+ "cytotoxic T cell": "hematopoietic_cell",
+ "forebrain neuroblast": "neural_cell",
+ "chandelier cell": "neural_cell",
+ "caudal ganglionic eminence derived GABAergic cortical interneuron": "neural_cell",
+ "basal cell of prostate epithelium": "epithelial_cell",
+ "epithelial cell of urethra": "epithelial_cell",
+ "luminal cell of prostate epithelium": "epithelial_cell",
+ "prostate gland microvascular endothelial cell": "epithelial_cell",
+ "prostate stromal cell": "connective_cell",
+ "smooth muscle cell of prostate": "contractile_cell",
+ "lymphocyte of B lineage": "hematopoietic_cell",
+ "smooth muscle cell of the pulmonary artery": "contractile_cell",
+ "acinar cell of salivary gland": "epithelial_cell",
+ "memory B cell": "hematopoietic_cell",
+ "adventitial cell": "connective_cell",
+ "duct epithelial cell": "epithelial_cell",
+ "endothelial cell of hepatic sinusoid": "epithelial_cell",
+ "non-classical monocyte": "precursor_cell",
+ "plasmablast": "hematopoietic_cell",
+ "glomerular endothelial cell": "epithelial_cell",
+ "renal intercalated cell": "epithelial_cell",
+ "vasa recta ascending limb cell": "epithelial_cell",
+ "vasa recta descending limb cell": "epithelial_cell",
+ "kidney epithelial cell": "epithelial_cell",
+ "renal beta-intercalated cell": "epithelial_cell",
+ "renal alpha-intercalated cell": "epithelial_cell",
+ "urothelial cell": "epithelial_cell",
+ "renal principal cell": "epithelial_cell",
+ "cell of skeletal muscle": "skeletal_muscle",
+ "thymocyte": "hematopoietic_cell",
+ "pro-T cell": "precursor_cell",
+ "hematopoietic precursor cell": "hematopoietic_cell",
+ "stem cell": "precursor_cell",
+ "paneth cell": "secretory_cell",
+ "type L enteroendocrine cell": "secretory_cell",
+ "type EC enteroendocrine cell": "secretory_cell",
+ "hepatic stellate cell": "connective_cell",
+ "cholangiocyte": "epithelial_cell",
+ "endothelial cell of periportal hepatic sinusoid": "epithelial_cell",
+ "endothelial cell of pericentral hepatic sinusoid": "epithelial_cell",
+ "alveolar macrophage": "immune_cell",
+ "effector memory CD4-positive, alpha-beta T cell": "hematopoietic_cell",
+ "myeloid leukocyte": "hematopoietic_cell",
+ "CD1c-positive myeloid dendritic cell": "immune_cell",
+ "myeloid dendritic cell, human": "immune_cell",
+ "stratified epithelial cell": "epithelial_cell",
+ "epithelial cell of stratum germinativum of esophagus": "epithelial_cell",
+ "mononuclear phagocyte": "immune_cell",
+ "mucus secreting cell": "secretory_cell",
+ "regular atrial cardiac myocyte": "contractile_cell",
+ "Tc1 cell": "hematopoietic_cell",
+ "endothelial cell of placenta": "epithelial_cell",
+ "Hofbauer cell": "immune_cell",
+ "group 3 innate lymphoid cell, human": "hematopoietic_cell",
+ "kidney collecting duct epithelial cell": "epithelial_cell",
+ "fenestrated cell": "epithelial_cell",
+ "early T lineage precursor": "hematopoietic_cell",
+ "CD4-positive, alpha-beta memory T cell": "hematopoietic_cell",
+ "erythroid progenitor cell": "precursor_cell",
+ "central memory CD8-positive, alpha-beta T cell": "hematopoietic_cell",
+ "gamma-delta T cell": "hematopoietic_cell",
+ "early promyelocyte": "precursor_cell",
+ "CD16-negative, CD56-bright natural killer cell, human": "hematopoietic_cell",
+ "megakaryocyte progenitor cell": "precursor_cell",
+ "late promyelocyte": "precursor_cell",
+ "basophil mast progenitor cell": "precursor_cell",
+ "CD4-positive, alpha-beta cytotoxic T cell": "hematopoietic_cell",
+ "airway submucosal gland duct basal cell": "epithelial_cell",
+ "serous secreting cell of bronchus submucosal gland": "epithelial_cell",
+ "ciliated cell": "ciliated_cell",
+ "lung secretory cell": "secretory_cell",
+ "myoepithelial cell": "contractile_cell",
+ "lung macrophage": "immune_cell",
+ "mesenchymal stem cell of adipose tissue": "precursor_cell",
+ "regular ventricular cardiac myocyte": "contractile_cell",
+ "choroid plexus epithelial cell": "epithelial_cell",
+ "aortic endothelial cell": "epithelial_cell",
+ "fibrocyte": "connective_cell",
+ "kidney loop of Henle thin ascending limb epithelial cell": "epithelial_cell",
+ "kidney interstitial alternatively activated macrophage": "immune_cell",
+ "renal interstitial pericyte": "perivascular_cell",
+ "papillary tips cell": "unknown",
+ "fast muscle cell": "skeletal_muscle",
+ "skeletal muscle fiber": "skeletal_muscle",
+ "slow muscle cell": "skeletal_muscle",
+ "skeletal muscle satellite cell": "skeletal_muscle",
+ "retinal blood vessel endothelial cell": "epithelial_cell",
+ "non-myelinating Schwann cell": "neural_cell",
+ "lung perichondrial fibroblast": "connective_cell",
+ "respiratory suprabasal cell": "epithelial_cell",
+ "lung pericyte": "perivascular_cell",
+ "memory T cell": "hematopoietic_cell",
+ "leptomeningeal cell": "connective_cell",
+ "Sertoli cell": "secretory_cell",
+ "macroglial cell": "neural_cell",
+ "retinal bipolar neuron": "neural_cell",
+ "cerebellar granule cell": "neural_cell",
+ "intermediate monocyte": "precursor_cell",
+ "erythroblast": "hematopoietic_cell",
+ "midzonal region hepatocyte": "epithelial_cell",
+ "endothelial cell of venule": "epithelial_cell",
+ "helper T cell": "hematopoietic_cell",
+ "mucosal invariant T cell": "hematopoietic_cell",
+ "T-helper 17 cell": "hematopoietic_cell",
+ "olfactory epithelial cell": "epithelial_cell",
+ "auditory epithelial cell": "epithelial_cell",
+ "endo-epithelial cell": "epithelial_cell",
+ "epithelial cell of amnion": "epithelial_cell",
+ "intermediate mesodermal cell": "embryonic_cell",
+ "ectodermal cell": "embryonic_cell",
+ "metanephric mesenchyme stem cell": "precursor_cell",
+ "ureteric bud cell": "epithelial_cell",
+ "pituitary gland cell": "neural_cell",
+ "pancreatic acinar cell": "epithelial_cell",
+ "lens epithelial cell": "epithelial_cell",
+ "epithelial cell of parathyroid gland": "epithelial_cell",
+ "epithelial cell of thymus": "epithelial_cell",
+ "intrahepatic cholangiocyte": "epithelial_cell",
+ "epithelial cell of thyroid gland": "epithelial_cell",
+ "peripheral nervous system neuron": "neural_cell",
+ "neural crest cell": "embryonic_cell",
+ "sensory neuron": "neural_cell",
+ "cerebral cortex endothelial cell": "epithelial_cell",
+ "microvascular endothelial cell": "epithelial_cell",
+ "brain pericyte": "perivascular_cell",
+ "endocardial cell": "epithelial_cell",
+ "adipocyte of epicardial fat of left ventricle": "connective_cell",
+ "CD14-low, CD16-positive monocyte": "precursor_cell",
+ "DN4 thymocyte": "hematopoietic_cell",
+ "pancreatic stellate cell": "connective_cell",
+ "pancreatic ductal cell": "epithelial_cell",
+ "type B pancreatic cell": "immune_cell",
+ "CD8-positive, alpha-beta memory T cell, CD45RO-positive": "hematopoietic_cell",
+ "alpha-beta T cell": "hematopoietic_cell",
+ "effector memory CD8-positive, alpha-beta T cell, terminally differentiated": "hematopoietic_cell",
+ "brown preadipocyte": "connective_cell",
+ "brown adipocyte": "connective_cell",
+ "lung ciliated cell": "ciliated_cell",
+ "effector CD8-positive, alpha-beta T cell": "hematopoietic_cell",
+ "T-helper 22 cell": "hematopoietic_cell",
+ "myeloid dendritic cell": "immune_cell",
+ "dendritic cell, human": "immune_cell",
+ "erythroid progenitor cell, mammalian": "precursor_cell",
+ "ILC1, human": "hematopoietic_cell",
+ "CD34-positive, CD38-negative hematopoietic stem cell": "precursor_cell",
+ "IgM plasma cell": "hematopoietic_cell",
+ "T-helper 1 cell": "hematopoietic_cell",
+ "group 2 innate lymphoid cell, human": "hematopoietic_cell",
+ "myeloid lineage restricted progenitor cell": "precursor_cell",
+ "T-helper 2 cell": "hematopoietic_cell",
+ "astrocyte of the cerebral cortex": "neural_cell",
+ "near-projecting glutamatergic cortical neuron": "neural_cell",
+ "effector CD4-positive, alpha-beta T cell": "hematopoietic_cell",
+ "type I NK T cell": "hematopoietic_cell",
+ "CD141-positive myeloid dendritic cell": "immune_cell",
+ "mature conventional dendritic cell": "immune_cell",
+ "melanocyte of skin": "secretory_cell",
+ "pancreatic A cell": "epithelial_cell",
+ "pancreatic D cell": "epithelial_cell",
+ "pancreatic PP cell": "epithelial_cell",
+ "CD14-positive, CD16-negative classical monocyte": "precursor_cell",
+ "CD4-positive, CD25-positive, alpha-beta regulatory T cell": "hematopoietic_cell",
+ "kidney connecting tubule principal cell": "epithelial_cell",
+ "epithelial cell of large intestine": "epithelial_cell",
+ "Purkinje cell": "neural_cell",
+ "granule cell": "neural_cell",
+ "neuron associated cell (sensu Vertebrata)": "neural_cell",
+ "stellate neuron": "neural_cell",
+ "neuronal brush cell": "epithelial_cell",
+ "myotube": "contractile_cell",
+ "muscle precursor cell": "precursor_cell",
+ "transitional stage B cell": "hematopoietic_cell",
+ "immature neutrophil": "immune_cell",
+ "medial ganglionic eminence derived interneuron": "neural_cell",
+ "caudal ganglionic eminence derived interneuron": "neural_cell",
+ "bronchus fibroblast of lung": "connective_cell",
+ "pigmented epithelial cell": "epithelial_cell",
+ "smooth muscle cell of sphincter of pupil": "contractile_cell",
+ "IgG plasmablast": "hematopoietic_cell",
+ "IgA plasmablast": "hematopoietic_cell",
+ "plasmatocyte": "immune_cell",
+ "kidney cortex artery cell": "epithelial_cell",
+ "kidney capillary endothelial cell": "epithelial_cell",
+ "kidney proximal straight tubule epithelial cell": "ciliated_cell",
+ "cardiac muscle cell": "contractile_cell",
+ "mesothelial cell of epicardium": "epithelial_cell",
+ "fetal cardiomyocyte": "contractile_cell",
+ "cardiac mesenchymal cell": "embryonic_cell",
+ "pneumocyte": "epithelial_cell",
+ "mononuclear cell": "hematopoietic_cell",
+ "tonsil germinal center B cell": "hematopoietic_cell",
+ "centroblast": "hematopoietic_cell",
+ "centrocyte": "hematopoietic_cell",
+ "macrophage dendritic cell progenitor": "precursor_cell",
+ "immature NK T cell": "hematopoietic_cell",
+ "neuroblast (sensu Vertebrata)": "neural_cell",
+ "alveolar type 2 fibroblast cell": "connective_cell",
+ "tracheobronchial smooth muscle cell": "contractile_cell",
+ "lung goblet cell": "epithelial_cell",
+ "respiratory basal cell": "epithelial_cell",
+ "brush cell of trachebronchial tree": "epithelial_cell",
+ "mesothelial fibroblast": "connective_cell",
+ "bladder urothelial cell": "unknown",
+ "bladder cell": "unknown",
+ "neoplastic cell": "epithelial_cell",
+ "endothelial cell of coronary artery": "epithelial_cell",
+ "cardiac neuron": "neural_cell",
+ "OFF retinal ganglion cell": "neural_cell",
+ "ON retinal ganglion cell": "neural_cell",
+ "lung resident memory CD8-positive, alpha-beta T cell": "hematopoietic_cell",
+ "lung resident memory CD4-positive, alpha-beta T cell": "hematopoietic_cell",
+ "deuterosomal cell": "epithelial_cell",
+ "granulocytopoietic cell": "hematopoietic_cell",
+ "basophil": "hematopoietic_cell",
+ "PP cell": "epithelial_cell",
+ "pancreatic epsilon cell": "epithelial_cell",
+ "fibroblast of connective tissue of prostate": "connective_cell",
+ "double negative T regulatory cell": "immune_cell",
+ "progenitor cell of mammary luminal epithelium": "precursor_cell",
+ "lactocyte": "secretory_cell",
+ "vascular lymphangioblast": "immune_cell",
+ "lung endothelial cell": "epithelial_cell",
+ "respiratory goblet cell": "epithelial_cell",
+ "cardiac pacemaker cell of sinoatrial node": "contractile_cell",
+ "activated CD4-positive, alpha-beta T cell": "hematopoietic_cell",
+ "differentiation-committed oligodendrocyte precursor": "neural_cell",
+ "glycinergic neuron": "neural_cell",
+ "keratinocyte stem cell": "precursor_cell",
+ "bronchial smooth muscle cell": "contractile_cell",
+ "epidermal cell": "epithelial_cell",
+ "basal epithelial cell of tracheobronchial tree": "epithelial_cell",
+ "neural stem cell": "precursor_cell",
+ "mature alpha-beta T cell": "hematopoietic_cell",
+ "brush cell of epithelium proper of large intestine": "epithelial_cell",
+ "smooth muscle cell of trachea": "contractile_cell",
+ "ciliated columnar cell of tracheobronchial tree": "ciliated_cell",
+ "early pro-B cell": "precursor_cell",
+ "pulmonary interstitial fibroblast": "connective_cell",
+ "neuroepithelial stem cell": "precursor_cell",
+ "lung neuroendocrine cell": "secretory_cell",
+ "common lymphoid progenitor": "precursor_cell",
+ "plasmacytoid dendritic cell, human": "immune_cell",
+ "activated CD4-positive, alpha-beta T cell, human": "hematopoietic_cell",
+ "lateral mesodermal cell": "embryonic_cell",
+ "hypothalamus cell": "neural_cell",
+ "primitive erythroid progenitor": "precursor_cell",
+ "retinal progenitor cell": "precursor_cell",
+ "spinal cord motor neuron": "neural_cell",
+ "cranial motor neuron": "neural_cell",
+ "enteric neuron": "neural_cell",
+ "spiral ganglion neuron": "neural_cell",
+ "cerebral cortex GABAergic interneuron": "neural_cell",
+ "embryonic blood vessel endothelial progenitor cell": "neural_cell",
+ "sympathetic neuron": "neural_cell",
+ "olfactory receptor cell": "neural_cell",
+ "extraembryonic cell": "embryonic_cell",
+ "fibroblast of breast": "connective_cell",
+ "endothelial cell of umbilical vein": "epithelial_cell",
+ "transit amplifying cell": "unknown",
+ "M cell of gut": "epithelial_cell",
+ "hypendymal cell": "epithelial_cell",
+ "oogonial cell": "unknown",
+ "female germ cell": "unknown",
+ "male germ cell": "unknown",
+ "oocyte": "unknown",
+ "basket cell": "secretory_cell",
+ "epithelial cell of prostate": "epithelial_cell",
+ "basal epithelial cell of prostatic duct": "epithelial_cell",
+ "contractile cell": "contractile_cell",
+ "mature T cell": "hematopoietic_cell",
+ "eosinophil": "hematopoietic_cell",
+ "corneal epithelial cell": "epithelial_cell",
+ "corneal endothelial cell": "epithelial_cell",
572
+ "activated CD8-positive, alpha-beta T cell": "hematopoietic_cell",
573
+ "follicular B cell": "hematopoietic_cell",
574
+ "colon macrophage": "immune_cell",
575
+ "myelinating Schwann cell": "neural_cell",
576
+ "cell in vitro": "unknown",
577
+ "S cone cell": "neural_cell",
578
+ "lung interstitial macrophage": "connective_cell",
579
+ "Leydig cell": "unknown",
580
+ "L2/3 intratelencephalic projecting glutamatergic neuron": "neural_cell",
581
+ "enterocyte of colon": "ciliated_cell",
582
+ "mesenchymal lymphangioblast": "precursor_cell",
583
+ "colon epithelial cell": "epithelial_cell",
584
+ "CD34-positive, CD56-positive, CD117-positive common innate lymphoid precursor, human": "precursor_cell",
585
+ "NKp44-positive group 3 innate lymphoid cell, human": "hematopoietic_cell",
586
+ "NKp44-negative group 3 innate lymphoid cell, human": "hematopoietic_cell",
587
+ "primary sensory neuron (sensu Teleostei)": "neural_cell",
588
+ "type N enteroendocrine cell": "secretory_cell",
589
+ "progenitor cell of endocrine pancreas": "precursor_cell",
590
+ "CD4-positive, alpha-beta thymocyte": "hematopoietic_cell",
591
+ "fibroblast of connective tissue of nonglandular part of prostate": "connective_cell",
592
+ "fibroblast of connective tissue of glandular part of prostate": "connective_cell",
593
+ "CD8-positive, alpha-beta thymocyte": "hematopoietic_cell",
594
+ "enucleate erythrocyte": "hematopoietic_cell",
595
+ "lung microvascular endothelial cell": "epithelial_cell",
596
+ "serous cell of epithelium of bronchus": "secretory_cell",
597
+ "pulmonary ionocyte": "epithelial_cell",
598
+ "epithelial cell of pancreas": "epithelial_cell",
599
+ "cultured cell": "unknown",
600
+ "reticular cell": "unknown",
601
+ "inflammatory cell": "immune_cell",
602
+ "stem cell of epidermis": "precursor_cell",
603
+ "pigmented ciliary epithelial cell": "epithelial_cell",
604
+ "non-pigmented ciliary epithelial cell": "epithelial_cell",
605
+ "ciliary muscle cell": "contractile_cell",
606
+ "acinar cell": "epithelial_cell",
607
+ "endocrine cell": "secretory_cell",
608
+ "non-terminally differentiated cell": "unknown",
609
+ "pre-natural killer cell": "hematopoietic_cell",
610
+ "midget ganglion cell of retina": "neural_cell",
611
+ "GABAergic amacrine cell": "neural_cell",
612
+ "diffuse bipolar 3b cell": "neural_cell",
613
+ "diffuse bipolar 2 cell": "neural_cell",
614
+ "ON parasol ganglion cell": "neural_cell",
615
+ "diffuse bipolar 1 cell": "neural_cell",
616
+ "invaginating midget bipolar cell": "neural_cell",
617
+ "diffuse bipolar 3a cell": "neural_cell",
618
+ "H2 horizontal cell": "neural_cell",
619
+ "OFFx cell": "neural_cell",
620
+ "H1 horizontal cell": "neural_cell",
621
+ "diffuse bipolar 4 cell": "neural_cell",
622
+ "diffuse bipolar 6 cell": "neural_cell",
623
+ "OFF parasol ganglion cell": "neural_cell",
624
+ "hepatic pit cell": "hematopoietic_cell",
625
+ "follicular dendritic cell": "immune_cell",
626
+ "mature gamma-delta T cell": "hematopoietic_cell",
627
+ "thalamic excitatory neuron": "neural_cell",
628
+ "small bistratified retinal ganglion cell": "neural_cell",
629
+ "mature microglial cell": "neural_cell",
630
+ "intestinal epithelial cell": "epithelial_cell",
631
+ "epithelial cell of lung": "epithelial_cell",
632
+ "CD38-negative naive B cell": "hematopoietic_cell",
633
+ "urethra urothelial cell": "epithelial_cell",
634
+ "seminal vesicle glandular cell": "secretory_cell",
635
+ "type I cell of adrenal cortex": "epithelial_cell",
636
+ "germinal center B cell": "hematopoietic_cell",
637
+ "kidney cell": "unknown",
638
+ "kidney loop of Henle medullary thick ascending limb epithelial cell": "epithelial_cell",
639
+ "kidney loop of Henle cortical thick ascending limb epithelial cell": "epithelial_cell",
640
+ "kidney cortex tubule cell": "epithelial_cell",
641
+ "kidney glomerular epithelial cell": "epithelial_cell",
642
+ "preadipocyte": "connective_cell",
643
+ "type 6 cone bipolar cell (sensu Mus)": "neural_cell",
644
+ "type 5a cone bipolar cell": "neural_cell",
645
+ "type 7 cone bipolar cell (sensu Mus)": "neural_cell",
646
+ "type 3b cone bipolar cell": "neural_cell",
647
+ "type 3a cone bipolar cell": "neural_cell",
648
+ "type 5b cone bipolar cell": "neural_cell",
649
+ "type 5 cone bipolar cell (sensu Mus)": "neural_cell",
650
+ "type 8 cone bipolar cell (sensu Mus)": "neural_cell",
651
+ "type 9 cone bipolar cell (sensu Mus)": "neural_cell",
652
+ "type 2 cone bipolar cell (sensu Mus)": "neural_cell",
653
+ "type 4 cone bipolar cell (sensu Mus)": "neural_cell",
654
+ "type 1 cone bipolar cell (sensu Mus)": "neural_cell",
655
+ "cerebellar granule cell precursor": "neural_cell",
656
+ "unipolar brush cell": "epithelial_cell",
657
+ "glioblast": "precursor_cell",
658
+ "immature astrocyte": "neural_cell",
659
+ "meningeal macrophage": "immune_cell",
660
+ "noradrenergic cell": "secretory_cell",
661
+ "multi-ciliated epithelial cell": "ciliated_cell",
662
+ "pulmonary artery endothelial cell": "epithelial_cell",
663
+ "cone retinal bipolar cell": "neural_cell",
664
+ "retinal astrocyte": "neural_cell",
665
+ "efferent neuron": "neural_cell",
666
+ "enterocyte of epithelium proper of ileum": "ciliated_cell",
667
+ "ileal goblet cell": "epithelial_cell",
668
+ "smooth muscle fiber of ileum": "contractile_cell",
669
+ "enteroendocrine cell of small intestine": "secretory_cell",
670
+ "aortic smooth muscle cell": "contractile_cell",
671
+ "mesothelial cell of visceral pleura": "epithelial_cell",
672
+ "ciliated cell of the bronchus": "ciliated_cell",
673
+ "squamous epithelial cell": "epithelial_cell",
674
+ "nasal mucosa goblet cell": "epithelial_cell",
675
+ "memory regulatory T cell": "hematopoietic_cell",
676
+ "naive regulatory T cell": "hematopoietic_cell",
677
+ "myeloid suppressor cell": "immune_cell",
678
+ "adipose macrophage": "immune_cell",
679
+ "absorptive cell": "unknown",
680
+ "intestinal crypt stem cell of colon": "precursor_cell",
681
+ "mature astrocyte": "neural_cell",
682
+ "hair follicular keratinocyte": "epithelial_cell",
683
+ "sebum secreting cell": "secretory_cell",
684
+ "granular cell of epidermis": "epithelial_cell",
685
+ "anterior lens cell": "neural_cell",
686
+ "secondary lens fiber": "epithelial_cell",
687
+ "lens fiber cell": "epithelial_cell",
688
+ "A2 amacrine cell": "neural_cell",
689
+ "sperm": "unknown",
690
+ "abnormal cell": "unknown",
691
+ "myometrial cell": "unknown",
692
+ "epithelial cell of uterus": "epithelial_cell",
693
+ "prickle cell": "epithelial_cell",
694
+ "Merkel cell": "epithelial_cell",
695
+ "cortical thymic epithelial cell": "epithelial_cell",
696
+ "medullary thymic epithelial cell": "epithelial_cell",
697
+ "epicardial adipocyte": "connective_cell",
698
+ "peritubular capillary endothelial cell": "epithelial_cell",
699
+ "conjunctival epithelial cell": "epithelial_cell",
700
+ "glomerular capillary endothelial cell": "epithelial_cell",
701
+ "columnar/cuboidal epithelial cell": "epithelial_cell",
702
+ "kidney resident macrophage": "immune_cell",
703
+ "ON-blue cone bipolar cell": "neural_cell",
704
+ "CD8-alpha alpha positive, gamma-delta intraepithelial T cell": "hematopoietic_cell",
705
+ "NKp46-positive innate lymphoid cell, human": "hematopoietic_cell",
706
+ "neutrophil progenitor cell": "precursor_cell",
707
+ "skeletal muscle satellite stem cell": "precursor_cell",
708
+ "mucosal type mast cell": "hematopoietic_cell",
709
+ "metallothionein-positive alveolar macrophage": "immune_cell",
710
+ "cerebral cortex neuron": "neural_cell",
711
+ "basal cell of epithelium of trachea": "epithelial_cell",
712
+ "tracheal goblet cell": "epithelial_cell",
713
+ "photoreceptor cell": "neural_cell",
714
+ "cochlea auditory hair cell": "neural_cell",
715
+ "pinealocyte": "epithelial_cell",
716
+ "iris pigment epithelial cell": "epithelial_cell",
717
+ "radial glial cell": "neural_cell",
718
+ "GABAergic interneuron": "neural_cell",
719
+ "pancreatic endocrine cell": "secretory_cell",
720
+ "endothelial cell of sinusoid": "epithelial_cell",
721
+ "DN3 thymocyte": "hematopoietic_cell",
722
+ "DN1 thymic pro-T cell": "precursor_cell",
723
+ "parasol ganglion cell of retina": "neural_cell",
724
+ "epithelial cell of proximal tubule segment 3": "ciliated_cell",
725
+ "valve interstitial cell": "epithelial_cell",
726
+ "valve endothelial cell": "epithelial_cell",
727
+ "myocyte of sinoatrial node": "contractile_cell",
728
+ "colon goblet cell": "epithelial_cell",
729
+ "enteroendocrine cell of colon": "secretory_cell",
730
+ "paneth cell of colon": "secretory_cell",
731
+ "cholinergic neuron": "neural_cell",
732
+ "L4/5 intratelencephalic projecting glutamatergic neuron": "neural_cell",
733
+ "L6 intratelencephalic projecting glutamatergic neuron": "neural_cell",
734
+ "L3 intratelencephalic projecting glutamatergic neuron": "neural_cell",
735
+ "tanycyte": "neural_cell",
736
+ "IgG-negative class switched memory B cell": "hematopoietic_cell",
737
+ "IgG memory B cell": "hematopoietic_cell",
738
+ "indirect pathway medium spiny neuron": "neural_cell",
739
+ "direct pathway medium spiny neuron": "neural_cell",
740
+ "elicited macrophage": "immune_cell",
741
+ "alveolar type 1 fibroblast cell": "connective_cell",
742
+ "respiratory hillock cell": "epithelial_cell",
743
+ "epithelial cell of lower respiratory tract": "epithelial_cell",
744
+ "serous secreting cell": "secretory_cell",
745
+ "tracheobronchial serous cell": "secretory_cell",
746
+ "tracheobronchial goblet cell": "secretory_cell",
747
+ "bronchial goblet cell": "secretory_cell",
748
+ "epithelial fate stem cell": "epithelial_cell",
749
+ "lymphatic endothelial cell of medulla ceiling": "epithelial_cell",
750
+ "lymphatic endothelial cell of subcapsular sinus floor": "epithelial_cell",
751
+ "lymphatic endothelial cell of subcapsular sinus ceiling": "epithelial_cell",
752
+ "lymph node lymphatic vessel endothelial cell": "epithelial_cell",
753
+ "tissue-resident macrophage": "immune_cell",
754
+ "glandular epithelial cell": "epithelial_cell",
755
+ "L4 intratelencephalic projecting glutamatergic neuron": "neural_cell",
756
+ "L5/6 near-projecting glutamatergic neuron": "neural_cell",
757
+ "forebrain radial glial cell": "neural_cell",
758
+ "white adipocyte": "connective_cell",
759
+ "precursor cell": "precursor_cell",
760
+ "primary cultured cell": "unknown",
761
+ "liver dendritic cell": "immune_cell",
762
+ "giant bipolar cell": "neural_cell",
763
+ "eurydendroid cell": "neural_cell",
764
+ "type A enteroendocrine cell": "secretory_cell",
765
+ "type D enteroendocrine cell": "secretory_cell",
766
+ "serous cell of epithelium of trachea": "secretory_cell",
767
+ "T follicular regulatory cell": "hematopoietic_cell",
768
+ "enterocyte of epithelium of small intestine": "ciliated_cell",
769
+ "tuft cell of colon": "epithelial_cell",
770
+ "small intestine goblet cell": "epithelial_cell",
771
+ "epithelial cell of small intestine": "epithelial_cell",
772
+ "BEST4+ intestinal epithelial cell, human": "epithelial_cell",
773
+ "microfold cell of epithelium of small intestine": "immune_cell",
774
+ "foveolar cell of stomach": "epithelial_cell",
775
+ "mucous neck cell": "epithelial_cell",
776
+ "type G enteroendocrine cell": "secretory_cell",
777
+ "natural T-regulatory cell": "immune_cell",
778
+ "peptic cell": "epithelial_cell",
779
+ "P/D1 enteroendocrine cell": "secretory_cell",
780
+ "parietal cell": "epithelial_cell",
781
+ "eye photoreceptor cell": "neural_cell",
782
+ "keratocyte": "connective_cell",
783
+ "preosteoblast": "unknown",
784
+ "endosteal cell": "unknown",
785
+ "immature natural killer cell": "hematopoietic_cell",
786
+ "basal cell of epithelium of bronchus": "epithelial_cell",
787
+ "brush cell of bronchus": "epithelial_cell",
788
+ "sensory neuron of dorsal root ganglion": "neural_cell",
789
+ "parasympathetic neuron": "neural_cell",
790
+ "immature T cell": "hematopoietic_cell",
791
+ "epithelial cell of esophagus": "epithelial_cell",
792
+ "glandular cell of esophagus": "secretory_cell",
793
+ "perineuronal satellite cell": "neural_cell",
794
+ "olfactory ensheathing cell": "neural_cell",
795
+ "onychocyte": "embryonic_cell",
796
+ "epidermal Langerhans cell": "immune_cell",
797
+ "brush cell of trachea": "epithelial_cell",
798
+ "mesothelial cell of pleura": "epithelial_cell",
799
+ "subcutaneous adipocyte": "connective_cell",
800
+ "hepatoblast": "embryonic_cell",
801
+ "stromal cell of endometrium": "connective_cell",
802
+ "central nervous system neuron": "neural_cell",
803
+ "intraepithelial lymphocyte": "hematopoietic_cell",
804
+ "amygdala excitatory neuron": "neural_cell",
805
+ "bistratified retinal ganglion cell": "neural_cell",
806
+ "chromaffin cell": "embryonic_cell",
807
+ "chorionic trophoblast cell": "embryonic_cell",
808
+ "B-1a B cell": "hematopoietic_cell",
809
+ "ganglion interneuron": "neural_cell",
810
+ "B-1b B cell": "hematopoietic_cell",
811
+ "tongue muscle cell": "contractile_cell",
812
+ "cortical cell of adrenal gland": "embryonic_cell",
813
+ "histaminergic neuron": "neural_cell",
814
+ "epithelial cell of exocrine pancreas": "epithelial_cell",
815
+ "cerebellar Golgi cell": "neural_cell",
816
+ "kidney inner medulla collecting duct epithelial cell": "epithelial_cell",
817
+ "kidney pelvis urothelial cell": "epithelial_cell",
818
+ "atrioventricular bundle cell": "contractile_cell",
819
+ "peripheral blood mononuclear cell": "hematopoietic_cell",
820
+ "type II NK T cell": "hematopoietic_cell",
821
+ "immature alpha-beta T cell": "hematopoietic_cell",
822
+ "bipolar neuron": "neural_cell",
823
+ "brainstem motor neuron": "neural_cell",
824
+ "epithelial cell of lacrimal sac": "epithelial_cell",
825
+ "skeletal muscle fibroblast": "skeletal_muscle",
826
+ "salivary gland cell": "secretory_cell",
827
+ "astrocyte of the cerebellum": "neural_cell",
828
+ "CD4-positive, alpha-beta memory T cell, CD45RO-positive": "hematopoietic_cell",
829
+ "GIP cell": "epithelial_cell",
830
+ "decidual cell": "connective_cell",
831
+ "migratory enteric neural crest cell": "neural_cell",
832
+ "dentate gyrus neuron": "neural_cell",
833
+ "taste receptor cell": "epithelial_cell",
834
+ "dermis microvascular lymphatic vessel endothelial cell": "epithelial_cell",
835
+ "activated type II NK T cell": "hematopoietic_cell",
836
+ "bone marrow cell": "skeletal_muscle",
837
+ "CNS interneuron": "neural_cell",
838
+ "type I enteroendocrine cell": "secretory_cell",
839
+ "hair follicle melanocyte": "secretory_cell",
840
+ "kidney afferent arteriole endothelial cell": "epithelial_cell",
841
+ "multinucleated giant cell": "unknown",
842
+ "conjunctiva goblet cell": "epithelial_cell",
843
+ "thyroid follicular cell": "epithelial_cell",
844
+ "embryonic stem cell": "embryonic_cell",
845
+ "respiratory epithelial cell": "epithelial_cell",
846
+ "bronchial epithelial cell": "epithelial_cell",
847
+ "endothelial stalk cell": "epithelial_cell",
848
+ "enucleated reticulocyte": "hematopoietic_cell",
849
+ "kidney efferent arteriole endothelial cell": "epithelial_cell",
850
+ "hippocampal CA1-3 neuron": "neural_cell",
851
+ "intratelencephalic-projecting glutamatergic cortical neuron": "neural_cell",
852
+ "gingival epithelial cell": "epithelial_cell",
853
+ "visceromotor neuron": "neural_cell",
854
+ "sebaceous gland cell": "epithelial_cell",
855
+ "activated CD8-positive, alpha-beta T cell, human": "hematopoietic_cell",
856
+ "stromal cell of lamina propria of small intestine": "connective_cell",
857
+ "pre-B-I cell": "precursor_cell",
858
+ "immature Schwann cell": "precursor_cell",
859
+ "CD8-positive, alpha-beta cytokine secreting effector T cell": "hematopoietic_cell",
860
+ "epithelial cell of sweat gland": "epithelial_cell",
861
+ "ventricular cardiac muscle cell": "contractile_cell"
862
+ }
teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_disease_mapping.json ADDED
@@ -0,0 +1,125 @@
1
+ {
2
+ "Alzheimer disease": "brain_disease",
3
+ "B-cell acute lymphoblastic leukemia": "cancer_disease",
4
+ "Barrett esophagus": "digestive_disease",
5
+ "COVID-19": "infectious_disease",
6
+ "Crohn disease": "immune_disease",
7
+ "Crohn ileitis": "immune_disease",
8
+ "Lewy body dementia": "brain_disease",
9
+ "Parkinson disease": "brain_disease",
10
+ "Plasmodium malariae malaria": "infectious_disease",
11
+ "Wilms tumor": "cancer_disease",
12
+ "acute kidney failure": "kidney_disease",
13
+ "acute myeloid leukemia": "cancer_disease",
14
+ "acute myocardial infarction": "cardiovascular_disease",
15
+ "acute promyelocytic leukemia": "cancer_disease",
16
+ "adenocarcinoma": "cancer_disease",
17
+ "age related macular degeneration 7": "other_disease",
18
+ "amyotrophic lateral sclerosis": "brain_disease",
19
+ "amyotrophic lateral sclerosis 26 with or without frontotemporal dementia": "brain_disease",
20
+ "anencephaly": "genetic_disease",
21
+ "arrhythmogenic right ventricular cardiomyopathy": "cardiovascular_disease",
22
+ "aspiration pneumonia": "infectious_disease",
23
+ "autosomal dominant polycystic kidney disease": "genetic_disease",
24
+ "basal cell carcinoma": "cancer_disease",
25
+ "basal laminar drusen": "other_disease",
26
+ "benign prostatic hyperplasia": "other_disease",
27
+ "blastoma": "cancer_disease",
28
+ "brain neoplasm": "cancer_disease",
29
+ "breast cancer": "cancer_disease",
30
+ "breast carcinoma": "cancer_disease",
31
+ "cardiomyopathy": "cardiovascular_disease",
32
+ "cataract": "other_disease",
33
+ "chromophobe renal cell carcinoma": "cancer_disease",
34
+ "chronic kidney disease": "kidney_disease",
35
+ "chronic obstructive pulmonary disease": "immune_disease",
36
+ "chronic rhinitis": "immune_disease",
37
+ "clear cell renal carcinoma": "cancer_disease",
38
+ "colon sessile serrated adenoma/polyp": "cancer_disease",
39
+ "colorectal cancer": "cancer_disease",
40
+ "common variable immunodeficiency": "immune_disease",
41
+ "congenital heart disease": "cardiovascular_disease",
42
+ "cystic fibrosis": "immune_disease",
43
+ "dementia": "brain_disease",
44
+ "diabetic kidney disease": "immune_disease",
45
+ "digestive system disorder": "digestive_disease",
46
+ "dilated cardiomyopathy": "cardiovascular_disease",
47
+ "endocrine pancreas disorder": "other_disease",
48
+ "epidermolysis bullosa": "other_disease",
49
+ "epilepsy": "brain_disease",
50
+ "frontotemporal dementia": "brain_disease",
51
+ "gastric intestinal metaplasia": "cancer_disease",
52
+ "gastritis": "digestive_disease",
53
+ "gingivitis": "other_disease",
54
+ "glioblastoma": "cancer_disease",
55
+ "heart disorder": "cardiovascular_disease",
56
+ "heart failure": "cardiovascular_disease",
57
+ "hydrocephalus": "brain_disease",
58
+ "hydrosalpinx": "other_disease",
59
+ "hyperplastic polyp": "cancer_disease",
60
+ "hypersensitivity pneumonitis": "immune_disease",
61
+ "influenza": "infectious_disease",
62
+ "injury": "other_disease",
63
+ "interstitial lung disease": "respiratory_disease",
64
+ "juvenile dermatomyositis": "immune_disease",
65
+ "keloid": "other_disease",
66
+ "kidney benign neoplasm": "cancer_disease",
67
+ "kidney oncocytoma": "cancer_disease",
68
+ "listeriosis": "infectious_disease",
69
+ "localized scleroderma": "immune_disease",
70
+ "long COVID-19": "infectious_disease",
71
+ "luminal A breast carcinoma": "cancer_disease",
72
+ "luminal B breast carcinoma": "cancer_disease",
73
+ "lung adenocarcinoma": "cancer_disease",
74
+ "lung large cell carcinoma": "cancer_disease",
75
+ "lymphadenitis": "infectious_disease",
76
+ "lymphangioleiomyomatosis": "respiratory_disease",
77
+ "macular degeneration": "other_disease",
78
+ "malignant ovarian serous tumor": "cancer_disease",
79
+ "malignant pancreatic neoplasm": "cancer_disease",
80
+ "metastatic melanoma": "cancer_disease",
81
+ "multiple sclerosis": "brain_disease",
82
+ "myocardial infarction": "cardiovascular_disease",
83
+ "neuroendocrine carcinoma": "cancer_disease",
84
+ "non-compaction cardiomyopathy": "cardiovascular_disease",
85
+ "non-small cell lung carcinoma": "cancer_disease",
86
+ "non-specific interstitial pneumonia": "immune_disease",
87
+ "nonpapillary renal cell carcinoma": "cancer_disease",
88
+ "normal": "healthy",
89
+ "opiate dependence": "other_disease",
90
+ "periodontitis": "other_disease",
91
+ "pilocytic astrocytoma": "cancer_disease",
92
+ "plasma cell myeloma": "cancer_disease",
93
+ "pleomorphic carcinoma": "cancer_disease",
94
+ "premalignant hematological system disease": "cancer_disease",
95
+ "primary biliary cholangitis": "immune_disease",
96
+ "primary sclerosing cholangitis": "immune_disease",
97
+ "pulmonary emphysema": "respiratory_disease",
98
+ "pulmonary fibrosis": "immune_disease",
99
+ "pulmonary sarcoidosis": "immune_disease",
100
+ "renal cell carcinoma": "cancer_disease",
101
+ "respiratory failure": "respiratory_disease",
102
+ "respiratory system disorder": "respiratory_disease",
103
+ "severe acute respiratory syndrome": "infectious_disease",
104
+ "small cell lung carcinoma": "cancer_disease",
105
+ "squamous cell lung carcinoma": "cancer_disease",
106
+ "systemic lupus erythematosus": "immune_disease",
107
+ "temporal lobe epilepsy": "brain_disease",
108
+ "tongue cancer": "cancer_disease",
109
+ "toxoplasmosis": "infectious_disease",
110
+ "triple-negative breast carcinoma": "cancer_disease",
111
+ "trisomy 18": "genetic_disease",
112
+ "tubular adenoma": "cancer_disease",
113
+ "tubulovillous adenoma": "cancer_disease",
114
+ "type 1 diabetes mellitus": "immune_disease",
115
+ "type 2 diabetes mellitus": "immune_disease",
116
+ "B-cell non-Hodgkin lymphoma": "cancer_disease",
117
+ "colorectal neoplasm": "cancer_disease",
118
+ "follicular lymphoma": "cancer_disease",
119
+ "Down syndrome": "genetic_disease",
120
+ "gastric cancer": "cancer_disease",
121
+ "post-COVID-19 disorder": "infectious_disease",
122
+ "encephalomyelitis": "brain_disease",
123
+ "pneumonia": "infectious_disease",
124
+ "rheumatoid arthritis": "immune_disease"
125
+ }
teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_sex_mapping.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "female": "female",
3
+ "male": "male",
4
+ "unknown": "unknown"
5
+ }
teddy/data_processing/utils/bio_annotations/data/mappings/all_filtered_tissue_mapping.json ADDED
@@ -0,0 +1,415 @@
1
+ {
2
+ "Brodmann (1909) area 19": "central_nervous_tissue",
3
+ "Brodmann (1909) area 23": "central_nervous_tissue",
4
+ "Brodmann (1909) area 25": "central_nervous_tissue",
5
+ "Brodmann (1909) area 4": "central_nervous_tissue",
6
+ "Brodmann (1909) area 46": "central_nervous_tissue",
7
+ "abdomen": "musculature_tissue",
8
+ "abdominal wall": "musculature_tissue",
9
+ "adipose tissue": "adipose_tissue",
10
+ "adnexa of uterus": "reproductive_tissue",
11
+ "adrenal gland": "endocrine_tissue",
12
+ "adrenal tissue": "endocrine_tissue",
13
+ "agranular insular cortex": "central_nervous_tissue",
14
+ "alveolus of lung": "respiratory_tissue",
15
+ "ampulla of uterine tube": "reproductive_tissue",
16
+ "amygdala": "central_nervous_tissue",
17
+ "angular gyrus": "central_nervous_tissue",
18
+ "anterior cingulate cortex": "central_nervous_tissue",
19
+ "anterior cingulate gyrus": "central_nervous_tissue",
20
+ "anterior hypothalamic region": "central_nervous_tissue",
21
+ "anterior part of tongue": "sensory_tissue",
22
+ "anterior wall of left ventricle": "cardiovascular_tissue",
23
+ "anterolateral visual area": "central_nervous_tissue",
24
+ "aorta": "cardiovascular_tissue",
25
+ "apex of heart": "cardiovascular_tissue",
26
+ "artery": "cardiovascular_tissue",
27
+ "ascending colon": "digestive_tissue",
28
+ "ascitic fluid": "unknown",
29
+ "atrioventricular node": "cardiovascular_tissue",
30
+ "auditory cortex": "central_nervous_tissue",
31
+ "autopod skin": "integumentary_tissue",
32
+ "axilla": "integumentary_tissue",
33
+ "barrel cortex": "central_nervous_tissue",
34
+ "basal forebrain": "central_nervous_tissue",
35
+ "basal ganglion": "central_nervous_tissue",
36
+ "basal zone of heart": "cardiovascular_tissue",
37
+ "bladder lumen": "renal_tissue",
38
+ "bladder organ": "renal_tissue",
39
+ "blood": "hematopoietic_tissue",
40
+ "body of stomach": "digestive_tissue",
41
+ "bone marrow": "hematopoietic_tissue",
42
+ "bone spine": "hematopoietic_tissue",
43
+ "brain": "central_nervous_tissue",
44
+ "brain gray matter": "central_nervous_tissue",
45
+ "brain meninx": "central_nervous_tissue",
46
+ "brain ventricle": "central_nervous_tissue",
47
+ "brain white matter": "central_nervous_tissue",
48
+ "brainstem": "central_nervous_tissue",
49
+ "breast": "exocrine_tissue",
50
+ "bronchial epithelial cell": "respiratory_tissue",
51
+ "bronchopulmonary lymph node": "immune_tissue",
52
+ "bronchus": "respiratory_tissue",
53
+ "brown adipose tissue": "immune_tissue",
54
+ "brown preadipocyte": "adipose_tissue",
55
+ "caecum": "digestive_tissue",
56
+ "caecum epithelium": "digestive_tissue",
57
+ "cardia of stomach": "digestive_tissue",
58
+ "cardiac atrium": "cardiovascular_tissue",
59
+ "cardiac ventricle": "cardiovascular_tissue",
60
+ "caudal ganglionic eminence": "central_nervous_tissue",
61
+ "caudate lobe of liver": "hepatic_tissue",
62
+ "caudate nucleus": "central_nervous_tissue",
63
+ "cerebellar cortex": "central_nervous_tissue",
64
+ "cerebellar vermis": "central_nervous_tissue",
65
+ "cerebellum": "central_nervous_tissue",
66
+ "cerebellum lobule": "central_nervous_tissue",
67
+ "cerebellum vermis lobule": "central_nervous_tissue",
68
+ "cerebral cortex": "central_nervous_tissue",
69
+ "cerebral nuclei": "central_nervous_tissue",
70
+ "cerebrocerebellum": "central_nervous_tissue",
71
+ "cervical lymph node": "immune_tissue",
72
+ "cervical spinal cord white matter": "central_nervous_tissue",
73
+ "chorionic villus": "reproductive_tissue",
74
+ "chorioretinal region": "eye_tissue",
75
+ "choroid plexus": "central_nervous_tissue",
76
+ "ciliary body": "eye_tissue",
77
+ "cingulate cortex": "central_nervous_tissue",
78
+ "claustrum of brain": "central_nervous_tissue",
79
+ "colon": "digestive_tissue",
80
+ "colonic epithelium": "digestive_tissue",
81
+ "conjunctiva": "eye_tissue",
82
+ "cornea": "eye_tissue",
83
+ "corneo-scleral junction": "eye_tissue",
84
+ "coronary artery": "cardiovascular_tissue",
85
+ "corpus callosum": "central_nervous_tissue",
86
+ "cortex of kidney": "renal_tissue",
87
+ "cortical layer II/III": "central_nervous_tissue",
88
+ "cortical layer V": "central_nervous_tissue",
89
+ "cortical layer VI": "central_nervous_tissue",
90
+ "cortical plate": "central_nervous_tissue",
91
+ "cortical subplate": "central_nervous_tissue",
92
+ "cultured cell": "unknown",
93
+ "decidua": "reproductive_tissue",
94
+ "decidua basalis": "embryonic_tissue",
95
+ "dentate nucleus": "central_nervous_tissue",
96
+ "dermis": "integumentary_tissue",
97
+ "descending colon": "digestive_tissue",
+ "diaphragm": "musculature_tissue",
+ "diencephalon": "central_nervous_tissue",
+ "dorsal thalamus": "central_nervous_tissue",
+ "dorsolateral prefrontal cortex": "central_nervous_tissue",
+ "duodeno-jejunal junction": "digestive_tissue",
+ "duodenum": "digestive_tissue",
+ "dura mater": "central_nervous_tissue",
+ "embryo": "embryonic_tissue",
+ "embryonic stem cell": "embryonic_tissue",
+ "endocrine pancreas": "endocrine_tissue",
+ "endometrium": "reproductive_tissue",
+ "endothelial cell": "cardiovascular_tissue",
+ "entorhinal cortex": "central_nervous_tissue",
+ "epididymal fat pad": "cardiovascular_tissue",
+ "epithelial cell of alveolus of lung": "respiratory_tissue",
+ "epithelial cell of lung": "respiratory_tissue",
+ "epithelium of esophagus": "digestive_tissue",
+ "epithelium of small intestine": "digestive_tissue",
+ "epithelium of trachea": "respiratory_tissue",
+ "esophagogastric junction": "digestive_tissue",
+ "esophagus": "digestive_tissue",
+ "esophagus muscularis mucosa": "digestive_tissue",
+ "exocrine pancreas": "exocrine_tissue",
+ "eye": "eye_tissue",
+ "eye trabecular meshwork": "eye_tissue",
+ "fallopian tube": "reproductive_tissue",
+ "fimbria of uterine tube": "reproductive_tissue",
+ "forebrain": "central_nervous_tissue",
+ "forelimb": "musculature_tissue",
+ "fovea centralis": "eye_tissue",
+ "frontal cortex": "central_nervous_tissue",
+ "frontal lobe": "central_nervous_tissue",
+ "gallbladder": "digestive_tissue",
+ "ganglionic eminence": "central_nervous_tissue",
+ "gastrocnemius": "musculature_tissue",
+ "gingiva": "exocrine_tissue",
+ "glabella skin": "integumentary_tissue",
+ "gonad": "reproductive_tissue",
+ "gonad primordium": "reproductive_tissue",
+ "gonadal fat pad": "reproductive_tissue",
+ "gustatory cortex": "central_nervous_tissue",
+ "gut wall": "digestive_tissue",
+ "head of caudate nucleus": "central_nervous_tissue",
+ "heart": "cardiovascular_tissue",
+ "heart left ventricle": "cardiovascular_tissue",
+ "heart right ventricle": "cardiovascular_tissue",
+ "hemisphere part of cerebellar posterior lobe": "central_nervous_tissue",
+ "hepatic cecum": "hepatic_tissue",
+ "hepatic flexure of colon": "digestive_tissue",
+ "hindbrain": "central_nervous_tissue",
+ "hindgut": "digestive_tissue",
+ "hindlimb": "musculature_tissue",
+ "hindlimb skin": "integumentary_tissue",
+ "hippocampal formation": "central_nervous_tissue",
+ "hypothalamus": "central_nervous_tissue",
+ "ileal epithelium": "digestive_tissue",
+ "ileum": "digestive_tissue",
+ "inferior parietal cortex": "central_nervous_tissue",
+ "inferior temporal gyrus": "central_nervous_tissue",
+ "inguinal fat pad": "adipose_tissue",
+ "inguinal lymph node": "immune_tissue",
+ "inguinal part of abdomen": "musculature_tissue",
+ "inguinal region skin": "integumentary_tissue",
+ "inner medulla of kidney": "adipose_tissue",
+ "insular cortex": "central_nervous_tissue",
+ "interventricular septum": "cardiovascular_tissue",
+ "intestine": "digestive_tissue",
+ "iris": "eye_tissue",
+ "islet of Langerhans": "endocrine_tissue",
+ "isthmus of fallopian tube": "reproductive_tissue",
+ "jejunal epithelium": "digestive_tissue",
+ "jejunum": "digestive_tissue",
+ "kidney": "renal_tissue",
+ "kidney blood vessel": "cardiovascular_tissue",
+ "lacrimal gland": "exocrine_tissue",
+ "lamina propria": "digestive_tissue",
+ "lamina propria of large intestine": "digestive_tissue",
+ "lamina propria of mucosa of colon": "digestive_tissue",
+ "lamina propria of small intestine": "digestive_tissue",
+ "large intestine": "digestive_tissue",
+ "lateral amygdaloid nucleus": "central_nervous_tissue",
+ "lateral entorhinal cortex": "central_nervous_tissue",
+ "lateral ganglionic eminence": "central_nervous_tissue",
+ "lateral geniculate body": "central_nervous_tissue",
+ "lateral nuclear group of thalamus": "central_nervous_tissue",
+ "lateral visual area": "central_nervous_tissue",
+ "left cardiac atrium": "cardiovascular_tissue",
+ "left colon": "digestive_tissue",
+ "left frontal lobe": "central_nervous_tissue",
+ "left lung": "respiratory_tissue",
+ "left ovary": "reproductive_tissue",
+ "left parietal lobe": "central_nervous_tissue",
+ "left temporal lobe": "central_nervous_tissue",
+ "lens of camera-type eye": "eye_tissue",
+ "limb muscle": "musculature_tissue",
+ "lingula of left lung": "respiratory_tissue",
+ "liver": "hepatic_tissue",
+ "lower esophagus": "digestive_tissue",
+ "lower leg skin": "integumentary_tissue",
+ "lower lobe of left lung": "respiratory_tissue",
+ "lower lobe of right lung": "respiratory_tissue",
+ "lung": "respiratory_tissue",
+ "lung parenchyma": "respiratory_tissue",
+ "lymph node": "immune_tissue",
+ "macula lutea": "eye_tissue",
+ "macula lutea proper": "eye_tissue",
+ "mammary gland": "exocrine_tissue",
+ "mammary gland epithelial cell": "exocrine_tissue",
+ "medial amygdaloid nucleus": "central_nervous_tissue",
+ "medial dorsal nucleus of thalamus": "central_nervous_tissue",
+ "medial entorhinal cortex": "central_nervous_tissue",
+ "medial ganglionic eminence": "central_nervous_tissue",
+ "medial orbital frontal cortex": "central_nervous_tissue",
+ "medulla oblongata": "central_nervous_tissue",
+ "meningeal dura mater": "central_nervous_tissue",
+ "mesenteric artery": "cardiovascular_tissue",
+ "mesenteric fat pad": "immune_tissue",
+ "mesenteric lymph node": "immune_tissue",
+ "mesoderm": "reproductive_tissue",
+ "midbrain": "central_nervous_tissue",
+ "middle lobe of right lung": "respiratory_tissue",
+ "middle temporal gyrus": "central_nervous_tissue",
+ "mucosa": "digestive_tissue",
+ "muscle of abdomen": "musculature_tissue",
+ "muscle of pelvic diaphragm": "musculature_tissue",
+ "muscle organ": "musculature_tissue",
+ "muscle tissue": "musculature_tissue",
+ "myelencephalon": "central_nervous_tissue",
+ "myometrium": "reproductive_tissue",
+ "nasal cavity": "sensory_tissue",
+ "nasopharynx": "respiratory_tissue",
+ "neocortex": "central_nervous_tissue",
+ "neural tube": "embryonic_tissue",
+ "nose": "sensory_tissue",
+ "nose skin": "integumentary_tissue",
+ "nucleus accumbens": "central_nervous_tissue",
+ "occipital cortex": "central_nervous_tissue",
+ "occipital lobe": "central_nervous_tissue",
+ "olfactory region": "sensory_tissue",
+ "omental fat pad": "adipose_tissue",
+ "omentum": "musculature_tissue",
+ "orbitofrontal cortex": "central_nervous_tissue",
+ "outer medulla of kidney": "adipose_tissue",
+ "ovary": "reproductive_tissue",
+ "pallidum": "central_nervous_tissue",
+ "pancreas": "endocrine_tissue",
+ "paracolic gutter": "musculature_tissue",
+ "parietal cortex": "central_nervous_tissue",
+ "parietal lobe": "central_nervous_tissue",
+ "parietal peritoneum": "musculature_tissue",
+ "parotid gland": "exocrine_tissue",
+ "perifoveal part of retina": "eye_tissue",
+ "periovarian fat pad": "reproductive_tissue",
+ "peripheral lymph node": "immune_tissue",
+ "peripheral region of retina": "eye_tissue",
+ "peripheral zone of prostate": "reproductive_tissue",
+ "perirenal fat": "adipose_tissue",
+ "perirhinal cortex": "central_nervous_tissue",
+ "peritoneum": "musculature_tissue",
+ "pia mater": "central_nervous_tissue",
+ "pigment epithelium of eye": "eye_tissue",
+ "placenta": "reproductive_tissue",
+ "pleura": "respiratory_tissue",
+ "pleural effusion": "respiratory_tissue",
+ "pons": "central_nervous_tissue",
+ "posterior hypothalamic region": "central_nervous_tissue",
+ "posterior parietal association areas": "central_nervous_tissue",
+ "posterior part of tongue": "sensory_tissue",
+ "preadipocyte": "musculature_tissue",
+ "prefrontal cortex": "central_nervous_tissue",
+ "primary auditory cortex": "central_nervous_tissue",
+ "primary motor cortex": "central_nervous_tissue",
+ "primary somatosensory cortex": "central_nervous_tissue",
+ "primary visual cortex": "central_nervous_tissue",
+ "prostate gland": "reproductive_tissue",
+ "pubis": "hematopoietic_tissue",
+ "putamen": "central_nervous_tissue",
+ "pyloric antrum": "digestive_tissue",
+ "rectum": "digestive_tissue",
+ "rectus abdominis muscle": "musculature_tissue",
+ "renal medulla": "renal_tissue",
+ "renal papilla": "renal_tissue",
+ "renal pelvis": "renal_tissue",
+ "respiratory airway": "respiratory_tissue",
+ "respiratory basal cell": "respiratory_tissue",
+ "retina": "eye_tissue",
+ "retinal neural layer": "eye_tissue",
+ "retrosplenial granular cortex": "central_nervous_tissue",
+ "retrosplenial region": "central_nervous_tissue",
+ "rib": "hematopoietic_tissue",
+ "right cardiac atrium": "cardiovascular_tissue",
+ "right colon": "digestive_tissue",
+ "right frontal lobe": "central_nervous_tissue",
+ "right lung": "respiratory_tissue",
+ "right occipital lobe": "central_nervous_tissue",
+ "right ovary": "reproductive_tissue",
+ "right parietal lobe": "central_nervous_tissue",
+ "right temporal lobe": "central_nervous_tissue",
+ "saliva": "exocrine_tissue",
+ "scalp": "integumentary_tissue",
+ "sclera": "eye_tissue",
+ "secondary somatosensory cortex": "central_nervous_tissue",
+ "secondary visual cortex": "central_nervous_tissue",
+ "sigmoid colon": "digestive_tissue",
+ "sinoatrial node": "cardiovascular_tissue",
+ "skin epidermis": "integumentary_tissue",
+ "skin of abdomen": "integumentary_tissue",
+ "skin of back": "integumentary_tissue",
+ "skin of body": "integumentary_tissue",
+ "skin of breast": "integumentary_tissue",
+ "skin of cheek": "integumentary_tissue",
+ "skin of chest": "integumentary_tissue",
+ "skin of external ear": "integumentary_tissue",
+ "skin of face": "integumentary_tissue",
+ "skin of forearm": "integumentary_tissue",
+ "skin of forehead": "integumentary_tissue",
+ "skin of hip": "integumentary_tissue",
+ "skin of leg": "integumentary_tissue",
+ "skin of pes": "integumentary_tissue",
+ "skin of prepuce of penis": "integumentary_tissue",
+ "skin of scalp": "integumentary_tissue",
+ "skin of shoulder": "integumentary_tissue",
+ "skin of temple": "integumentary_tissue",
+ "skin of trunk": "integumentary_tissue",
+ "small intestine": "digestive_tissue",
+ "spinal cord": "central_nervous_tissue",
+ "spleen": "immune_tissue",
+ "stomach": "digestive_tissue",
+ "striatum": "central_nervous_tissue",
+ "subcutaneous abdominal adipose tissue": "adipose_tissue",
+ "subcutaneous adipose tissue": "adipose_tissue",
+ "subdural space": "central_nervous_tissue",
+ "subicular complex": "central_nervous_tissue",
+ "sublingual gland": "exocrine_tissue",
+ "submucosa of ascending colon": "digestive_tissue",
+ "submucosa of ileum": "digestive_tissue",
+ "submucosal esophageal gland": "digestive_tissue",
+ "substantia nigra pars compacta": "central_nervous_tissue",
+ "superior frontal gyrus": "central_nervous_tissue",
+ "superior parietal cortex": "central_nervous_tissue",
+ "superior temporal sulcus": "central_nervous_tissue",
+ "telencephalon": "central_nervous_tissue",
+ "temporal cortex": "central_nervous_tissue",
+ "temporal lobe": "central_nervous_tissue",
+ "temporoparietal junction": "central_nervous_tissue",
+ "tendon of semitendinosus": "musculature_tissue",
+ "testis": "reproductive_tissue",
+ "thalamic complex": "central_nervous_tissue",
+ "thoracic lymph node": "immune_tissue",
+ "thymus": "immune_tissue",
+ "thyroid gland": "endocrine_tissue",
+ "tongue": "sensory_tissue",
+ "tonsil": "immune_tissue",
+ "trachea": "respiratory_tissue",
+ "tracheal epithelial cell": "respiratory_tissue",
+ "transition zone of prostate": "reproductive_tissue",
+ "transverse colon": "digestive_tissue",
+ "trophoblast": "embryonic_tissue",
+ "trophoblast cell": "embryonic_tissue",
+ "umbilical cord blood": "hematopoietic_tissue",
+ "upper leg skin": "integumentary_tissue",
+ "upper lobe of left lung": "respiratory_tissue",
+ "upper lobe of right lung": "respiratory_tissue",
+ "upper outer quadrant of breast": "exocrine_tissue",
+ "ureter": "renal_tissue",
+ "urethra": "renal_tissue",
+ "urinary bladder": "renal_tissue",
+ "uterine cervix": "reproductive_tissue",
+ "uterus": "reproductive_tissue",
+ "vasculature": "cardiovascular_tissue",
+ "vault of skull": "hematopoietic_tissue",
+ "vein": "cardiovascular_tissue",
+ "venous blood": "cardiovascular_tissue",
+ "ventral lateral nucleus of thalamus": "cardiovascular_tissue",
+ "ventral thalamus": "cardiovascular_tissue",
+ "ventricular system of brain": "central_nervous_tissue",
+ "vermiform appendix": "digestive_tissue",
+ "visceral abdominal adipose tissue": "adipose_tissue",
+ "visual cortex": "central_nervous_tissue",
+ "white matter": "central_nervous_tissue",
+ "white matter of cerebellum": "central_nervous_tissue",
+ "yolk sac": "embryonic_tissue",
+ "zone of skin": "integumentary_tissue",
+ "basolateral amygdaloid nuclear complex": "central_nervous_tissue",
+ "optic cup": "eye_tissue",
+ "pontine nuclear group": "central_nervous_tissue",
+ "arm skin": "integumentary_tissue",
+ "central amygdaloid nucleus": "central_nervous_tissue",
+ "caudate-putamen": "central_nervous_tissue",
+ "insula": "central_nervous_tissue",
+ "pulvinar nucleus": "central_nervous_tissue",
+ "cuneus cortex": "central_nervous_tissue",
+ "granular insular cortex": "central_nervous_tissue",
+ "hippocampal field": "central_nervous_tissue",
+ "T cell": "hematopoietic_tissue",
+ "dentate gyrus of hippocampal formation": "central_nervous_tissue",
+ "central nucleus of inferior colliculus": "central_nervous_tissue",
+ "olfactory cortex": "central_nervous_tissue",
+ "skeletal muscle tissue": "musculature_tissue",
+ "body of caudate nucleus": "central_nervous_tissue",
+ "substantia innominata": "central_nervous_tissue",
+ "corticomedial nuclear complex": "central_nervous_tissue",
+ "globus pallidus": "central_nervous_tissue",
+ "renal glomerulus": "renal_tissue",
+ "anterior cerebral artery": "cardiovascular_tissue",
+ "lateral septal complex": "central_nervous_tissue",
+ "coronal suture": "cardiovascular_tissue",
+ "bed nucleus of stria terminalis": "central_nervous_tissue",
+ "subiculum": "central_nervous_tissue",
+ "piriform cortex": "central_nervous_tissue",
+ "mesonephros": "renal_tissue",
+ "posterior parahippocampal gyrus": "central_nervous_tissue",
+ "cerebellar hemisphere": "central_nervous_tissue",
+ "Brodmann (1909) area 24": "central_nervous_tissue",
+ "septal nuclear complex": "central_nervous_tissue",
+ "anterior olfactory nucleus": "sensory_tissue",
+ "Brodmann (1909) area 38": "central_nervous_tissue"
+ }
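
A mapping like this collapses fine-grained tissue labels into a small set of coarse categories. The sketch below shows one way such a lookup could be applied. The inline dictionary is a five-entry excerpt copied from the mapping in this diff; the `coarsen_tissue` helper and its `"unknown"` fallback for unmapped labels are illustrative assumptions, not code from this commit.

```python
# Five entries copied from the tissue mapping shown in this diff.
TISSUE_MAPPING = {
    "duodenum": "digestive_tissue",
    "heart": "cardiovascular_tissue",
    "liver": "hepatic_tissue",
    "retina": "eye_tissue",
    "spleen": "immune_tissue",
}


def coarsen_tissue(label, mapping, fallback="unknown"):
    """Return the coarse category for a fine-grained tissue label.

    Labels absent from the mapping get `fallback` (an assumption made
    for this sketch; the commit does not show how misses are handled).
    """
    return mapping.get(label, fallback)


labels = ["heart", "retina", "tail"]  # "tail" is deliberately unmapped
coarse = [coarsen_tissue(label, TISSUE_MAPPING) for label in labels]
print(coarse)
```

In practice the full JSON file would presumably be loaded with `json.load` and passed in place of the excerpt.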
teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/all_filtered_cell_probs.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "ciliated_cell": 0.9,
+ "connective_cell": 0.514,
+ "contractile_cell": 0.9,
+ "embryonic_cell": 0.386,
+ "epithelial_cell": 0.332,
+ "hematopoietic_cell": 0.358,
+ "immune_cell": 0.708,
+ "neural_cell": 0.073,
+ "perivascular_cell": 0.9,
+ "precursor_cell": 0.902,
+ "secretory_cell": 0.9,
+ "skeletal_muscle": 0.9,
+ "unknown": 0
+ }
teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/all_filtered_disease_probs.json ADDED
@@ -0,0 +1,13 @@
+ {
+ "brain_disease": 0.952,
+ "cancer_disease": 0.531,
+ "cardiovascular_disease": 0.95,
+ "digestive_disease": 0.95,
+ "genetic_disease": 0.95,
+ "immune_disease": 0.95,
+ "infectious_disease": 0.95,
+ "kidney_disease": 0.95,
+ "other_disease": 0.95,
+ "respiratory_disease": 0.95,
+ "healthy": 0.112
+ }
teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/all_filtered_sex_probs.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "female":0.238,
+ "male":0.316,
+ "unknown":0
+ }
teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/all_filtered_tissue_probs.json ADDED
@@ -0,0 +1,20 @@
+ {
+ "adipose_tissue": 0.9,
+ "cardiovascular_tissue": 0.853,
+ "central_nervous_tissue": 0.067,
+ "digestive_tissue": 0.9,
+ "embryonic_tissue": 0.106,
+ "endocrine_tissue": 0.9,
+ "exocrine_tissue": 0.508,
+ "eye_tissue": 0.634,
+ "hematopoietic_tissue": 0.384,
+ "hepatic_tissue": 0.9,
+ "immune_tissue": 0.9,
+ "integumentary_tissue": 0.9,
+ "musculature_tissue": 0.822,
+ "renal_tissue": 0.833,
+ "reproductive_tissue": 0.9,
+ "respiratory_tissue": 0.493,
+ "sensory_tissue": 0.9,
+ "unknown": 0
+ }
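
This diff does not show how the collator consumes the `sampling_probs_for_collator` files. The sketch below assumes, purely for illustration, that each value is a per-label Bernoulli probability, so that a label with probability 0 (such as `unknown`) is never drawn. The three entries are copied from `all_filtered_tissue_probs.json`; the `draw` helper, the seed, and the draw count are hypothetical.

```python
import random

# Three values copied from all_filtered_tissue_probs.json.
TISSUE_PROBS = {
    "adipose_tissue": 0.9,
    "central_nervous_tissue": 0.067,
    "unknown": 0,
}


def draw(label, probs, rng):
    """Return True with probability probs[label] (one Bernoulli draw)."""
    return rng.random() < probs[label]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
n_draws = 20_000

# Empirical rate at which each label is drawn; it should sit close to
# the configured probability.
rates = {
    label: sum(draw(label, TISSUE_PROBS, rng) for _ in range(n_draws)) / n_draws
    for label in TISSUE_PROBS
}
print(rates)
```

Under this reading, abundant categories such as `central_nervous_tissue` would be drawn rarely and scarce ones often, which is one way to rebalance labels; whether the collator actually works this way is not established by the diff.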
teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/cell_probs_for_classification.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "ciliated_cell": 1,
+ "connective_cell": 1,
+ "contractile_cell": 1,
+ "embryonic_cell": 1,
+ "epithelial_cell": 1,
+ "hematopoietic_cell": 1,
+ "immune_cell": 1,
+ "neural_cell": 1,
+ "perivascular_cell": 1,
+ "precursor_cell": 1,
+ "secretory_cell": 1,
+ "skeletal_muscle": 1,
+ "unknown": 0
+ }
teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/disease_probs_for_classification.json ADDED
@@ -0,0 +1,13 @@
+ {
+ "brain_disease": 1,
+ "cancer_disease": 1,
+ "cardiovascular_disease": 1,
+ "digestive_disease": 1,
+ "genetic_disease": 1,
+ "immune_disease": 1,
+ "infectious_disease": 1,
+ "kidney_disease": 1,
+ "other_disease": 1,
+ "respiratory_disease": 1,
+ "healthy": 1
+ }
teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/sex_probs_for_classification.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "female": 1,
+ "male": 1,
+ "unknown":0
+ }
teddy/data_processing/utils/bio_annotations/data/sampling_probs_for_collator/tissue_probs_for_classification.json ADDED
@@ -0,0 +1,20 @@
+ {
+ "adipose_tissue": 1,
+ "cardiovascular_tissue": 1,
+ "central_nervous_tissue": 1,
+ "digestive_tissue": 1,
+ "embryonic_tissue": 1,
+ "endocrine_tissue": 1,
+ "exocrine_tissue": 1,
+ "eye_tissue": 1,
+ "hematopoietic_tissue": 1,
+ "hepatic_tissue": 1,
+ "immune_tissue": 1,
+ "integumentary_tissue": 1,
+ "musculature_tissue": 1,
+ "renal_tissue": 1,
+ "reproductive_tissue": 1,
+ "respiratory_tissue": 1,
+ "sensory_tissue": 1,
+ "unknown": 0
+ }
teddy/data_processing/utils/gene_mapping/__init__.py ADDED
File without changes
teddy/data_processing/utils/gene_mapping/data/2407_ensembl_processed.txt ADDED
The diff for this file is too large to render. See raw diff
 
teddy/data_processing/utils/gene_mapping/data/2407_hgnc_mapping.any2any.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:014adbde393ed0655d41fc7bf841946f39c3dab0153515908624d84730130c37
+ size 12584454
teddy/data_processing/utils/gene_mapping/data/2407_mouse_gene_mapping.txt ADDED
The diff for this file is too large to render. See raw diff
 
teddy/data_processing/utils/gene_mapping/data/human_mapping.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e56e9aa46a7bddca6b769af5548e0553c5b14c3c8c0b534122a44f52bc960b82
+ size 22122103
teddy/data_processing/utils/gene_mapping/data/mouse_to_human_orthologs.one2one.txt ADDED
The diff for this file is too large to render. See raw diff
 
teddy/data_processing/utils/gene_mapping/gene_mapper.py ADDED
@@ -0,0 +1,629 @@
+ """
+ Module: gene_mapper.py
+
+ This module provides utilities for mapping gene identifiers between human and mouse datasets,
+ as well as handling orthology relationships. It is designed to process gene expression data
+ and map gene IDs to standardized formats for downstream analysis.
+
+ Main Features:
+ - Map human and mouse gene IDs to a common reference format.
+ - Handle orthology relationships to convert mouse gene symbols to human gene symbols.
+ - Combine mapping results from multiple sources and flag discrepancies.
+ - Transform wide-format gene data into long-format for easier processing.
+ - Categorize gene mappings based on their relationships (e.g., one-to-one, one-to-many).
+
+ Dependencies:
+ - pandas: For data manipulation.
+ - numpy: For numerical operations.
+ - warnings: For handling warnings during processing.
+
+ Usage:
+ - Import the functions and use them to map gene IDs or process gene data.
+ - Run the script directly to execute test cases for the implemented functions.
+
+ Why:
+ - This module is essential for harmonizing gene identifiers across datasets, enabling
+ consistent analysis of gene expression data from different species or sources.
+ """
+
+ import warnings
+
+ import numpy as np
+ import pandas as pd
+
+ # import re
+
+
+ def map_mouse_human(data_frame, query_column, human_map_db, mouse_map_db, orthology_db, verbose=False):
+     """
+     Maps gene IDs from a dataset to human and mouse reference databases, and resolves orthology relationships.
+
+     Args:
+         data_frame (pd.DataFrame): Input data containing gene IDs to map.
+         query_column (str): Column name in the input data containing gene IDs.
+         human_map_db (pd.DataFrame): Reference database for human gene mapping.
+         mouse_map_db (pd.DataFrame): Reference database for mouse gene mapping.
+         orthology_db (pd.DataFrame): Database containing orthology relationships between mouse and human genes.
+         verbose (bool): Whether to print detailed logs during processing.
+
+     Returns:
+         pd.DataFrame: A combined mapping result with discrepancies flagged.
+     """
+     if verbose:
+         print("------------ map human gene ids ------------")
+     mapped_hsap = map_genes(
+         expr_mat=data_frame,
+         expr_ids=query_column,
+         annot_mat=human_map_db,
+         annot_from="id",
+         annot_to="reference_id",
+         return_unmapped=True,
+         keep_prev_ids=True,
+         verbose=verbose,
+     )
+
+     if verbose:
+         print("------------ map mouse gene ids ------------")
+     mapped_mus = map_genes(
+         expr_mat=data_frame,
+         expr_ids=query_column,
+         annot_mat=mouse_map_db,
+         annot_from="id",
+         annot_to="reference_id",
+         return_unmapped=True,
+         keep_prev_ids=True,
+         verbose=verbose,
+     )
+
+     if verbose:
+         print("------------ mouse to human orthologs ------------")
+     mouse_hsap = orthologs_to_human(
+         mouse_df=mapped_mus,
+         mouse_col="reference_id",
+         orthology_df=orthology_db,
+         ortho_mouse_col="mouse_gene_symbol",
+         ortho_human_col="human_gene_symbol",
+         ortho_type_col="mouse_homology_type",
+         orthology_type="ortholog_one2one",
+     )
+
+     mouse_hsap = mouse_hsap.loc[:, ["previous_ids", "human_gene_symbol"]].drop_duplicates()
+     mouse_hsap = mouse_hsap.rename(columns={"human_gene_symbol": "reference_id"})
+
+     if verbose:
+         print("------------ combine results ------------")
+     both_mapped = combine_dataframe_columns(
+         df1=mapped_hsap, df2=mouse_hsap, id_column="previous_ids", reference_id_column="reference_id", verbose=verbose
+     )
+     both_mapped = both_mapped.loc[:, ["previous_ids", "reference_id", "discrepancy_flag"]].drop_duplicates()
+
+     return both_mapped
+
+
+ def map_mouse_human2(data_frame, query_column, human_map_db, mouse_map_db, orthology_db, verbose=False):
+     """
+     Variant of map_mouse_human: mouse genes without a one-to-one human ortholog keep their original ID
+     as reference_id instead of being left unmapped. Takes the same arguments and returns the same columns.
+     """
+     if verbose:
+         print("------------ map human gene ids ------------")
+     mapped_hsap = map_genes(
+         expr_mat=data_frame,
+         expr_ids=query_column,
+         annot_mat=human_map_db,
+         annot_from="id",
+         annot_to="reference_id",
+         return_unmapped=True,
+         keep_prev_ids=True,
+         verbose=verbose,
+     )
+
+     if verbose:
+         print("------------ map mouse gene ids ------------")
+     mapped_mus = map_genes(
+         expr_mat=data_frame,
+         expr_ids=query_column,
+         annot_mat=mouse_map_db,
+         annot_from="id",
+         annot_to="reference_id",
+         return_unmapped=True,
+         keep_prev_ids=True,
+         verbose=verbose,
+     )
+
+     if verbose:
+         print("------------ mouse to human orthologs ------------")
+     mouse_hsap = orthologs_to_human(
+         mouse_df=mapped_mus,
+         mouse_col="reference_id",
+         orthology_df=orthology_db,
+         ortho_mouse_col="mouse_gene_symbol",
+         ortho_human_col="human_gene_symbol",
+         ortho_type_col="mouse_homology_type",
+         orthology_type="ortholog_one2one",
+     )
+
+     ## Testing confirms that this filtering step produces the same result as the script below, which uses ENSMUSG IDs to fill the NAs left by orthologs that are not one2one.
+     ## Without the filter, discrepancies arise when combining the two data frames, so this step is required.
+
+     ## Filter on mouse gene symbol: if it is not mapped, the input was not a mouse gene (or not a mouse gene that can be mapped).
+     ## The alternative is to filter on ENSMUSG, but that only works if the input list contains Ensembl gene IDs; other IDs will not be matched.
+     if verbose:
+         print(mouse_hsap.shape)
+     mouse_hsap_filt = mouse_hsap.loc[
+         (mouse_hsap.previous_ids.str.contains("ENSMUS")) | (~mouse_hsap.mouse_gene_symbol.isnull()), :
+     ]
+     # mouse_hsap_remainder=mouse_hsap.loc[~((mouse_hsap.previous_ids.str.contains('ENSMUS')) | (~mouse_hsap.mouse_gene_symbol.isnull())),:]
+     if verbose:
+         print(mouse_hsap_filt.shape)
+         # (mouse_hsap_remainder)
+     # Copy so that the assignments below modify an independent frame rather than a slice
+     mouse_hsap = mouse_hsap_filt.copy()
+
+     ## Convert all human gene symbols to NA if they are not one2one orthologs
+     mouse_hsap.loc[mouse_hsap["mouse_homology_type"] != "ortholog_one2one", "human_gene_symbol"] = pd.NA
+
+     if verbose:
+         print("\n=========\tcount missing\t=========")
+         print(sum(mouse_hsap.human_gene_symbol.isnull()))
+     # Fill missing human gene symbols with the original ID (an ENSMUSG ID for Ensembl input)
+     mouse_hsap["human_gene_symbol"] = mouse_hsap["human_gene_symbol"].fillna(mouse_hsap["previous_ids"])
+
+     if verbose:
+         print(sum(mouse_hsap.human_gene_symbol.str.contains("ENSMUSG")))
+
+     if verbose:
+         print("\n=========\tdoes not contain ENSMUSG\t=========")
+         print(mouse_hsap["previous_ids"][~mouse_hsap["previous_ids"].str.contains("ENSMUSG")].shape)
+         print(mouse_hsap["human_gene_symbol"][~mouse_hsap["human_gene_symbol"].str.contains("ENSMUSG")].shape)
+
+         print("\n=========\tcount missing\t=========")
+         print(sum(mouse_hsap.human_gene_symbol.isnull()))
+
+     mouse_hsap = mouse_hsap.loc[:, ["previous_ids", "human_gene_symbol"]].drop_duplicates()
+     mouse_hsap = mouse_hsap.rename(columns={"human_gene_symbol": "reference_id"})
+
+     if verbose:
+         print("------------ combine results ------------")
+     both_mapped = combine_dataframe_columns(
+         df1=mapped_hsap, df2=mouse_hsap, id_column="previous_ids", reference_id_column="reference_id", verbose=verbose
+     )
+     both_mapped = both_mapped.loc[:, ["previous_ids", "reference_id", "discrepancy_flag"]].drop_duplicates()
+
+     return both_mapped
+
+
+ def combine_dataframe_columns(df1, df2, id_column, reference_id_column, verbose=True):
+     """
+     Combines two dataframes by merging on a common ID column and flags discrepancies in reference IDs.
+
+     Args:
+         df1 (pd.DataFrame): First dataframe to merge.
+         df2 (pd.DataFrame): Second dataframe to merge.
+         id_column (str): Column name to merge on.
+         reference_id_column (str): Column name containing reference IDs.
+         verbose (bool): Whether to print detailed logs during processing.
+
+     Returns:
+         pd.DataFrame: A merged dataframe with discrepancies flagged.
+     """
+     # Standardize missing values by replacing empty strings with NaN
+     df1[reference_id_column] = df1[reference_id_column].replace("", pd.NA)
+     df2[reference_id_column] = df2[reference_id_column].replace("", pd.NA)
+
+     if verbose:
+         # Calculate and print the number of missing values in the reference_id columns of each dataframe
+         missing_df1 = df1[reference_id_column].isna().sum()
+         missing_df2 = df2[reference_id_column].isna().sum()
+         print(f"Missing values in {reference_id_column} of df1: {missing_df1}")
+         print(f"Missing values in {reference_id_column} of df2: {missing_df2}")
+
+     # Merge the dataframes on the specified 'id' column
+     merged_df = pd.merge(df1, df2, on=id_column, how="outer", suffixes=("_df1", "_df2"))
+
+     # Flag discrepancies where both reference IDs are present but do not match
+     merged_df["discrepancy_flag"] = np.where(
+         (merged_df[f"{reference_id_column}_df1"].notna())
+         & (merged_df[f"{reference_id_column}_df2"].notna())
+         & (merged_df[f"{reference_id_column}_df1"] != merged_df[f"{reference_id_column}_df2"]),
+         True,
+         False,
+     )
+
+     # Use numpy.where to combine the 'reference_id' columns, preferring non-null values from df1
+     merged_df[reference_id_column] = np.where(
+         merged_df[f"{reference_id_column}_df1"].notna(),
+         merged_df[f"{reference_id_column}_df1"],
+         merged_df[f"{reference_id_column}_df2"],
+     )
+
+     # Replace NaN with empty strings in the final dataframe
+     final_df = merged_df[
+         [id_column, reference_id_column, f"{reference_id_column}_df1", f"{reference_id_column}_df2", "discrepancy_flag"]
+     ].fillna("")
+
+     if verbose:
+         # Count unmapped entries in the final result. NaN was replaced by "" above,
+         # so count empty strings; isna() would always report 0 here.
+         missing_final = (final_df[reference_id_column] == "").sum()
+         print(f"Missing values in final merged {reference_id_column}: {missing_final}")
+
+     # Print a warning if there are any discrepancies
+     if final_df["discrepancy_flag"].any():
+         print("Warning: There are discrepancies in the reference IDs between the two dataframes.")
+
+     return final_df
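
The merge-and-flag logic of `combine_dataframe_columns` can be tried on toy data. The sketch below is a minimal, self-contained illustration, not the function itself: the IDs `g1` to `g4` and the reference symbols are made up, and `Series.where` is swapped in for the `np.where` call used above.

```python
import pandas as pd

# Hypothetical inputs: df1 plays the role of the human mapping,
# df2 the mouse-to-human ortholog mapping.
df1 = pd.DataFrame({"previous_ids": ["g1", "g2", "g3"], "reference_id": ["A", None, "C"]})
df2 = pd.DataFrame({"previous_ids": ["g2", "g3", "g4"], "reference_id": ["B", "X", "D"]})

# Outer merge keeps IDs that appear in either frame.
merged = pd.merge(df1, df2, on="previous_ids", how="outer", suffixes=("_df1", "_df2"))

left = merged["reference_id_df1"]
right = merged["reference_id_df2"]

# A discrepancy requires both sides to be present and to differ.
merged["discrepancy_flag"] = left.notna() & right.notna() & (left != right)

# Prefer the df1 value; fall back to df2 where df1 is missing.
merged["reference_id"] = left.where(left.notna(), right)

result = merged.sort_values("previous_ids").reset_index(drop=True)
print(result[["previous_ids", "reference_id", "discrepancy_flag"]])
```

Only `g3` is flagged, because it is the one ID for which both frames supply a value and the values disagree; `g2` and `g4` are simply filled from `df2`.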
250
+
251
+
252
+ def orthologs_to_human(
253
+ mouse_df,
254
+ orthology_df,
255
+ mouse_col,
256
+ ortho_mouse_col,
257
+ ortho_human_col,
258
+ ortho_type_col,
259
+ orthology_type="ortholog_one2one",
260
+ ):
261
+ """
262
+ Merges a mouse data_processing frame with an orthology data_processing frame to convert mouse gene symbols to human gene symbols.
263
+
264
+ Parameters:
265
+ - mouse_df: pd.DataFrame - The data_processing frame containing mouse gene symbols.
266
+ - orthology_df: pd.DataFrame - The data_processing frame containing orthology information.
267
+ - mouse_col: str - The column name in the mouse_df that contains mouse gene symbols.
268
+ - ortho_mouse_col: str - The column name in the orthology_df that contains mouse gene symbols.
269
+ - ortho_human_col: str - The column name in the orthology_df that contains human gene symbols.
270
+ - ortho_type_col: str - The column name in the orthology_df that contains the orthology type.
271
+ - orthology_type: str - The type of orthology to keep (default is 'ortholog_one2one').
272
+
273
+ Returns:
274
+ - merged_df: pd.DataFrame - The merged data_processing frame with human gene symbols included.
275
+ """
276
+
277
+ # Check if the specified orthology type exists in the orthology dataframe
278
+ unique_ortho_types = orthology_df[ortho_type_col].unique()
279
+
280
+ if orthology_type not in unique_ortho_types:
281
+ print(f"Error: Specified orthology type '{orthology_type}' not found.")
282
+ print("Available orthology types are:", unique_ortho_types)
283
+ return None
284
+
285
+ # Filter the orthology dataframe based on the specified orthology type
286
+ filtered_orthology_df = orthology_df[orthology_df[ortho_type_col] == orthology_type]
287
+
288
+ # Merge the mouse dataframe with the filtered orthology dataframe
289
+ merged_df = mouse_df.merge(
290
+ filtered_orthology_df[[ortho_mouse_col, ortho_human_col, ortho_type_col]],
291
+ left_on=mouse_col,
292
+ right_on=ortho_mouse_col,
293
+ how="left",
294
+ )
295
+
296
+ return merged_df
297
+
298
+
299
+ # Example usage:
300
+ # merged_df = orthologs_to_human(mouse_df, orthology_df, 'mouse_gene_column', 'ortho_mouse_gene_column', 'ortho_human_gene_column', 'orthology_type_column', 'ortholog_one2one')
301
+
302
+
303
+ def preprocess_wide_to_long(df, reference_id, sep="|", keep_id_type=True):
304
+ """
305
+ Transforms the given DataFrame into a long format table where one specified column represents reference IDs
306
+ and all the entries from the other columns, including the specified column, are put into the second column.
307
+ Entries separated by a specified separator are split into individual values. Removes any duplicate values.
308
+ Handles NaN values appropriately by skipping them and removes rows with NaN in the reference_id column.
309
+
310
+ Args:
311
+ df (pd.DataFrame): The input DataFrame with gene information.
312
+ reference_id (str): The column name to be used as the reference identifier.
313
+ sep (str): The separator used to split entries in the ID columns.
314
+ keep_id_type (bool): Whether to keep the id_type column in the final output.
315
+
316
+ Returns:
317
+ pd.DataFrame: The transformed long format DataFrame with split values.
318
+ """
319
+ # Check for duplicate column names
320
+ if df.columns.duplicated().any():
321
+ raise ValueError("Duplicate column names detected in the DataFrame.")
322
+
323
+ # Remove rows where reference_id is NaN
324
+ initial_row_count = df.shape[0]
325
+ df = df.dropna(subset=[reference_id])
326
+ final_row_count = df.shape[0]
327
+
328
+ if initial_row_count != final_row_count:
329
+ print(
330
+ f"Removed {initial_row_count - final_row_count} rows with NaN in '{reference_id}'. {final_row_count} rows remain."
331
+ )
332
+ else:
333
+ print("No rows with NaN in the reference_id were found.")
334
+
335
+ # Check for duplicate values in reference_id column
336
+ if df[reference_id].duplicated().any():
337
+ print(
338
+ f"Warning: Duplicate values found in the '{reference_id}' column. This may cause issues with the transformation."
339
+ )
340
+
341
+ long_format_data = []
342
+
343
+ # Process each column except the reference_id
344
+ for col in df.columns:
345
+ if col != reference_id:
346
+ # Convert numeric columns to string
347
+ if pd.api.types.is_numeric_dtype(df[col]):
348
+ df[col] = df[col].astype(str)
349
+ # Split the values by the separator and create a new DataFrame for each column
350
+ exploded_df = df[[reference_id, col]].dropna().assign(**{col: df[col].str.split(sep)})
351
+ exploded_df = exploded_df.explode(col)
352
+ exploded_df["id_type"] = col
353
+ exploded_df = exploded_df.rename(columns={col: "id"})
354
+ long_format_data.append(exploded_df)
355
+
356
+ # Concatenate all the long format DataFrames
357
+ long_df = pd.concat(long_format_data)
358
+
359
+ # Add the reference_id as its own column
360
+ reference_id_df = df[[reference_id]].dropna()
361
+ reference_id_df["id_type"] = reference_id
362
+ reference_id_df["id"] = reference_id_df[reference_id]
363
+ long_df = pd.concat([long_df, reference_id_df], ignore_index=True)
364
+
365
+ # Rename the reference_id column to "reference_id"
366
+ long_df = long_df.rename(columns={reference_id: "reference_id"})
367
+
368
+ # Drop duplicate values
369
+ long_df.drop_duplicates(inplace=True)
370
+
371
+ if not keep_id_type:
372
+ # Drop the id_type column and remove duplicates based only on 'id' and 'reference_id'
373
+ long_df = long_df.drop(columns=["id_type"]).drop_duplicates()
374
+
375
+ # Reorder the columns
376
+ columns_order = ["id", "reference_id"] if not keep_id_type else ["id", "id_type", "reference_id"]
377
+ long_df = long_df[columns_order]
378
+
379
+ return long_df
380
+
381
+
382
+ def categorise_mapping(df, ids_from_col, ids_to_col):
383
+ # Calculate the occurrences of each id and each gene_name
384
+ id_counts = df[ids_from_col].value_counts()
385
+ gene_counts = df[ids_to_col].value_counts()
386
+
387
+ # Map the counts back to the dataframe
388
+ df["id_count"] = df[ids_from_col].map(id_counts)
389
+ df["gene_count"] = df[ids_to_col].map(gene_counts)
390
+
391
+ # Determine match type based on counts
392
+ conditions = [(df["id_count"] > 1) & (df["gene_count"] > 1), (df["id_count"] > 1), (df["gene_count"] > 1)]
393
+ choices = ["many2many", "one2many", "many2one"]
394
+ df["match_type"] = np.select(conditions, choices, default="one2one")
395
+
396
+ # Drop the temporary columns used for counts
397
+ df.drop(columns=["id_count", "gene_count"], inplace=True)
398
+
399
+ return df
400
+
401
+
402
+ def remove_whitespace(series):
403
+ # return series.astype(str).str.replace(r'^\s+|\s+$', '', regex=True)
404
+ return series.astype(str).str.strip()
405
+
406
+
407
+ def unlist(nested_list):
408
+ """
409
+ Recursively flattens a nested list.
410
+
411
+ Args:
412
+ nested_list (list): A list that may contain nested lists.
413
+
414
+ Returns:
415
+ list: A flattened list.
416
+ """
417
+ flattened = []
418
+ for item in nested_list:
419
+ if isinstance(item, list):
420
+ flattened.extend(unlist(item))
421
+ else:
422
+ flattened.append(item)
423
+ return flattened
424
+
425
+
426
+ def map_genes(
427
+ expr_mat,
428
+ expr_ids=None,
429
+ annot_mat=None,
430
+ annot_from="id",
431
+ annot_to="hgnc_symbol",
432
+ return_unmapped=False,
433
+ verbose=True,
434
+ error=False,
435
+ keep_prev_ids=False,
436
+ ):
437
+ """TODO: The code currently breaks when expr_mat already has a column called referene_id. This is because the mapped = pd.merge(...) does not merge the reference_id columns. Try to fix this."""
438
+
439
+ if expr_ids is not None:
440
+ expr_mat = expr_mat.rename(columns={expr_ids: "previous_ids"})
441
+ expr_ids = "previous_ids"
442
+
443
+ if expr_ids is None:
444
+ expr_ids = "previous_ids"
445
+ expr_mat[expr_ids] = expr_mat.index
446
+
447
+ with warnings.catch_warnings():
448
+ warnings.simplefilter(action="ignore", category=pd.errors.SettingWithCopyWarning)
449
+ # Remove any whitespace - trailing or otherwise
450
+ expr_mat[expr_ids] = remove_whitespace(expr_mat[expr_ids])
451
+
452
+ if verbose:
453
+ print("\n [ gene ID mapping ] \n")
454
+ print(
455
+ f"\tdataset contains : {len(expr_mat['previous_ids'])} ids, of which unique: {len(expr_mat['previous_ids'].unique())} - {round(len(expr_mat['previous_ids'].unique()) / len(expr_mat['previous_ids']) * 100, 1)}%"
456
+ )
457
+
458
+ # Remove any missing ids
459
+ missing_genes = expr_mat[expr_mat[expr_ids].isin([None, "", "nan"])]
460
+ if not missing_genes.empty:
461
+ if verbose:
462
+ print(f"\tfound {len(missing_genes)} missing ids", list(missing_genes[expr_ids].unique())[:5])
463
+ expr_mat = expr_mat[~expr_mat[expr_ids].isin([None, "", "nan"])]
464
+
465
+ # Check for ids that are already mapped (i.e. already present in the target column)
466
+ premapped = expr_mat[expr_mat["previous_ids"].isin(annot_mat[annot_to])]
467
+ premapped.loc[:, annot_to] = premapped["previous_ids"]
468
+
469
+ if verbose:
470
+ print(
471
+ f'\n\texpr_mat - of {len(expr_mat["previous_ids"].unique())} ids {len(premapped["previous_ids"].unique())} - {round(len(premapped["previous_ids"].unique()) / len(expr_mat["previous_ids"].unique()) * 100, 3)}% directly map to annot_mat${annot_to}\n'
472
+ )
473
+
474
+ # Map using exact match
475
+ unmapped_hgnc = expr_mat[~expr_mat["previous_ids"].isin(premapped["previous_ids"])]
476
+ if unmapped_hgnc.empty:
477
+ if keep_prev_ids:
478
+ return premapped.drop_duplicates()
479
+ return premapped.drop(columns=["previous_ids"], errors="ignore").drop_duplicates()
480
+
481
+ mapped = pd.merge(
482
+ expr_mat[~expr_mat["previous_ids"].isin(premapped["previous_ids"])],
483
+ annot_mat[[annot_from, annot_to]].drop_duplicates(),
484
+ left_on="previous_ids",
485
+ right_on=annot_from,
486
+ how="inner",
487
+ )
488
+
489
+ mapped = pd.concat([mapped, premapped if not premapped.empty else None])
490
+
491
+ # Map the remainder using lowercase
492
+ remap = expr_mat[~expr_mat["previous_ids"].isin(mapped["previous_ids"])]
493
+ remap.loc[:, "previous_ids"] = remap["previous_ids"].str.lower()
494
+
495
+ reannot = annot_mat[[annot_from, annot_to]].drop_duplicates()
496
+ reannot[annot_from] = reannot[annot_from].str.lower()
497
+
498
+ remap = pd.merge(remap, reannot, left_on="previous_ids", right_on=annot_from, how="inner")
499
+
500
+ mapped = pd.concat([mapped, remap]).drop_duplicates()
501
+
502
+ dups = mapped[mapped.duplicated(subset=[annot_to], keep=False)][annot_to].unique()
503
+ uniq = mapped[~mapped[annot_to].isin(dups)][annot_to].unique()
504
+
505
+ if verbose:
506
+ print(f'\tone2one: {len(uniq)}\t{", ".join(uniq[:5])}')
507
+ print(f'\tmany2one: {len(dups)}\t{", ".join(dups[:5])}')
508
+
509
+ unmapped = expr_mat["previous_ids"][
510
+ ~expr_mat["previous_ids"].str.lower().isin(mapped["previous_ids"].str.lower())
511
+ ].unique()
512
+
513
+ if verbose:
514
+ print(f'\n\tunmapped genes: {len(unmapped)}\t:: {", ".join(unmapped[:5])}\n')
515
+ print("\n\n")
516
+
517
+ result = mapped
518
+
519
+ if return_unmapped:
520
+ unmapped_expr_mat = expr_mat[expr_mat["previous_ids"].isin(unmapped)]
521
+ if not unmapped_expr_mat.empty:
522
+ unmapped_expr_mat.loc[:, annot_to] = ""
523
+ result = pd.concat([result, unmapped_expr_mat])
524
+
525
+ result = result.loc[:, result.columns.isin(unlist([list(expr_mat.columns.values), annot_to]))]
526
+
527
+ if keep_prev_ids:
528
+ return result.drop_duplicates()
529
+ return result.drop(columns=["previous_ids"], errors="ignore").drop_duplicates()
530
+
531
+
532
+ ##========================================================================================================================
533
+ ##========== Test functions ================================================================================
534
+ ##========================================================================================================================
535
+
536
+
537
+ def test_transform_function():
538
+ """
539
+ Test case for the preprocess_wide_to_long function using a toy example.
540
+ """
541
+ data = {
542
+ "Gene stable ID": ["ID1|ID2", "ID3", "ID4|ID5"],
543
+ "Gene stable ID version": ["ID1.1", "ID3.1", None],
544
+ "Gene Synonym": ["Syn1", None, "Syn4"],
545
+ "Gene name": ["GeneA", "GeneB", "GeneC"],
546
+ }
547
+
548
+ df = pd.DataFrame(data)
549
+
550
+ expected_data = {
551
+ "id": ["ID1", "ID2", "ID1.1", "Syn1", "GeneA", "ID3", "ID3.1", "GeneB", "ID4", "ID5", "Syn4", "GeneC"],
552
+ "id_type": [
553
+ "Gene stable ID",
554
+ "Gene stable ID",
555
+ "Gene stable ID version",
556
+ "Gene Synonym",
557
+ "Gene name",
558
+ "Gene stable ID",
559
+ "Gene stable ID version",
560
+ "Gene name",
561
+ "Gene stable ID",
562
+ "Gene stable ID",
563
+ "Gene Synonym",
564
+ "Gene name",
565
+ ],
566
+ "reference_id": [
567
+ "GeneA",
568
+ "GeneA",
569
+ "GeneA",
570
+ "GeneA",
571
+ "GeneA",
572
+ "GeneB",
573
+ "GeneB",
574
+ "GeneB",
575
+ "GeneC",
576
+ "GeneC",
577
+ "GeneC",
578
+ "GeneC",
579
+ ],
580
+ }
581
+
582
+ expected_df = pd.DataFrame(expected_data)
583
+
584
+ # Transform the DataFrame
585
+ long_df = preprocess_wide_to_long(df, "Gene name")
586
+
587
+ # Sort the DataFrame for comparison
588
+ long_df = long_df.sort_values(by=["id", "id_type", "reference_id"]).reset_index(drop=True)
589
+ expected_df = expected_df.sort_values(by=["id", "id_type", "reference_id"]).reset_index(drop=True)
590
+
591
+ # Check if the transformed DataFrame matches the expected DataFrame
592
+ assert long_df.equals(expected_df), "test_transform_function\t\t- did not produce expected result"
593
+
594
+ print("test_transform_function\t\t- passed")
595
+
596
+
597
+ # Run tests
598
+ def test_categorise_function():
599
+ mapping_test_data = {
600
+ "ids": ["id1", "id2", "id3", "id4", "id1", "id5"],
601
+ "gene_names": ["gene1", "gene2", "gene3", "gene3", "gene4", "gene5"],
602
+ "expected_match_type": ["one2many", "one2one", "many2one", "many2one", "one2many", "one2one"],
603
+ }
604
+
605
+ mapping_test_data = pd.DataFrame(mapping_test_data)
606
+
607
+ test_data = {
608
+ "ids": ["id1", "id2", "id3", "id4", "id1", "id5"],
609
+ "gene_names": ["gene1", "gene2", "gene3", "gene3", "gene4", "gene5"],
610
+ }
611
+
612
+ df_test = pd.DataFrame(test_data)
613
+
614
+ print("\nRunning optimized version:")
615
+ annotated_df_optimized = categorise_mapping(df_test.copy(), "ids", "gene_names")
616
+ print(annotated_df_optimized)
617
+
618
+ # Verify the results
619
+ assert (
620
+ annotated_df_optimized["match_type"].tolist() == mapping_test_data["expected_match_type"].tolist()
621
+ ), "Optimized version failed"
622
+
623
+ print("\ntest_categorise_function\t\t- passed")
624
+
625
+
626
+ # Only run the tests if this script is executed directly (not imported)
627
+ if __name__ == "__main__":
628
+ test_transform_function()
629
+ test_categorise_function()
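Note on `categorise_mapping`: the function labels each row by counting how often its source ID and its target gene name occur in the table. The rule can be checked without pandas using the dependency-free sketch below. `classify_pairs` is a hypothetical helper name that does not exist in the repository, and `collections.Counter` stands in for `value_counts` and `np.select`.

```python
from collections import Counter


def classify_pairs(pairs):
    """Label each (source_id, gene_name) pair the way categorise_mapping does.

    Hypothetical stand-in for the pandas implementation: a pair is
    'one2many' when its source id repeats, 'many2one' when its gene name
    repeats, 'many2many' when both repeat, and 'one2one' otherwise.
    """
    id_counts = Counter(src for src, _ in pairs)
    gene_counts = Counter(gene for _, gene in pairs)

    labels = []
    for src, gene in pairs:
        id_repeats = id_counts[src] > 1
        gene_repeats = gene_counts[gene] > 1
        if id_repeats and gene_repeats:
            labels.append("many2many")
        elif id_repeats:
            labels.append("one2many")
        elif gene_repeats:
            labels.append("many2one")
        else:
            labels.append("one2one")
    return labels


# Same toy data as test_categorise_function in the diff above.
pairs = [
    ("id1", "gene1"), ("id2", "gene2"), ("id3", "gene3"),
    ("id4", "gene3"), ("id1", "gene4"), ("id5", "gene5"),
]
labels = classify_pairs(pairs)
```

The order of the checks matters in the same way as the order of `conditions` in the original: the combined case has to be tested first, otherwise a many-to-many row would be reported as `one2many`.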
teddy/data_processing/utils/medians/data/teddy_gene_medians.json ADDED
The diff for this file is too large to render. See raw diff
 
teddy/models/.DS_Store ADDED
Binary file (6.15 kB). View file
 
teddy/models/__init__.py ADDED
File without changes
teddy/models/classification_heads.py ADDED
@@ -0,0 +1,285 @@
1
+ """
2
+ Module: classification_heads.py
3
+
4
+ This module defines various classification and decoder heads for use in transformer-based models,
5
+ specifically tailored for single-cell biology tasks. These heads are designed to handle tasks such as
6
+ classification, regression, and expression value prediction, and they integrate seamlessly with
7
+ transformer architectures.
8
+
9
+ Main Features:
10
+ - **ClsDecoder**: A simple decoder for classification tasks, supporting multiple layers and activations.
11
+ - **ClassificationHead**: A RoBERTa-style classification head for downstream tasks.
12
+ - **ClassificationHeadAnalysis**: An extended classification head that provides intermediate hidden states for analysis.
13
+ - **ClsDecoderAnalysis**: A classification decoder with support for hidden state extraction.
14
+ - **TrainingHead**: A dense layer with activation and normalization for training tasks.
15
+ - **AnnotationDecoderHead**: A lightweight decoder for annotation tasks with simplified weight initialization.
16
+ - **ExprDecoder**: A decoder for predicting gene expression values, with optional explicit zero probability prediction.
17
+ - **AffineExprDecoder**: A decoder for predicting gene expression values in an affine form (Ax + b), with support for
18
+ advanced features like adaptive bias and explicit zero probabilities.
19
+
20
+ Dependencies:
21
+ - PyTorch: For defining and training neural network components.
22
+ - Transformers: For activation functions and integration with transformer-based models.
23
+
24
+ Usage:
25
+ Import the desired classification or decoder head into your model:
26
+ ```python
27
+ from teddy.models.classification_heads import ClsDecoder, ClassificationHead
28
+ ```
29
+ """
30
+
31
+ from typing import Dict, Optional
32
+
33
+ import torch
34
+ import torch.nn as nn
35
+ from torch import Tensor
36
+ from transformers.activations import ACT2FN
37
+
38
+
39
+ class ClsDecoder(nn.Module): # taken from scGPT. Delete when not needed any more.
40
+ """
41
+ Decoder for classification task.
42
+ """
43
+
44
+ def __init__(
45
+ self,
46
+ d_model: int,
47
+ n_cls: int,
48
+ nlayers: int = 1,
49
+ activation: callable = nn.ReLU,
50
+ ):
51
+ super().__init__()
52
+ # module list
53
+ self._decoder = nn.ModuleList()
54
+ for _i in range(nlayers - 1):
55
+ self._decoder.append(nn.Linear(d_model, d_model))
56
+ self._decoder.append(activation())
57
+ self._decoder.append(nn.LayerNorm(d_model))
58
+ self.out_layer = nn.Linear(d_model, n_cls)
59
+
60
+ def forward(self, x: Tensor) -> Dict[str, Tensor]:
61
+ """
62
+ Args:
63
+ x: Tensor, shape [batch_size, embsize]
64
+ """
65
+ for layer in self._decoder:
66
+ x = layer(x)
67
+ return {"output": self.out_layer(x)}
68
+
69
+
70
+ class ClassificationHead(nn.Module):
71
+ """RoBERTa-style classification head"""
72
+
73
+ def __init__(self, config, n_cls, nlayers):
74
+ super().__init__()
75
+ self._decoder = nn.ModuleList()
76
+ self.activation = nn.ReLU() if config.layer_activation == "relu" else nn.GELU()
77
+
78
+ for _i in range(nlayers):
79
+ self._decoder.append(nn.Dropout(config.dropout))
80
+ self._decoder.append(nn.Linear(config.d_model, config.d_model))
81
+ self._decoder.append(self.activation)
82
+ self._decoder.append(nn.Dropout(config.dropout))
83
+ self._decoder.append(nn.Linear(config.d_model, n_cls))
84
+
85
+ def forward(self, x):
86
+ for module in self._decoder:
87
+ x = module(x)
88
+ return {"output": x}
89
+
90
+
91
+ class ClassificationHeadAnalysis(nn.Module):
92
+ """RoBERTa-style classification head"""
93
+
94
+ def __init__(self, config, n_cls, nlayers):
95
+ super().__init__()
96
+ self.dropout = nn.Dropout(config.dropout)
97
+ self._decoder = nn.ModuleList()
98
+ self.activation = nn.ReLU() if config.layer_activation == "relu" else nn.GELU()
99
+
100
+ for _i in range(nlayers):
101
+ self._decoder.append(self.dropout)
102
+ self._decoder.append(nn.Linear(config.d_model, config.d_model))
103
+ self._decoder.append(self.activation)
104
+ self._decoder.append(self.dropout)
105
+ self._decoder.append(nn.Linear(config.d_model, n_cls))
106
+
107
+ def forward(self, x):
108
+ hidden_states = []
109
+ for module in self._decoder:
110
+ x = module(x)
111
+ if isinstance(module, nn.Linear):
112
+ hidden_states.append(x)
113
+ return {"output": x, "hidden_states": hidden_states}
114
+
115
+
116
+ class ClsDecoderAnalysis(nn.Module):
117
+ """
118
+ Decoder for classification task.
119
+ """
120
+
121
+ def __init__(
122
+ self,
123
+ d_model: int,
124
+ n_cls: int,
125
+ nlayers: int = 3,
126
+ activation: callable = nn.ReLU,
127
+ ):
128
+ super().__init__()
129
+ # module list
130
+ self._decoder = nn.ModuleList()
131
+ for _i in range(nlayers - 1):
132
+ self._decoder.append(nn.Linear(d_model, d_model))
133
+ self._decoder.append(activation())
134
+ self._decoder.append(nn.LayerNorm(d_model))
135
+ self.out_layer = nn.Linear(d_model, n_cls)
136
+
137
+ def forward(self, x: Tensor) -> dict:
138
+ """
139
+ Args:
140
+ x: Tensor, shape [batch_size, embsize]
141
+ """
142
+ hidden_states = []
143
+ for layer in self._decoder:
144
+ x = layer(x)
145
+ hidden_states.append(x)
146
+ return {"output": self.out_layer(x), "hidden_states": hidden_states}
147
+
148
+
149
+ class TrainingHead(nn.Module):
150
+ def __init__(self, config):
151
+ super().__init__()
152
+ self.dense = nn.Linear(config.d_model, config.d_model)
153
+ self.activation = ACT2FN[config.layer_activation]
154
+ self.LayerNorm = nn.LayerNorm(config.d_model, config.layer_norm_eps)
155
+
156
+ def forward(self, hidden_states):
157
+ hidden_states = self.dense(hidden_states)
158
+ hidden_states = self.activation(hidden_states)
159
+ hidden_states = self.LayerNorm(hidden_states)
160
+ return hidden_states
161
+
162
+
163
+ class AnnotationDecoderHead(nn.Linear):
164
+ """Small class to make weight initialization easier"""
165
+
166
+ def __init__(self, d_model, n_token):
167
+ super().__init__(d_model, n_token, bias=False)
168
+
169
+
170
+ class ExprDecoder(nn.Module):
171
+ def __init__(
172
+ self,
173
+ d_model: int,
174
+ explicit_zero_prob: bool = False,
175
+ use_batch_labels: bool = False,
176
+ ):
177
+ super().__init__()
178
+ d_in = d_model * 2 if use_batch_labels else d_model
179
+ self.fc = nn.Sequential(
180
+ nn.Linear(d_in, d_model),
181
+ nn.LeakyReLU(),
182
+ nn.Linear(d_model, d_model),
183
+ nn.LeakyReLU(),
184
+ nn.Linear(d_model, 1),
185
+ )
186
+ self.explicit_zero_prob = explicit_zero_prob
187
+ if explicit_zero_prob:
188
+ self.zero_logit = nn.Sequential(
189
+ nn.Linear(d_in, d_model),
190
+ nn.LeakyReLU(),
191
+ nn.Linear(d_model, d_model),
192
+ nn.LeakyReLU(),
193
+ nn.Linear(d_model, 1),
194
+ )
195
+
196
+ def forward(self, x: Tensor, values: Tensor = None) -> Dict[str, Tensor]:
197
+ """x is the output of the transformer, (batch, seq_len, d_model)"""
198
+ pred_value = self.fc(x).squeeze(-1) # (batch, seq_len)
199
+
200
+ if not self.explicit_zero_prob:
201
+ return {"pred": pred_value}
202
+ zero_logits = self.zero_logit(x).squeeze(-1) # (batch, seq_len)
203
+ zero_probs = torch.sigmoid(zero_logits)
204
+ return {"pred": pred_value, "zero_probs": zero_probs}
205
+ # TODO: note that the return currently is only for training. Since decoder
206
+ # is not used in the test setting for the integration task, the evaluation/inference
207
+ # logic is not implemented yet. However, remember to implement it when
208
+ # the decoder is used in any test setting. The inference logic will need
209
+ # to sample from the bernoulli distribution with the zero_probs.
210
+
211
+
212
+ class AffineExprDecoder(nn.Module):
213
+ def __init__(
214
+ self,
215
+ d_model: int,
216
+ explicit_zero_prob: bool = False,
217
+ activation: Optional[str] = None,
218
+ tanh_coeff: bool = False,
219
+ adaptive_bias: bool = False,
220
+ ):
221
+ """
222
+ Predict the expression value of each gene in an affine like form of Ax + b.
223
+ This decoder uses two ExprDecoder instances internally to generate the coefficient A and bias b.
224
+
225
+ Args:
226
+ d_model: The embedding dimension.
227
+ explicit_zero_prob: If True, predict the probability of each gene being
228
+ zero.
229
+ activation: The activation function for the coefficient A and bias b.
230
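+ tanh_coeff: If True, use tanh activation for the coefficient A (currently has no effect; the corresponding code in forward is commented out).
231
+ adaptive_bias: If True, scale the bias b by the mean of the non-zero input values of each cell.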
232
+ """
233
+ super().__init__()
234
+ self.explicit_zero_prob = explicit_zero_prob
235
+ self.tanh_coeff = tanh_coeff
236
+ self.adaptive_bias = adaptive_bias
237
+ self.coeff_decoder = ExprDecoder(d_model, explicit_zero_prob=explicit_zero_prob)
238
+ self.bias_decoder = ExprDecoder(d_model, explicit_zero_prob=explicit_zero_prob)
239
+ self.activation = activation
240
+
241
+ if activation is not None:
242
+ # Normalize activation name to lowercase for flexibility
243
+ activation = activation.lower()
244
+ # Mapping of known activation functions
245
+ activations_map = {
246
+ "gelu": "GELU",
247
+ "relu": "ReLU",
248
+ "tanh": "Tanh",
249
+ "sigmoid": "Sigmoid",
250
+ }
251
+ assert activation in activations_map, f"Unknown activation: {activation}"
252
+ assert hasattr(nn, activations_map[activation]), f"Unknown activation: {activation}"
253
+ self.activation = getattr(nn, activations_map[activation])()
254
+
255
+ def forward(self, x: Tensor, values: Tensor) -> Dict[str, Tensor]:
256
+ """
257
+ Args:
258
+ x: Tensor, shape [batch_size, seq_len, embsize]
259
+ values: Tensor, shape [batch_size, seq_len]
260
+
261
+ Returns:
262
+ Dict with "pred" of shape [batch_size, seq_len], plus "zero_probs" when explicit_zero_prob is True
263
+ """
264
+ coeff = self.coeff_decoder(x)
265
+ bias = self.bias_decoder(x)
266
+
267
+ if self.activation is not None:
268
+ coeff["pred"] = self.activation(coeff["pred"])
269
+ bias["pred"] = self.activation(bias["pred"])
270
+
271
+ # if self.tanh_coeff:
272
+ # coeff["pred"] = 1 + torch.tanh(coeff["pred"])
273
+
274
+ if self.adaptive_bias:
275
+ # bias["pred"] = bias["pred"] * values.mean(dim=1, keepdim=True)
276
+ non_zero_value_mean = values.sum(dim=1, keepdim=True) / (values != 0).sum(dim=1, keepdim=True)
277
+ bias["pred"] = bias["pred"] * non_zero_value_mean
278
+
279
+ if self.explicit_zero_prob:
280
+ return {
281
+ "pred": coeff["pred"] * values + bias["pred"],
282
+ "zero_probs": coeff["zero_probs"],
283
+ }
284
+
285
+ return {"pred": coeff["pred"] * values + bias["pred"]}
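Note on `AffineExprDecoder.forward`: the prediction is `coeff * values + bias`, and with `adaptive_bias=True` the bias is first multiplied by the mean of the non-zero input values in each row. A row whose values are all zero would make that mean `0 / 0`. The sketch below works through the arithmetic for a single cell in plain Python rather than PyTorch. `affine_prediction` is a hypothetical helper, and the fallback scale of 0.0 for an all-zero row is an assumption made for illustration, not behaviour of the committed code.

```python
def affine_prediction(coeff, bias, values, adaptive_bias=False):
    """Per-gene affine prediction coeff[i] * values[i] + bias[i] for one cell.

    Plain-Python illustration (hypothetical helper). With adaptive_bias the
    bias is multiplied by the mean of the non-zero entries of `values`;
    an all-zero row falls back to a scale of 0.0 instead of dividing by zero.
    """
    scale = 1.0
    if adaptive_bias:
        n_nonzero = sum(1 for v in values if v != 0)
        scale = sum(values) / n_nonzero if n_nonzero else 0.0
    return [a * v + b * scale for a, v, b in zip(coeff, values, bias)]


values = [0.0, 2.0, 4.0]  # mean of the non-zero entries is (2 + 4) / 2 = 3
coeff = [1.0, 0.5, 2.0]
bias = [0.1, 0.1, 0.1]

plain = affine_prediction(coeff, bias, values)
adaptive = affine_prediction(coeff, bias, values, adaptive_bias=True)
all_zero = affine_prediction(coeff, bias, [0.0, 0.0, 0.0], adaptive_bias=True)
```

In the tensor version a comparable guard could be applied to the denominator before dividing; whether that is appropriate depends on how all-zero rows should be treated during training.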
teddy/models/model_directory.py ADDED
@@ -0,0 +1,53 @@
1
+ """
2
+ Module: model_directory.py
3
+
4
+ This module provides a centralized directory for managing and accessing different model architectures
5
+ used in the TEDDY project. It defines a dictionary of supported models and their configurations,
6
+ allowing for easy integration and dynamic loading of models based on their names or paths.
7
+
8
+ Main Features:
9
+ - **model_dict**: A dictionary mapping model names to their corresponding classes, configurations,
10
+ and masking keys. This enables seamless switching between different model architectures.
11
+ - **get_architecture**: A utility function to retrieve the architecture name from a model's configuration file.
12
+
13
+ Dependencies:
14
+ - json: For loading model configuration files.
15
+ - os: For handling file paths.
16
+ - teddy.models.teddy_g.model: For importing the `TeddyGModel`, `TeddyGConfig`, and `TeddyGModelAnalysis` classes.
17
+
18
+ Usage:
19
+ 1. Access a model and its configuration from the `model_dict`:
20
+ ```python
21
+ model_info = model_dict["TeddyGModel"]
22
+ model_cls = model_info["model_cls"]
23
+ config_cls = model_info["config_cls"]
24
+ ```
25
+ 2. Retrieve the architecture name from a model's configuration file:
26
+ ```python
27
+ architecture = get_architecture(model_name_or_path)
28
+ ```
29
+ """
30
+
31
+ import json
32
+ import os
33
+
34
+ from teddy.models.teddy_g.model import (
35
+ TeddyGConfig,
36
+ TeddyGModel,
37
+ TeddyGModelAnalysis,
38
+ )
39
+
40
+ model_dict = {
41
+ "TeddyGModel": {"model_cls": TeddyGModel, "config_cls": TeddyGConfig, "masking_key": "gene_ids"},
42
+ "TeddyGModelAnalysis": {
43
+ "model_cls": TeddyGModelAnalysis,
44
+ "config_cls": TeddyGConfig,
45
+ "masking_key": "gene_ids",
46
+ },
47
+ }
48
+
49
+
50
+ def get_architecture(model_name_or_path):
51
+ with open(os.path.join(model_name_or_path, "config.json")) as f:
52
+ config = json.load(f)
53
+ return config["architectures"][0]
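Note on `model_directory.py`: the two pieces are meant to be used together, with `get_architecture` reading the architecture name from `config.json` and that name then serving as the key into `model_dict`. The sketch below runs that round trip against a temporary directory. `get_architecture` is copied from the diff, while `registry` is a hypothetical stand-in for `model_dict` so that the example does not need the `teddy` package.

```python
import json
import os
import tempfile


def get_architecture(model_name_or_path):
    # Same logic as in model_directory.py: first entry of "architectures".
    with open(os.path.join(model_name_or_path, "config.json")) as f:
        config = json.load(f)
    return config["architectures"][0]


# Hypothetical stand-in for model_dict (the real one also holds the classes).
registry = {
    "TeddyGModel": {"masking_key": "gene_ids"},
    "TeddyGModelAnalysis": {"masking_key": "gene_ids"},
}

with tempfile.TemporaryDirectory() as model_dir:
    # Write a minimal config.json like the one shipped under teddy_g/160M.
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        json.dump({"architectures": ["TeddyGModel"]}, f)
    architecture = get_architecture(model_dir)

model_info = registry[architecture]
```

An architecture name that is missing from the dictionary raises `KeyError` at the lookup, so callers that accept arbitrary checkpoints may want to check membership first.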
teddy/models/teddy_g/.DS_Store ADDED
Binary file (6.15 kB). View file
 
teddy/models/teddy_g/160M/added_tokens.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "<cls>": 43811,
3
+ "<mask>": 43812,
4
+ "<pad>": 43810,
5
+ "<sep>": 43809,
6
+ "<unk>": 43808
7
+ }
teddy/models/teddy_g/160M/config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "architectures": [
3
+ "TeddyGModel"
4
+ ],
5
+ "cls_loss": false,
6
+ "annotation_loss_weight": null,
7
+ "modeling_loss_weight": null,
8
+ "d_hid": 3072,
9
+ "d_model": 768,
10
+ "dropout": 0.02,
11
+ "gradient_checkpointing": false,
12
+ "initializer_range": 0.02,
13
+ "layer_activation": "gelu",
14
+ "mask_token": "<mask>",
15
+ "mask_token_id": 1,
16
+ "masking_loss": false,
17
+ "max_position_embeddings": 2048,
18
+ "n_cls": 0,
19
+ "n_layers_cls": 0,
20
+ "nheads": 12,
21
+ "nlayers": 12,
22
+ "ntoken": 43840,
23
+ "pad_token_id": -100,
24
+ "pre_norm": false,
25
+ "torch_dtype": "float32"
26
+ }
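Note on the 160M `config.json`: a few relationships between the values can be checked by hand, such as the per-head dimension and whether the special-token ids from `added_tokens.json` fit inside `ntoken`. The sketch below does that arithmetic on values copied from the two files. It assumes that `ntoken` is the size of the embedding table and that `d_model` is split evenly across `nheads`; neither assumption is confirmed by this diff.

```python
import json

# Values copied from teddy/models/teddy_g/160M/config.json (subset).
config = json.loads("""
{"d_hid": 3072, "d_model": 768, "nheads": 12, "nlayers": 12, "ntoken": 43840}
""")

# Values copied from teddy/models/teddy_g/160M/added_tokens.json.
added_tokens = {"<cls>": 43811, "<mask>": 43812, "<pad>": 43810, "<sep>": 43809, "<unk>": 43808}

# Per-head width, assuming d_model is divided evenly across attention heads.
head_dim, remainder = divmod(config["d_model"], config["nheads"])

# Ratio of the feed-forward width to the model width.
ffn_ratio = config["d_hid"] / config["d_model"]

# Every special-token id should index a valid row if ntoken is the table size.
tokens_in_range = all(0 <= idx < config["ntoken"] for idx in added_tokens.values())

# Number of ids between the largest special token and ntoken.
ids_above_specials = config["ntoken"] - (max(added_tokens.values()) + 1)
```

The same config lists `"mask_token_id": 1`, while `added_tokens.json` assigns `<mask>` the id 43812. The diff does not show which of the two the model actually uses, so it is worth confirming against the tokenizer before relying on either value.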