The `tokenize_for_model.py` script tokenizes gene expression data for use in our models. It takes in processed expression data, applies various tokenization techniques, and prepares it for training or inference.
# General Workflow
The script follows these main steps (a minimal code sketch follows the list):
1. **Load Tokenization Arguments**: The script starts by loading the tokenization arguments from a configuration file or dictionary.
2. **Load Gene Tokenizer**: It loads a pre-trained gene tokenizer based on the provided tokenization arguments.
3. **Load AnnData**: The script reads the gene expression data from an AnnData file.
4. **Check Genes in Tokenizer**: It verifies that the genes in the dataset are present in the tokenizer's vocabulary.
5. **Build Token Array**: The script constructs a token array for the genes in the dataset.
6. **Convert Processed Layer to Dense**: It converts the processed layer of the AnnData object to a dense matrix.
7. **Tokenize in Batches**: The script processes the data in batches, applying tokenization and optional binning or ranking.
8. **Save Tokenized Data**: Finally, the script saves the tokenized data to disk.
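A minimal sketch of this flow, assuming a plain `dict` vocabulary in place of the real gene tokenizer, a `gene_id` column, and a `processed` layer name; none of these names are the script's actual internals:

```python
import anndata as ad
import numpy as np

vocab = {"ENSG00000139618": 4, "ENSG00000141510": 5}  # stand-in tokenizer vocab (steps 1-2)

adata = ad.read_h5ad("processed.h5ad")                # step 3: load AnnData
gene_ids = adata.var["gene_id"].to_numpy()            # assumed gene_id_column value

missing = [g for g in gene_ids if g not in vocab]     # step 4: vocabulary check
if missing:
    raise ValueError(f"{len(missing)} genes are missing from the tokenizer vocabulary")

token_ids = np.array([vocab[g] for g in gene_ids])    # step 5: token array

layer = adata.layers["processed"]                     # step 6: densify the processed layer
dense = layer.toarray() if hasattr(layer, "toarray") else np.asarray(layer)

batch_size = 512                                      # step 7: tokenize in batches
for start in range(0, dense.shape[0], batch_size):
    batch = dense[start:start + batch_size]
    # per-cell gene selection, binning/ranking, and CLS handling happen here
    # (see the argument sketches below)

# step 8: write tokenized shards to disk (see the max_shard_samples sketch)
```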
# Tokenization Arguments
The script uses several tokenization arguments to control its behavior. Here is an explanation of each argument and the steps it influences; short illustrative sketches for several arguments follow the list:
- `max_seq_len`
  - Description: Specifies the maximum sequence length for the tokenized data.
  - Impact: Determines the number of genes to include in each tokenized sequence (cell). If `add_cls` is enabled, the sequence length available for genes is reduced by one to accommodate the CLS token.
- `add_cls`
  - Description: Indicates whether to prepend a CLS token to each sequence.
  - Impact: If enabled, a CLS token is added to the beginning of each sequence, and the sequence length is adjusted accordingly.
- `cls_token_id`
  - Description: The token ID to use for the CLS token.
  - Impact: If `add_cls` is enabled, this token ID is used for the CLS token.
- `random_genes`
  - Description: Specifies whether to select a random subset of genes before applying top-k selection.
  - Impact: If enabled, a random subset of genes is selected for each batch, and the top-k values are then determined from this subset.
- `include_zero_genes`
  - Description: Indicates whether to include zero-expression genes in the tokenized data.
  - Impact: If enabled, zero-expression genes are included in the tokenized sequences. Otherwise, they are filtered out.
- `bins`
  - Description: Specifies the number of bins to use for binning expression values.
  - Impact: If set, the script bins the expression values into the specified number of bins. This argument is only relevant for TEDDY-X.
- `continuous_rank`
  - Description: Indicates whether to rank expression values continuously.
  - Impact: If enabled, the script maps expression values to continuous ranks in the range [-1, 1]. This argument is only relevant for TEDDY-X.
- `gene_seed`
  - Description: A random seed for reproducibility.
  - Impact: If set, the script uses this seed to ensure reproducible random operations.
- `gene_id_column`
  - Description: The column name in the AnnData object that contains gene IDs.
  - Impact: The script uses this column to match genes in the dataset against the tokenizer's vocabulary.
- `label_column`
  - Description: The column name in the AnnData object that contains classification labels.
  - Impact: If set, the script adds these labels to the tokenized data.
- `bio_annotations`
  - Description: Indicates whether to add biological annotations to the tokenized data.
  - Impact: If enabled, the script adds annotations such as disease, tissue, cell type, and sex to the tokenized data.
- `disease_mapping`, `tissue_mapping`, `cell_mapping`, `sex_mapping`
  - Description: File paths to JSON files containing mappings for biological annotations.
  - Impact: The script uses these mappings to convert biological annotations to token IDs.
- `add_disease_annotation`
  - Description: Indicates whether to override labels with disease annotations.
  - Impact: If enabled, the script overrides the labels with disease annotations.
- `max_shard_samples`
  - Description: The maximum number of samples per shard when saving the tokenized data.
  - Impact: The script splits the tokenized data into shards with the specified maximum number of samples.
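The sketches below illustrate several of the arguments above. All function and variable names are illustrative, not taken from the script. First, one plausible way `max_seq_len`, `add_cls`, and `cls_token_id` interact when a single cell's sequence is assembled (the exact truncation order is an assumption):

```python
import numpy as np

def build_sequence(token_ids, max_seq_len, add_cls, cls_token_id):
    """Truncate one cell's gene tokens and optionally prepend a CLS token."""
    budget = max_seq_len - 1 if add_cls else max_seq_len  # reserve one slot for CLS
    seq = token_ids[:budget]
    if add_cls:
        seq = np.concatenate(([cls_token_id], seq))
    return seq

# 5 gene tokens, max_seq_len=4, CLS token id 1:
print(build_sequence(np.array([7, 9, 11, 13, 15]), 4, True, 1))  # -> [ 1  7  9 11]
```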
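A sketch of how `random_genes` and `include_zero_genes` could compose with top-k gene selection. The size of the random subset is an assumption, and `gene_seed` is shown seeding the generator:

```python
import numpy as np

def select_genes(expr, k, random_genes=False, include_zero_genes=False, gene_seed=None):
    """Return the indices of the genes kept for one cell."""
    rng = np.random.default_rng(gene_seed)          # gene_seed makes this reproducible
    idx = np.arange(expr.shape[0])
    if not include_zero_genes:
        idx = idx[expr[idx] > 0]                    # filter out zero-expression genes
    if random_genes:
        idx = rng.permutation(idx)[: 2 * k]         # assumed subset size, illustrative only
    return idx[np.argsort(expr[idx])[::-1][:k]]     # top-k by expression value

expr = np.array([0.0, 3.2, 0.0, 1.1, 5.4])
print(select_genes(expr, k=2))  # -> [4 1], the two highest nonzero values
```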
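For `bins`, a sketch assuming quantile binning over the nonzero values, with zeros kept in a reserved bin 0; the script's actual binning scheme may differ:

```python
import numpy as np

def bin_expression(values, n_bins):
    """Map nonzero expression values to integer bins 1..n_bins; zeros stay in bin 0."""
    binned = np.zeros_like(values, dtype=np.int64)
    nonzero = values > 0
    if nonzero.any():
        edges = np.quantile(values[nonzero], np.linspace(0, 1, n_bins + 1))
        binned[nonzero] = np.clip(np.digitize(values[nonzero], edges[1:-1]) + 1, 1, n_bins)
    return binned

print(bin_expression(np.array([0.0, 0.5, 1.0, 2.0, 8.0]), n_bins=3))  # -> [0 1 2 3 3]
```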
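For `continuous_rank`, a sketch of ranking a cell's expression values and rescaling the ranks linearly to [-1, 1]; the exact formula is an assumption:

```python
import numpy as np
from scipy.stats import rankdata

def rank_to_unit_interval(values):
    """Rank values (ties averaged) and rescale the ranks to [-1, 1]."""
    ranks = rankdata(values, method="average")  # ranks 1..n
    n = len(values)
    return 2.0 * (ranks - 1.0) / (n - 1.0) - 1.0 if n > 1 else np.zeros(n)

print(rank_to_unit_interval(np.array([0.1, 5.0, 2.0])))  # -> [-1.  1.  0.]
```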
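For `bio_annotations` and the `*_mapping` files, a sketch of converting one categorical annotation to token IDs. The mapping contents and values are invented for illustration; in the script the mapping would be loaded from the `disease_mapping` JSON path:

```python
import json

# Illustrative contents of a disease_mapping JSON file;
# in the script it would be loaded with json.load() from the configured path.
disease_map = {"normal": 20, "lung adenocarcinoma": 21}

cell_diseases = ["normal", "lung adenocarcinoma", "normal"]  # e.g. a per-cell obs column
disease_tokens = [disease_map[d] for d in cell_diseases]
print(disease_tokens)  # -> [20, 21, 20]
```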
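Finally, a sketch of splitting the output under `max_shard_samples`; the shard naming scheme and `.npy` format are assumptions, and the script's on-disk format may differ:

```python
import numpy as np

def iter_shards(samples, max_shard_samples):
    """Yield (shard_index, shard) chunks of at most max_shard_samples samples."""
    for i, start in enumerate(range(0, len(samples), max_shard_samples)):
        yield i, samples[start:start + max_shard_samples]

tokenized = np.arange(10)  # stand-in for 10 tokenized cells
for i, shard in iter_shards(tokenized, max_shard_samples=4):
    np.save(f"tokenized_shard_{i:05d}.npy", shard)  # assumed naming scheme
```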