The `tokenize_for_model.py` script tokenizes gene expression data for use in our models. It takes in processed expression data, applies various tokenization techniques, and prepares it for training or inference.
# General Workflow
The script follows these main steps (a minimal code sketch follows the list):
1. **Load Tokenization Arguments**: The script starts by loading the tokenization arguments from a configuration file or dictionary.
2. **Load Gene Tokenizer**: It loads a pre-trained gene tokenizer based on the provided tokenization arguments.
3. **Load AnnData**: The script reads the gene expression data from an AnnData file.
4. **Check Genes in Tokenizer**: It verifies that the genes in the dataset are present in the tokenizer's vocabulary.
5. **Build Token Array**: The script constructs a token array for the genes in the dataset.
6. **Convert Processed Layer to Dense**: It converts the processed layer of the AnnData object to a dense matrix.
7. **Tokenize in Batches**: The script processes the data in batches, applying tokenization and optional binning or ranking.
8. **Save Tokenized Data**: Finally, the script saves the tokenized data to disk.
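A minimal sketch of this flow, assuming a plain `dict` vocabulary in place of the real gene tokenizer, a `gene_id` column, and a `processed` layer name; none of these names are the script's actual internals:

```python
import anndata as ad
import numpy as np

vocab = {"ENSG00000139618": 4, "ENSG00000141510": 5}  # stand-in tokenizer vocab (steps 1-2)

adata = ad.read_h5ad("processed.h5ad")                # step 3: load AnnData
gene_ids = adata.var["gene_id"].to_numpy()            # assumed gene_id_column value

missing = [g for g in gene_ids if g not in vocab]     # step 4: vocabulary check
if missing:
    raise ValueError(f"{len(missing)} genes are missing from the tokenizer vocabulary")

token_ids = np.array([vocab[g] for g in gene_ids])    # step 5: token array

layer = adata.layers["processed"]                     # step 6: densify the processed layer
dense = layer.toarray() if hasattr(layer, "toarray") else np.asarray(layer)

batch_size = 512                                      # step 7: tokenize in batches
for start in range(0, dense.shape[0], batch_size):
    batch = dense[start:start + batch_size]
    # per-cell gene selection, binning/ranking, and CLS handling happen here
    # (see the argument sketches below)

# step 8: write tokenized shards to disk (see the max_shard_samples sketch)
```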
# Tokenization Arguments
The script uses several tokenization arguments to control its behavior. Here is an explanation of each argument and the steps it influences; short illustrative sketches for several arguments follow the list:
- `max_seq_len`
  - Description: Specifies the maximum sequence length for the tokenized data.
  - Impact: Determines the number of genes to include in each tokenized sequence (cell). If `add_cls` is enabled, the sequence length available for genes is reduced by one to accommodate the CLS token.
- `add_cls`
  - Description: Indicates whether to prepend a CLS token to each sequence.
  - Impact: If enabled, a CLS token is added to the beginning of each sequence, and the sequence length is adjusted accordingly.
- `cls_token_id`
  - Description: The token ID to use for the CLS token.
  - Impact: If `add_cls` is enabled, this token ID is used for the CLS token.
- `random_genes`
  - Description: Specifies whether to select a random subset of genes before applying top-k selection.
  - Impact: If enabled, a random subset of genes is selected for each batch, and the top-k values are then determined from this subset.
- `include_zero_genes`
  - Description: Indicates whether to include zero-expression genes in the tokenized data.
  - Impact: If enabled, zero-expression genes are included in the tokenized sequences. Otherwise, they are filtered out.
- `bins`
  - Description: Specifies the number of bins to use for binning expression values.
  - Impact: If set, the script bins the expression values into the specified number of bins. This argument is only relevant for TEDDY-X.
- `continuous_rank`
  - Description: Indicates whether to rank expression values continuously.
  - Impact: If enabled, the script maps expression values to continuous ranks in the range [-1, 1]. This argument is only relevant for TEDDY-X.
- `gene_seed`
  - Description: A random seed for reproducibility.
  - Impact: If set, the script uses this seed to ensure reproducible random operations.
- `gene_id_column`
  - Description: The column name in the AnnData object that contains gene IDs.
  - Impact: The script uses this column to match genes in the dataset against the tokenizer's vocabulary.
- `label_column`
  - Description: The column name in the AnnData object that contains classification labels.
  - Impact: If set, the script adds these labels to the tokenized data.
- `bio_annotations`
  - Description: Indicates whether to add biological annotations to the tokenized data.
  - Impact: If enabled, the script adds annotations such as disease, tissue, cell type, and sex to the tokenized data.
- `disease_mapping`, `tissue_mapping`, `cell_mapping`, `sex_mapping`
  - Description: File paths to JSON files containing mappings for biological annotations.
  - Impact: The script uses these mappings to convert biological annotations to token IDs.
- `add_disease_annotation`
  - Description: Indicates whether to override labels with disease annotations.
  - Impact: If enabled, the script overrides the labels with disease annotations.
- `max_shard_samples`
  - Description: The maximum number of samples per shard when saving the tokenized data.
  - Impact: The script splits the tokenized data into shards with the specified maximum number of samples.
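The sketches below illustrate several of the arguments above. All function and variable names are illustrative, not taken from the script. First, one plausible way `max_seq_len`, `add_cls`, and `cls_token_id` interact when a single cell's sequence is assembled (the exact truncation order is an assumption):

```python
import numpy as np

def build_sequence(token_ids, max_seq_len, add_cls, cls_token_id):
    """Truncate one cell's gene tokens and optionally prepend a CLS token."""
    budget = max_seq_len - 1 if add_cls else max_seq_len  # reserve one slot for CLS
    seq = token_ids[:budget]
    if add_cls:
        seq = np.concatenate(([cls_token_id], seq))
    return seq

# 5 gene tokens, max_seq_len=4, CLS token id 1:
print(build_sequence(np.array([7, 9, 11, 13, 15]), 4, True, 1))  # -> [ 1  7  9 11]
```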
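A sketch of how `random_genes` and `include_zero_genes` could compose with top-k gene selection. The size of the random subset is an assumption, and `gene_seed` is shown seeding the generator:

```python
import numpy as np

def select_genes(expr, k, random_genes=False, include_zero_genes=False, gene_seed=None):
    """Return the indices of the genes kept for one cell."""
    rng = np.random.default_rng(gene_seed)          # gene_seed makes this reproducible
    idx = np.arange(expr.shape[0])
    if not include_zero_genes:
        idx = idx[expr[idx] > 0]                    # filter out zero-expression genes
    if random_genes:
        idx = rng.permutation(idx)[: 2 * k]         # assumed subset size, illustrative only
    return idx[np.argsort(expr[idx])[::-1][:k]]     # top-k by expression value

expr = np.array([0.0, 3.2, 0.0, 1.1, 5.4])
print(select_genes(expr, k=2))  # -> [4 1], the two highest nonzero values
```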
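For `bins`, a sketch assuming quantile binning over the nonzero values, with zeros kept in a reserved bin 0; the script's actual binning scheme may differ:

```python
import numpy as np

def bin_expression(values, n_bins):
    """Map nonzero expression values to integer bins 1..n_bins; zeros stay in bin 0."""
    binned = np.zeros_like(values, dtype=np.int64)
    nonzero = values > 0
    if nonzero.any():
        edges = np.quantile(values[nonzero], np.linspace(0, 1, n_bins + 1))
        binned[nonzero] = np.clip(np.digitize(values[nonzero], edges[1:-1]) + 1, 1, n_bins)
    return binned

print(bin_expression(np.array([0.0, 0.5, 1.0, 2.0, 8.0]), n_bins=3))  # -> [0 1 2 3 3]
```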
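For `continuous_rank`, a sketch of ranking a cell's expression values and rescaling the ranks linearly to [-1, 1]; the exact formula is an assumption:

```python
import numpy as np
from scipy.stats import rankdata

def rank_to_unit_interval(values):
    """Rank values (ties averaged) and rescale the ranks to [-1, 1]."""
    ranks = rankdata(values, method="average")  # ranks 1..n
    n = len(values)
    return 2.0 * (ranks - 1.0) / (n - 1.0) - 1.0 if n > 1 else np.zeros(n)

print(rank_to_unit_interval(np.array([0.1, 5.0, 2.0])))  # -> [-1.  1.  0.]
```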
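For `bio_annotations` and the `*_mapping` files, a sketch of converting one categorical annotation to token IDs. The mapping contents and values are invented for illustration; in the script the mapping would be loaded from the `disease_mapping` JSON path:

```python
import json

# Illustrative contents of a disease_mapping JSON file;
# in the script it would be loaded with json.load() from the configured path.
disease_map = {"normal": 20, "lung adenocarcinoma": 21}

cell_diseases = ["normal", "lung adenocarcinoma", "normal"]  # e.g. a per-cell obs column
disease_tokens = [disease_map[d] for d in cell_diseases]
print(disease_tokens)  # -> [20, 21, 20]
```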
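Finally, a sketch of splitting the output under `max_shard_samples`; the shard naming scheme and `.npy` format are assumptions, and the script's on-disk format may differ:

```python
import numpy as np

def iter_shards(samples, max_shard_samples):
    """Yield (shard_index, shard) chunks of at most max_shard_samples samples."""
    for i, start in enumerate(range(0, len(samples), max_shard_samples)):
        yield i, samples[start:start + max_shard_samples]

tokenized = np.arange(10)  # stand-in for 10 tokenized cells
for i, shard in iter_shards(tokenized, max_shard_samples=4):
    np.save(f"tokenized_shard_{i:05d}.npy", shard)  # assumed naming scheme
```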