The `tokenize_for_model.py` script is designed to tokenize gene expression data for use in our models. It takes in processed gene expression data, applies various tokenization techniques, and prepares it for training or inference.

# General Workflow

The script follows these main steps (a simplified sketch follows the list):

0. **Load Tokenization Arguments**: The script starts by loading the tokenization arguments from a configuration file or dictionary.
1. **Load Gene Tokenizer**: It loads a pre-trained gene tokenizer based on the provided tokenization arguments.
2. **Load AnnData**: The script reads the gene expression data from an AnnData file.
3. **Check Genes in Tokenizer**: It verifies that the genes in the dataset are present in the tokenizer's vocabulary.
4. **Build Token Array**: The script constructs a token array for the genes in the dataset.
5. **Convert Processed Layer to Dense**: It converts the processed layer of the AnnData object to a dense matrix.
6. **Tokenize in Batches**: The script processes the data in batches, applying tokenization and optional binning or ranking.
7. **Save Tokenized Data**: Finally, the script saves the tokenized data to disk.
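The sketch below illustrates these steps with `anndata` and `numpy`. It is a minimal, hypothetical example, not the actual interface of `tokenize_for_model.py`: the vocabulary format (a JSON file mapping gene IDs to token IDs), the layer name `"processed"`, and the function names are illustrative assumptions.

```python
# Hypothetical sketch of the workflow; names and file formats are illustrative,
# not the actual interface of tokenize_for_model.py.
import json

import anndata as ad
import numpy as np
from scipy import sparse


def tokenize_adata(adata_path, vocab_path, gene_id_column="gene_id",
                   max_seq_len=2048, batch_size=512):
    # 1. Load a gene "tokenizer" (here: a plain gene-ID -> token-ID mapping).
    with open(vocab_path) as f:
        vocab = json.load(f)

    # 2. Load the AnnData file containing the processed expression data.
    adata = ad.read_h5ad(adata_path)

    # 3. Check which genes in the dataset are present in the vocabulary.
    gene_ids = adata.var[gene_id_column].astype(str).values
    in_vocab = np.array([g in vocab for g in gene_ids])

    # 4. Build the token array for the genes that are in the vocabulary.
    gene_tokens = np.array([vocab[g] for g in gene_ids[in_vocab]], dtype=np.int64)

    # 5. Convert the processed layer to a dense matrix, keeping only known genes.
    layer = adata.layers.get("processed", adata.X)
    dense = layer.toarray() if sparse.issparse(layer) else np.asarray(layer)
    dense = dense[:, in_vocab]

    # 6. Tokenize in batches: for each cell, keep the top-k most expressed genes.
    tokenized = []
    for start in range(0, dense.shape[0], batch_size):
        batch = dense[start:start + batch_size]
        for cell in batch:
            order = np.argsort(cell)[::-1][:max_seq_len]  # highest expression first
            tokenized.append({"input_ids": gene_tokens[order],
                              "values": cell[order]})

    # 7. Saving to disk (e.g. as sharded files) would follow here.
    return tokenized
```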
# Tokenization Arguments

The script uses several tokenization arguments to control its behavior. Here is an explanation of each argument and the steps it influences (an illustrative configuration is sketched after the list):

- `max_seq_len`
  - Description: Specifies the maximum sequence length for the tokenized data.
  - Impact: Determines the number of genes to include in each tokenized sequence (cell). If `add_cls` is enabled, the sequence length is reduced by one to accommodate the CLS token.
- `add_cls`
  - Description: Indicates whether to prepend a CLS token to each sequence.
  - Impact: If enabled, a CLS token is added to the beginning of each sequence, and the sequence length is adjusted accordingly.
- `cls_token_id`
  - Description: The token ID to use for the CLS token.
  - Impact: If `add_cls` is enabled, this token ID is used for the CLS token.
- `random_genes`
  - Description: Specifies whether to select a random subset of genes before applying top-k selection.
  - Impact: If enabled, a random subset of genes is selected for each batch, and the top-k values are then determined from this subset.
- `include_zero_genes`
  - Description: Indicates whether to include zero-expression genes in the tokenized data.
  - Impact: If enabled, zero-expression genes are included in the tokenized sequences; otherwise, they are filtered out.
- `bins`
  - Description: Specifies the number of bins to use for binning expression values.
  - Impact: If set, the script bins the expression values into the specified number of bins. This argument is only relevant for TEDDY-X.
- `continuous_rank`
  - Description: Indicates whether to rank expression values continuously.
  - Impact: If enabled, the script ranks the expression values in the range [-1, 1]. This argument is only relevant for TEDDY-X.
- `gene_seed`
  - Description: A random seed for reproducibility.
  - Impact: If set, the script uses this seed to ensure reproducible random operations.
- `gene_id_column`
  - Description: The column name in the AnnData object that contains gene IDs.
  - Impact: The script uses this column to match genes in the dataset against the tokenizer's vocabulary.
- `label_column`
  - Description: The column name in the AnnData object that contains classification labels.
  - Impact: If set, the script adds these labels to the tokenized data.
- `bio_annotations`
  - Description: Indicates whether to add biological annotations to the tokenized data.
  - Impact: If enabled, the script adds annotations such as disease, tissue, cell type, and sex to the tokenized data.
- `disease_mapping`, `tissue_mapping`, `cell_mapping`, `sex_mapping`
  - Description: File paths to JSON files containing mappings for biological annotations.
  - Impact: The script uses these mappings to convert biological annotations to token IDs.
- `add_disease_annotation`
  - Description: Indicates whether to override labels with disease annotations.
  - Impact: If enabled, the script overrides the labels with disease annotations.
- `max_shard_samples`
  - Description: The maximum number of samples per shard when saving the tokenized data.
  - Impact: The script splits the tokenized data into shards with the specified maximum number of samples.
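To make the interplay between these options concrete, here is an illustrative set of tokenization arguments as they might appear in a configuration dictionary. The keys mirror the argument names above, but the values, file paths, and the dictionary format itself are assumptions for illustration, not a verbatim configuration from the repository.

```python
# Illustrative tokenization arguments; values and format are assumptions,
# not a verbatim configuration from the repository.
tokenization_args = {
    "max_seq_len": 2048,          # genes per cell (2047 expression tokens + CLS)
    "add_cls": True,
    "cls_token_id": 0,
    "random_genes": False,        # take top-k directly, no random pre-subsampling
    "include_zero_genes": False,  # drop zero-expression genes before top-k
    "bins": 51,                   # TEDDY-X only: bin expression values
    "continuous_rank": False,     # TEDDY-X only: rank values in [-1, 1] instead
    "gene_seed": 42,
    "gene_id_column": "gene_id",
    "label_column": "cell_type",
    "bio_annotations": True,
    "disease_mapping": "mappings/disease.json",
    "tissue_mapping": "mappings/tissue.json",
    "cell_mapping": "mappings/cell.json",
    "sex_mapping": "mappings/sex.json",
    "add_disease_annotation": False,
    "max_shard_samples": 100_000,
}
```

Since `bins` and `continuous_rank` describe two different value encodings for TEDDY-X, presumably only one of them would be set for a given run.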