The `tokenize_for_model.py` script tokenizes gene expression data for use in our models. It takes processed expression data as input, applies various tokenization techniques, and prepares the result for training or inference.
# General Workflow
The script follows these main steps (a sketch of the core logic follows the list):
0. **Load Tokenization Arguments**: The script starts by loading the tokenization arguments from a configuration file or dictionary.
1. **Load Gene Tokenizer**: It loads a pre-trained gene tokenizer based on the provided tokenization arguments.
2. **Load AnnData**: The script reads the gene expression data from an AnnData file.
3. **Check Genes in Tokenizer**: It verifies that the genes in the dataset are present in the tokenizer's vocabulary.
4. **Build Token Array**: The script constructs a token array for the genes in the dataset.
5. **Convert Processed Layer to Dense**: It converts the processed layer of the AnnData object to a dense matrix.
6. **Tokenize in Batches**: The script processes the data in batches, applying tokenization and optional binning or ranking.
7. **Save Tokenized Data**: Finally, the script saves the tokenized data to disk.
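
The core of steps 3-6 can be illustrated with a minimal, self-contained sketch. The toy vocabulary, gene IDs, and expression matrix below are stand-ins, and the logic is an approximation of what the script does, not its actual implementation:

```python
import numpy as np

vocab = {"<cls>": 0, "GENE_A": 1, "GENE_B": 2, "GENE_C": 3}  # toy vocabulary
genes = np.array(["GENE_A", "GENE_B", "GENE_C"])             # gene IDs from adata.var
expr = np.array([[0.0, 2.5, 1.1],                            # dense "processed" layer,
                 [3.2, 0.0, 0.4]])                           # shape (cells, genes)

# Step 3: check that every gene in the dataset is in the tokenizer vocabulary.
missing = [g for g in genes if g not in vocab]
assert not missing, f"genes missing from vocab: {missing}"

# Step 4: build a token array aligned with the gene axis.
token_array = np.array([vocab[g] for g in genes])

# Step 6 (core): per cell, keep the top-k expressed genes in descending
# order of expression, drop zeros, and prepend the CLS token.
max_seq_len, add_cls = 3, True
k = max_seq_len - 1 if add_cls else max_seq_len
for cell in expr:
    order = np.argsort(cell)[::-1][:k]   # indices of the top-k values
    order = order[cell[order] > 0]       # filter zero-expression genes
    tokens = token_array[order]
    if add_cls:
        tokens = np.concatenate(([vocab["<cls>"]], tokens))
    print(tokens)                        # e.g. [0 2 3] for the first cell
```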
# Tokenization Arguments
The script uses several tokenization arguments to control its behavior. Below is a hypothetical configuration for orientation, followed by an explanation of each argument and the steps it influences.
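
The keys below mirror the arguments documented in this section; every value is illustrative only:

```python
tokenization_args = {
    "max_seq_len": 2048,
    "add_cls": True,
    "cls_token_id": 0,
    "random_genes": False,
    "include_zero_genes": False,
    "bins": None,               # TEDDY-X only
    "continuous_rank": False,   # TEDDY-X only
    "gene_seed": 42,
    "gene_id_column": "ensembl_id",
    "label_column": "cell_type",
    "bio_annotations": True,
    "disease_mapping": "mappings/disease.json",
    "tissue_mapping": "mappings/tissue.json",
    "cell_mapping": "mappings/cell_type.json",
    "sex_mapping": "mappings/sex.json",
    "add_disease_annotation": False,
    "max_shard_samples": 100_000,
}
```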
- `max_seq_len`
- Description: Specifies the maximum sequence length for the tokenized data.
- Impact: Determines the number of genes to include in each tokenized sequence (cell). If `add_cls` is enabled, the usable length is reduced by one to accommodate the CLS token.
- `add_cls`
- Description: Indicates whether to prepend a CLS token to each sequence.
- Impact: If enabled, a CLS token is added to the beginning of each sequence, and the sequence length is adjusted accordingly.
- `cls_token_id`
- Description: The token ID to use for the CLS token.
- Impact: If `add_cls` is enabled, this token ID is used for the CLS token.
- `random_genes`
- Description: Specifies whether to select a random subset of genes before applying top-k selection.
- Impact: If enabled, a random subset of genes is selected for each batch, and the top-k values are then determined from that subset (see the sketch after this list).
- `include_zero_genes`
- Description: Indicates whether to include zero-expression genes in the tokenized data.
- Impact: If enabled, zero-expression genes are included in the tokenized sequences. Otherwise, they are filtered out.
- `bins`
- Description: Specifies the number of bins to use for binning expression values.
- Impact: If set, the script bins the expression values into the specified number of bins (see the sketch after this list). This argument is only relevant for TEDDY-X.
- `continuous_rank`
- Description: Indicates whether to rank expression values continuously.
- Impact: If enabled, the script ranks the expression values in the range [-1, 1] (see the sketch after this list). This argument is only relevant for TEDDY-X.
- `gene_seed`
- Description: A random seed for reproducibility.
- Impact: If set, the script uses this seed to ensure reproducible random operations.
- `gene_id_column`
- Description: The column name in the AnnData object that contains gene IDs.
- Impact: The script uses this column to match the dataset's genes against the tokenizer's vocabulary.
- `label_column`
- Description: The column name in the AnnData object that contains classification labels.
- Impact: If set, the script adds these labels to the tokenized data.
- `bio_annotations`
- Description: Indicates whether to add biological annotations to the tokenized data.
- Impact: If enabled, the script adds annotations such as disease, tissue, cell type, and sex to the tokenized data.
- `disease_mapping`, `tissue_mapping`, `cell_mapping`, `sex_mapping`
- Description: File paths to JSON files containing mappings for biological annotations.
- Impact: The script uses these mappings to convert biological annotations to token IDs (see the sketch after this list).
- `add_disease_annotation`
- Description: Indicates whether to override labels with disease annotations.
- Impact: If enabled, the script overrides the labels with disease annotations.
- `max_shard_samples`
- Description: The maximum number of samples per shard when saving the tokenized data.
- Impact: The script splits the tokenized data into shards containing at most the specified number of samples each (see the sketch after this list).
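# Argument Sketches
The sketches below illustrate selected arguments in isolation. All of them use toy data and hedged assumptions about the script's internals; none is the script's actual implementation.

With `random_genes` enabled, a random subset of genes is drawn first and the top-k selection happens within that subset. A minimal sketch, assuming a seeded NumPy generator (as with `gene_seed`):

```python
import numpy as np

rng = np.random.default_rng(42)                          # seeded, as with `gene_seed`
cell = np.array([0.0, 2.5, 1.1, 3.2, 0.4, 0.9])          # one cell's expression values
subset = rng.choice(len(cell), size=4, replace=False)    # random gene subset
topk = subset[np.argsort(cell[subset])[::-1][:3]]        # top-3 within that subset
print(topk)                                              # gene indices, highest first
```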
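For `bins`, one plausible scheme is quantile binning of the expression values; the exact edges the script uses are an assumption here:

```python
import numpy as np

values = np.array([0.1, 0.5, 1.2, 2.4, 3.3, 5.0])
bins = 4
edges = np.quantile(values, np.linspace(0, 1, bins + 1)[1:-1])  # interior bin edges
binned = np.digitize(values, edges)                             # bin index in [0, bins-1]
print(binned)                                                   # [0 0 1 2 3 3]
```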
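For `continuous_rank`, a simple way to rank values into [-1, 1] is to scale their sort ranks; the script's exact normalization is an assumption:

```python
import numpy as np

values = np.array([0.1, 0.5, 1.2, 2.4])
ranks = np.argsort(np.argsort(values))      # dense ranks 0 .. n-1
scaled = 2 * ranks / (len(values) - 1) - 1  # map lowest -> -1, highest -> 1
print(scaled)                               # [-1. -0.333 0.333 1.]
```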
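The four `*_mapping` files are JSON; assuming a flat `{label: token_id}` layout (an assumption, not a confirmed format), applying one looks like:

```python
import json
from io import StringIO

# Stand-in for reading one of the mapping JSON files, e.g. tissue_mapping.
mapping = json.load(StringIO('{"lung": 501, "liver": 502}'))
obs_tissue = ["lung", "liver", "lung"]            # stand-in for adata.obs["tissue"]
tissue_tokens = [mapping[t] for t in obs_tissue]  # annotation labels -> token IDs
print(tissue_tokens)                              # [501, 502, 501]
```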
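Finally, `max_shard_samples` caps shard size when writing to disk. The slicing below mirrors the idea, while the on-disk format is left unspecified:

```python
samples = list(range(10))                  # stand-in for tokenized cells
max_shard_samples = 4
shards = [samples[i : i + max_shard_samples]
          for i in range(0, len(samples), max_shard_samples)]
print([len(s) for s in shards])            # [4, 4, 2] -> three shards
```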