---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- biomedical
- embeddings
- life-sciences
- scientific-text
- SODA-VEC
- EMBO
datasets:
- EMBO/soda-vec-data-full_pmc_title_abstract_paired
metrics:
- cosine-similarity
---

# VICReg Our Contrast Model

## Model Description

A SODA-VEC embedding model trained with the VICReg Our Contrast loss function. It uses L2-normalized embeddings with covariance, feature-correlation, and dot-product losses (including off-diagonal terms) to learn rich biomedical text representations.

This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

**Key Features:**
- Trained on **26.5M biomedical title-abstract pairs** from PubMed Central
- Based on **ModernBERT-base** architecture
- Optimized for **biomedical text similarity** and **semantic search**
- Produces **768-dimensional embeddings** with mean pooling

## Training Details

### Training Data

- **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
- **Size**: 26,473,900 training pairs
- **Source**: Complete PubMed Central baseline (July 2024)
- **Format**: Paired title-abstract examples optimized for contrastive learning (see the loading sketch after this list)
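
To peek at a few pairs without downloading all 26.5M rows, you can stream the dataset. This is a minimal sketch; the exact column names are not shown here, so check the dataset card for the schema:

```python
from datasets import load_dataset

# Stream the training split instead of downloading it in full
ds = load_dataset(
    "EMBO/soda-vec-data-full_pmc_title_abstract_paired",
    split="train",
    streaming=True,
)

# Print the first two title-abstract pairs
for example in ds.take(2):
    print(example)
```
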
### Training Procedure

**Loss Function**: VICReg Our Contrast: L2-normalized embeddings with a covariance loss, a cross-view feature-correlation loss, and a dot-product (sample similarity) loss over both diagonal and off-diagonal pairs

We implemented a series of changes relative to the original [VICReg paper from Meta](https://arxiv.org/pdf/2105.04906). The main differences are:

| Feature | Original VICReg | VICReg Our | VICReg Our Contrast |
|---------|-----------------|------------|---------------------|
| Normalization | No | Yes (L2-normalized) | Yes (L2-normalized) |
| Invariance (MSE) | Yes | No | No |
| Variance (hinge) | Yes | No | No |
| Covariance | Yes (unnormalized) | Yes (normalized) | Yes (normalized) |
| Feature correlation | No | Yes (cross-view) | Yes (cross-view) |
| Sample similarity | No | Yes (diagonal only) | Yes (diagonal + off-diagonal) |
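
As a rough illustration of how the three retained terms can combine, here is a minimal PyTorch sketch. It is an assumption-laden reading of the table above, not the exact training implementation; the function name, the per-term formulations, and how the coefficients are applied are all illustrative:

```python
import torch
import torch.nn.functional as F

def vicreg_our_contrast_loss(z_a, z_b, coeff_cov=1.0, coeff_feature=1.0, coeff_dot=1.0):
    """z_a, z_b: (batch, dim) embeddings of the two views (e.g. title / abstract)."""
    # L2-normalize both views, as in both "VICReg Our" variants
    z_a = F.normalize(z_a, p=2, dim=1)
    z_b = F.normalize(z_b, p=2, dim=1)
    n, d = z_a.shape

    # Covariance term: penalize off-diagonal covariance within each view
    def cov_term(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    l_cov = cov_term(z_a) + cov_term(z_b)

    # Cross-view feature-correlation term: matching feature dimensions
    # across views should agree (diagonal of the d x d cross matrix)
    cross = (z_a.T @ z_b) / n
    l_feature = (1.0 - torch.diag(cross)).pow(2).mean()

    # Dot-product (sample similarity) term: pull matched pairs together
    # (diagonal -> 1) and push mismatched pairs apart (off-diagonal -> 0)
    sim = z_a @ z_b.T
    target = torch.eye(n, device=sim.device)
    l_dot = (sim - target).pow(2).mean()

    return coeff_cov * l_cov + coeff_feature * l_feature + coeff_dot * l_dot

# Example with random stand-in embeddings for a batch of 8 pairs
z1, z2 = torch.randn(8, 768), torch.randn(8, 768)
print(vicreg_our_contrast_loss(z1, z2))
```
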
**Coefficients**: cov=1.0, feature=1.0, dot=1.0
**Base Model**: `answerdotai/ModernBERT-base`

**Training Configuration:**
- **GPUs**: 4
- **Batch Size per GPU**: 16
- **Gradient Accumulation**: 4
- **Effective Batch Size**: 256 (4 GPUs × 16 per GPU × 4 accumulation steps)
- **Learning Rate**: 2e-05
- **Warmup Steps**: 100
- **Pooling Strategy**: mean
- **Epochs**: 1 (full dataset pass)

**Training Command:**

```bash
python scripts/soda-vec-train.py --config vicreg_our_contrast --coeff_cov 1 --coeff_feature 1 --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
```

### Model Architecture

- **Base Architecture**: ModernBERT-base (22 layers, 768 hidden size)
- **Pooling**: Mean pooling over token embeddings
- **Output Dimension**: 768
- **Normalization**: L2-normalized embeddings (for VICReg-based models)

## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("EMBO/vicreg_our_contrast")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology",
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute cosine similarity between the two embeddings
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

### Using Hugging Face Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/vicreg_our_contrast")
model = AutoModel.from_pretrained("EMBO/vicreg_our_contrast")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology",
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over non-padding tokens only (a plain .mean(dim=1)
# would let padding tokens skew the embeddings)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Normalize (for VICReg models)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")
```

## Evaluation

The model has been evaluated on biomedical benchmarks including:

- **Journal-Category Classification**: Matching journals to bioRxiv subject categories
- **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs
- **Field-Specific Separability**: Distinguishing between different biological fields
- **Semantic Search**: Retrieval quality on biomedical text corpora

For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/source-data/soda-vec).

## Intended Use

This model is designed for:

- **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages (see the sketch after this list)
- **Scientific Text Similarity**: Computing similarity between biomedical texts
- **Information Retrieval**: Building search systems for scientific literature
- **Downstream Tasks**: As a base for fine-tuning on specific biomedical tasks
- **Research Applications**: Academic and research use in life sciences
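
A minimal semantic-search sketch; the mini-corpus and query below are made-up examples:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("EMBO/vicreg_our_contrast")

# Hypothetical mini-corpus of abstract-like sentences
corpus = [
    "Single-cell RNA sequencing reveals tumor heterogeneity.",
    "A deep learning model for protein structure prediction.",
    "CRISPR screens identify regulators of autophagy.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the query and rank the corpus by cosine similarity
query_embedding = model.encode("machine learning for protein folding", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```
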
## Limitations

- **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general-domain text
- **Language**: English only
- **Text Length**: Optimized for titles and abstracts; longer documents may require chunking (see the sketch after this list)
- **Bias**: Inherits biases from the training data (PubMed Central corpus)
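
One simple, purely illustrative way to handle long documents is to split them into overlapping word-based chunks, embed each chunk, and average; the chunk and overlap sizes below are arbitrary assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("EMBO/vicreg_our_contrast")

def embed_long_text(text, max_words=200, overlap=50):
    # Split into overlapping chunks of roughly abstract length
    words = text.split()
    step = max_words - overlap
    chunks = [
        " ".join(words[i:i + max_words])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
    # Average the chunk embeddings into one document vector
    return np.mean(model.encode(chunks), axis=0)
```
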
## Citation

If you use this model, please cite:

```bibtex
@software{soda_vec,
  title  = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year   = {2024},
  url    = {https://github.com/source-data/soda-vec}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/source-data/soda-vec).

---

**Model Card Generated**: 2025-11-10