---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- NeuML/bert-hash-pico
pipeline_tag: text-ranking
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
- stsbenchmark-sts
model-index:
- name: CrossEncoder based on NeuML/bert-hash-pico
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts test
      type: sts-test
    metrics:
    - type: pearson_cosine
      value: 0.7594692671867559
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.747410618220483
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev
      type: sts-dev
    metrics:
    - type: pearson_cosine
      value: 0.8216995594169731
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8226789104514981
      name: Spearman Cosine
---

# BERT Hash Cross-Encoder: Semantic Similarity (STS)

Cross-encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1. I've found the `cross-encoder/stsb-roberta-large` model very useful for building evaluators for LLM outputs. Cross-encoders are simple to use, fast, and very accurate.

BERT Hash uses a bucketing technique with a projection to shrink the embedding parameters (all models have fewer than 1M parameters). These models are very small and well suited to inference at the edge.

---

## Features

- **Performance:** Achieves **Pearson: 0.7595** and **Spearman: 0.7474** on the STS-Benchmark test set.
- **Efficient architecture:** Based on the BERT Hash model architecture, offering lightweight models.
- **Extended context length:** Processes sequences up to 8192 tokens, great for LLM output evals.
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

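
To make the bucketing-with-projection idea from the introduction more concrete, here is a minimal, generic sketch of a hashed embedding layer in plain Python. It is not the NeuML implementation: the modulo hash, the function names, and every size below are assumptions chosen for illustration. It only shows why a small bucket table plus a projection needs far fewer parameters than a full vocabulary-by-hidden-size table, at the cost of some tokens sharing a bucket.

```python
import random


def build_hashed_embedding(num_buckets, bucket_dim, hidden_dim, seed=0):
    """Create a small bucket table and a projection matrix (random init)."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.1, 0.1) for _ in range(bucket_dim)]
             for _ in range(num_buckets)]
    projection = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                  for _ in range(bucket_dim)]
    return table, projection


def embed(token_id, table, projection):
    """Look up the token's bucket vector and project it to the hidden size."""
    bucket = token_id % len(table)  # hypothetical hash: plain modulo
    small = table[bucket]
    hidden_dim = len(projection[0])
    return [sum(small[i] * projection[i][j] for i in range(len(small)))
            for j in range(hidden_dim)]


def parameter_counts(vocab_size, num_buckets, bucket_dim, hidden_dim):
    """Return (full_table_params, hashed_params) for comparison."""
    full = vocab_size * hidden_dim
    hashed = num_buckets * bucket_dim + bucket_dim * hidden_dim
    return full, hashed


# Illustrative sizes only; these are not the real bert-hash configuration.
table, projection = build_hashed_embedding(num_buckets=1024, bucket_dim=16,
                                           hidden_dim=128)
vector = embed(token_id=30000, table=table, projection=projection)
full, hashed = parameter_counts(vocab_size=30522, num_buckets=1024,
                                bucket_dim=16, hidden_dim=128)
print(len(vector), full, hashed)
```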

---

## Performance

| Model | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed |
|--------------------------------|--------------------|---------------------|----------------|------------|---------|
| `dleemiller/ModernCE-large-sts` | **0.9256** | **0.9215** | **8192** | 395M | **Medium** |
| `dleemiller/CrossGemma-sts-300m` | 0.9175 | 0.9135 | 2048 | 303M | **Medium** |
| `dleemiller/ModernCE-base-sts` | 0.9162 | 0.9122 | **8192** | 149M | **Fast** |
| `cross-encoder/stsb-roberta-large` | 0.9147 | - | 512 | 355M | Slow |
| `dleemiller/EttinX-sts-m` | 0.9143 | 0.9102 | **8192** | 149M | **Fast** |
| `dleemiller/NeoCE-sts` | 0.9124 | 0.9087 | 4096 | 250M | **Fast** |
| `dleemiller/EttinX-sts-s` | 0.9004 | 0.8926 | **8192** | 68M | **Very Fast** |
| `cross-encoder/stsb-distilroberta-base` | 0.8792 | - | 512 | 82M | Fast |
| `dleemiller/EttinX-sts-xs` | 0.8763 | 0.8689 | **8192** | 32M | **Very Fast** |
| `dleemiller/EttinX-sts-xxs` | 0.8414 | 0.8311 | **8192** | 17M | **Very Fast** |
| `dleemiller/sts-bert-hash-nano` | 0.7904 | 0.7743 | **8192** | 0.97M | **Very Fast** |
| `dleemiller/sts-bert-hash-pico` | 0.7595 | 0.7474 | **8192** | 0.45M | **Very Fast** |

---

## Usage

To use sts-bert-hash for semantic similarity tasks, load the model with the `sentence-transformers` library:

```python
from sentence_transformers import CrossEncoder

# Load the CrossEncoder model; trust_remote_code=True is required because
# the BERT Hash architecture ships custom modeling code.
model = CrossEncoder("dleemiller/sts-bert-hash-pico", trust_remote_code=True)

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

# Prints one float32 score per pair, in the form
# array([0.9184, 0.0123], dtype=float32); exact values depend on the model.
print(scores)
```

### Output

The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.

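
The Pearson and Spearman figures reported in this card measure how well predicted scores track gold similarity labels. As a rough sketch of that comparison, the plain-Python helpers below compute both correlations for a list of predicted scores and gold labels. The helper names and the sample numbers are hypothetical, and the gold labels are assumed to be on the same 0-1 scale as the model output. In practice the `predicted` list would come from `model.predict(...)`.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: stand-ins for model.predict(...) output and gold labels.
predicted = [0.92, 0.15, 0.60, 0.33]
gold = [1.0, 0.0, 0.8, 0.4]
print(f"Pearson:  {pearson(predicted, gold):.4f}")
print(f"Spearman: {spearman(predicted, gold):.4f}")
```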

---

## Training Details

### Pretraining

The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.

- **Classifier Dropout:** A somewhat large classifier dropout of 0.15, to reduce overreliance on teacher scores.
- **Objective:** STS-B scores from `dleemiller/ModernCE-large-sts`, used as the teacher scores.

### Fine-Tuning

Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.

### Validation Results

The model achieved the following test set performance after fine-tuning:

- **Pearson Correlation:** 0.7595
- **Spearman Correlation:** 0.7474

---

## Model Card

- **Architecture:** bert-hash-pico
- **Tokenizer:** Custom tokenizer trained with modern techniques for long-context handling.
- **Pretraining Data:** `dleemiller/wiki-sim (pair-score-sampled)`
- **Fine-Tuning Data:** `sentence-transformers/stsb`

---

## Thank You

Thanks to the NeuML team for providing the BERT Hash models, and to the Sentence Transformers team for their leadership in transformer encoder models.

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{stsnano2025,
  author = {Miller, D. Lee},
  title = {Bert Hash STS: An STS cross encoder model},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/dleemiller/sts-bert-hash-pico},
}
```

---

## License

This model is licensed under the [MIT License](LICENSE).