---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- NeuML/bert-hash-pico
pipeline_tag: text-ranking
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
- stsbenchmark-sts
model-index:
- name: CrossEncoder based on NeuML/bert-hash-pico
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts test
      type: sts-test
    metrics:
    - type: pearson_cosine
      value: 0.7594692671867559
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.747410618220483
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev
      type: sts-dev
    metrics:
    - type: pearson_cosine
      value: 0.8216995594169731
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8226789104514981
      name: Spearman Cosine
---

# BERT Hash Cross-Encoder: Semantic Similarity (STS)

Cross encoders are high-performing encoder models that read two texts together and output a similarity score between 0 and 1.
I've found the `cross-encoder/stsb-roberta-large` model to be very useful for building evaluators for LLM outputs.
Cross encoders are simple to use, fast and very accurate.

BERT Hash models use a bucketing technique with projection to shrink the embedding parameters, which keeps every model in the family under 1M parameters.
These models are very small and well suited to inference at the edge.
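
To make the parameter savings concrete, here is a minimal sketch of a bucketed embedding followed by a projection. It is only an illustration: the modulo hash, the class name `HashedEmbedding`, and all of the sizes are assumptions chosen for the example, not the actual NeuML implementation or its configuration.

```python
import random


def standard_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a conventional vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


def hashed_embedding_params(num_buckets: int, bucket_dim: int, hidden_size: int) -> int:
    """Parameters in a bucket table plus a bucket_dim x hidden_size projection."""
    return num_buckets * bucket_dim + bucket_dim * hidden_size


class HashedEmbedding:
    """Toy lookup: token id -> bucket -> small vector -> projected hidden vector."""

    def __init__(self, num_buckets: int, bucket_dim: int, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.hidden_size = hidden_size
        # Small table with one row per bucket (randomly initialised for the demo).
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(bucket_dim)]
                      for _ in range(num_buckets)]
        # Projection from the small bucket dimension up to the hidden size.
        self.projection = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                           for _ in range(bucket_dim)]

    def bucket(self, token_id: int) -> int:
        # Assumption: a plain modulo hash; a real model may hash differently.
        return token_id % self.num_buckets

    def __call__(self, token_id: int) -> list:
        small = self.table[self.bucket(token_id)]
        # Matrix-vector product: small (bucket_dim) times projection (bucket_dim x hidden).
        return [sum(s * row[j] for s, row in zip(small, self.projection))
                for j in range(self.hidden_size)]


# Hypothetical sizes, picked only to show the order-of-magnitude difference.
full = standard_embedding_params(vocab_size=30522, hidden_size=80)
small = hashed_embedding_params(num_buckets=4096, bucket_dim=16, hidden_size=80)
print(f"standard: {full:,} parameters, hashed: {small:,} parameters")
```

The trade-off in this sketch is that token ids landing in the same bucket share a vector, so the layers above have to disambiguate them from context.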

---

## Features
- **Performance:** Achieves **Pearson: 0.7595** and **Spearman: 0.7474** on the STS-Benchmark test set.
- **Efficient architecture:** Based on the BERT Hash model architecture, with only 0.45M parameters.
- **Extended context length:** Processes sequences up to 8192 tokens, great for LLM output evals.
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

---

## Performance

| Model                          | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed  |
|--------------------------------|--------------------|---------------------|----------------|------------|---------|
| `dleemiller/ModernCE-large-sts`           | **0.9256**         | **0.9215**          | **8192**       | 395M       | **Medium** |
| `dleemiller/CrossGemma-sts-300m`          | 0.9175         | 0.9135          | 2048       | 303M       | **Medium** |
| `dleemiller/ModernCE-base-sts`            | 0.9162         | 0.9122          | **8192**       | 149M       | **Fast** |
| `cross-encoder/stsb-roberta-large`        | 0.9147            | -              | 512            | 355M       | Slow    |
| `dleemiller/EttinX-sts-m`                 | 0.9143        | 0.9102          | **8192**       | 149M       | **Fast** |
| `dleemiller/NeoCE-sts`                    | 0.9124         | 0.9087          | 4096       | 250M       | **Fast** |
| `dleemiller/EttinX-sts-s`                 | 0.9004        | 0.8926          | **8192**       | 68M       | **Very Fast** |
| `cross-encoder/stsb-distilroberta-base`   | 0.8792            | -              | 512            | 82M        | Fast    |
| `dleemiller/EttinX-sts-xs`                | 0.8763        | 0.8689          | **8192**       | 32M       | **Very Fast** |
| `dleemiller/EttinX-sts-xxs`               | 0.8414        | 0.8311          | **8192**       | 17M       | **Very Fast** |
| `dleemiller/sts-bert-hash-nano`           | 0.7904        | 0.7743          | **8192**       | 0.97M       | **Very Fast** |
| `dleemiller/sts-bert-hash-pico`           | 0.7595        | 0.7474          | **8192**       | 0.45M       | **Very Fast** |

---

## Usage

To use sts-bert-hash for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:

```python
from sentence_transformers import CrossEncoder

# Load CrossEncoder model
model = CrossEncoder("dleemiller/sts-bert-hash-pico", trust_remote_code=True)

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # float32 NumPy array with one similarity score per pair
```

### Output
The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.
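
Since these scores are handy for evaluating LLM outputs, here is a minimal sketch of a threshold-based evaluator. The function name `evaluate_outputs`, the result keys, and the `0.5` threshold are all hypothetical choices for illustration; a sensible threshold depends on your data and should be tuned.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Any callable that maps (text_a, text_b) pairs to similarity scores,
# for example the `predict` method of a loaded CrossEncoder.
ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def evaluate_outputs(
    score_fn: ScoreFn,
    references: Sequence[str],
    candidates: Sequence[str],
    threshold: float = 0.5,  # hypothetical default, tune for your use case
) -> List[Dict[str, object]]:
    """Score each candidate against its reference and flag pass/fail."""
    if len(references) != len(candidates):
        raise ValueError("references and candidates must have the same length")
    pairs = list(zip(references, candidates))
    scores = [float(s) for s in score_fn(pairs)]
    return [
        {"reference": ref, "candidate": cand, "score": score, "passed": score >= threshold}
        for (ref, cand), score in zip(pairs, scores)
    ]
```

With the model loaded as above, this could be called as `evaluate_outputs(model.predict, references, candidates)`, where `references` are the expected answers and `candidates` are the LLM outputs.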

---

## Training Details

### Pretraining
The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset.
This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
- **Classifier Dropout:** A relatively large classifier dropout of 0.15, to reduce overreliance on teacher scores.
- **Objective:** Teacher STS-B scores from `dleemiller/ModernCE-large-sts`, used as training targets.

### Fine-Tuning
Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.

### Validation Results
After fine-tuning, the model achieved the following performance on the STS-B test set:
- **Pearson Correlation:** 0.7595
- **Spearman Correlation:** 0.7474
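
For reference, both metrics compare predicted scores with gold similarity labels. The following is a dependency-free sketch of how they can be computed; it is not the evaluation code used for this model, and the sample values at the bottom are made up for illustration.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up labels and predictions, only to show the call pattern.
gold = [0.0, 0.2, 0.6, 1.0]
predicted = [0.1, 0.3, 0.9, 0.8]
print(f"Pearson: {pearson(gold, predicted):.4f}, Spearman: {spearman(gold, predicted):.4f}")
```

Pearson rewards predictions that are linearly related to the labels, while Spearman only checks that the ordering agrees, which is why the two numbers can differ.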

---

## Model Card

- **Architecture:** bert-hash-pico
- **Tokenizer:** Custom tokenizer trained with modern techniques for long-context handling.
- **Pretraining Data:** `dleemiller/wiki-sim (pair-score-sampled)`
- **Fine-Tuning Data:** `sentence-transformers/stsb`

---

## Thank You

Thanks to the NeuML team for providing the BERT Hash models, and the Sentence Transformers team for their leadership in transformer encoder models.

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{stsnano2025,
  author = {Miller, D. Lee},
  title = {Bert Hash STS: An STS cross encoder model},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/dleemiller/sts-bert-hash-pico},
}
```

---

## License

This model is licensed under the [MIT License](LICENSE).