metadata
license: apache-2.0
datasets:
- cnmoro/GPT4-500k-Augmented-PTBR-Clean
- cnmoro/WizardVicuna-PTBR-Instruct-Clean
- sentence-transformers/natural-questions
language:
- en
- pt
base_model:
- prajjwal1/bert-tiny
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- splade
- sparse
- embeddings
- bert
- portuguese
- pt
- ptbr
SPLADE (BERT-Tiny) fine-tuned on merged NQ + PT-BR instruction datasets
This model is a sparse retriever trained with the Sentence Transformers SPLADE stack on a merged bilingual corpus (English and Brazilian Portuguese).
Usage
from sentence_transformers import SparseEncoder

# Load the sparse encoder from the Hugging Face Hub
sparse_model = SparseEncoder("cnmoro/inference-free-splade-bert-tiny-en-ptbr")
# Encode two texts into sparse embeddings
sparse_embeddings = sparse_model.encode(["Hello", "World"], show_progress_bar=True)
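To make the retrieval step concrete: a SPLADE-style embedding assigns a weight to each vocabulary token, and relevance is the dot product over the tokens that a query and a document share. The sketch below illustrates that scoring with plain dictionaries. The tokens, weights, and the `sparse_dot` helper are hypothetical and are not output from this model.

```python
def sparse_dot(query, doc):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost scales with the shorter vector.
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(weight * large[token] for token, weight in small.items() if token in large)


# Hypothetical token weights, chosen only for illustration.
query = {"capital": 1.2, "brazil": 1.5}
docs = {
    "doc_a": {"brasilia": 0.9, "capital": 0.8, "brazil": 1.1, "city": 0.3},
    "doc_b": {"happy": 0.7, "dog": 1.0},
}

scores = {name: sparse_dot(query, vector) for name, vector in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Only overlapping tokens contribute to the score, which is why sparse embeddings can be served from an inverted index.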
Base model
prajjwal1/bert-tiny
Training date
- Training completed on April 13, 2026.
Dataset composition
The training corpus was built by row-wise concatenation of:
- sentence-transformers/natural-questions
- cnmoro/GPT4-500k-Augmented-PTBR-Clean
- cnmoro/WizardVicuna-PTBR-Instruct-Clean
Final merged size:
- Total rows: 869,365
- Train rows: 868,365
- Eval rows: 1,000
- Split seed: 12
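The composition above amounts to stacking the rows of the three sources and holding out a fixed number of rows for evaluation. The card does not say which tool performed the split, so the following is only a standard-library sketch of a seeded hold-out on toy data. `concat_and_split` is a hypothetical helper, and it will not reproduce the author's exact split.

```python
import random


def concat_and_split(sources, eval_size, seed):
    """Concatenate row lists, shuffle deterministically, and hold out eval_size rows."""
    rows = [row for source in sources for row in source]
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    rng.shuffle(rows)
    return rows[eval_size:], rows[:eval_size]


# Toy stand-ins for the three datasets (the real corpus has 869,365 rows).
toy_sources = [
    [("nq", i) for i in range(6)],
    [("gpt4_ptbr", i) for i in range(5)],
    [("wizardvicuna_ptbr", i) for i in range(4)],
]

train_rows, eval_rows = concat_and_split(toy_sources, eval_size=3, seed=12)
print(len(train_rows), len(eval_rows))

# The sizes reported in the card are consistent with a 1,000-row hold-out.
assert 869_365 - 1_000 == 868_365
```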
Training objective
- Loss: SpladeLoss(SparseMultipleNegativesRankingLoss)
- Document regularizer weight: 0.03
- Query regularizer weight: 0
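The objective above combines an in-batch ranking loss with separately weighted sparsity regularizers for queries and documents. The following is a minimal standard-library sketch of that structure. It assumes a FLOPS-style regularizer and a similarity scale of 1.0, neither of which is stated in this card. The function names are illustrative and are not the library's API.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def flops(batch):
    """FLOPS-style regularizer: sum over dimensions of the squared mean activation."""
    n = len(batch)
    return sum((sum(row[j] for row in batch) / n) ** 2 for j in range(len(batch[0])))


def in_batch_ranking_loss(queries, docs, scale=1.0):
    """Cross-entropy where docs[i] is the positive for queries[i]; other docs act as negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * dot(q, d) for d in docs]
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[i]
    return total / len(queries)


def splade_style_loss(queries, docs, doc_weight=0.03, query_weight=0.0):
    """Ranking loss plus weighted sparsity penalties (weights taken from the card)."""
    return (
        in_batch_ranking_loss(queries, docs)
        + doc_weight * flops(docs)
        + query_weight * flops(queries)
    )


# Toy activations over a 3-token vocabulary.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
docs = [[2.0, 0.0, 0.5], [0.0, 2.0, 0.0]]
loss = splade_style_loss(queries, docs)
print(loss)
```

With a query weight of 0, only the document side is pushed toward sparsity in this sketch.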
Core hyperparameters
- Epochs: 1
- Per-device batch size: 64
- Max sequence length: 256
- SPLADE pooling chunk size: 64
- Learning rate: 2e-5
- Warmup ratio: 0.1
- Mixed precision: fp16=True
- Batch sampler: NO_DUPLICATES
- Router mapping: query -> query, answer -> document
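As a worked example, the hyperparameters above imply roughly how many optimizer steps one epoch takes and how many of them fall in the warmup phase. The calculation assumes a single device and no gradient accumulation, neither of which is stated in this card. The NO_DUPLICATES sampler may also change the batch count slightly, so treat the numbers as estimates.

```python
import math

train_rows = 868_365   # train split size from the card
batch_size = 64        # per-device batch size from the card
num_devices = 1        # assumption: not stated in the card
grad_accum_steps = 1   # assumption: not stated in the card
epochs = 1
warmup_ratio = 0.1

effective_batch = batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounded up in this sketch

print(steps_per_epoch, warmup_steps)
```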
Final training metrics
- train_runtime: 2898.9331 s
- train_steps_per_second: 4.681
- train_samples_per_second: 299.546
- train_loss: 0.81497
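The reported throughput figures can be cross-checked against the dataset size and batch size given earlier. The short calculation below uses only numbers from this card. Small deviations are expected because the reported rates are rounded.

```python
train_runtime_s = 2898.9331
steps_per_second = 4.681
samples_per_second = 299.546

approx_steps = train_runtime_s * steps_per_second      # about 13,570 steps
approx_samples = train_runtime_s * samples_per_second  # close to the 868,365 train rows
implied_batch = samples_per_second / steps_per_second  # close to the batch size of 64
runtime_minutes = train_runtime_s / 60                 # roughly 48 minutes

print(round(approx_steps), round(approx_samples), round(implied_batch, 2), round(runtime_minutes, 1))
```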