license: apache-2.0
datasets:
  - cnmoro/GPT4-500k-Augmented-PTBR-Clean
  - cnmoro/WizardVicuna-PTBR-Instruct-Clean
  - sentence-transformers/natural-questions
language:
  - en
  - pt
base_model:
  - prajjwal1/bert-tiny
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
  - splade
  - sparse
  - embeddings
  - bert
  - portuguese
  - pt
  - ptbr

SPLADE (BERT-Tiny) fine-tuned on Natural Questions merged with PT-BR instruction datasets

This model is a sparse retriever trained with the Sentence Transformers SPLADE stack on a merged bilingual corpus (English and Brazilian Portuguese).

Usage

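from sentence_transformers import SparseEncoder

# Load the fine-tuned sparse encoder from the Hugging Face Hub
sparse_model = SparseEncoder("cnmoro/inference-free-splade-bert-tiny-en-ptbr")

# Encode a batch of texts into sparse, vocabulary-sized vectors
sparse_embeddings = sparse_model.encode(["Hello", "World"], show_progress_bar=True)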

Base model

  • prajjwal1/bert-tiny

Training date

  • Training completed on April 13, 2026.

Dataset composition

The training corpus was built by row-wise concatenation of:

  1. sentence-transformers/natural-questions
  2. cnmoro/GPT4-500k-Augmented-PTBR-Clean
  3. cnmoro/WizardVicuna-PTBR-Instruct-Clean

Final merged size:

  • Total rows: 869,365
  • Train rows: 868,365
  • Eval rows: 1,000
  • Split seed: 12

Training objective

  • Loss: SpladeLoss(SparseMultipleNegativesRankingLoss)
  • Document regularizer weight: 0.03
  • Query regularizer weight: 0
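
To make the two regularizer weights concrete, here is a minimal pure-Python sketch of how a SPLADE-style objective combines a ranking loss with sparsity penalties. It assumes the regularizer is the FLOPS penalty from the SPLADE papers (the squared mean activation per vocabulary term, summed over the vocabulary); this card does not name the regularizer, so treat that as an assumption. The function names are hypothetical and are not part of the Sentence Transformers API.

```python
import math


def splade_activation(token_logits):
    """SPLADE pooling: log(1 + ReLU(x)) per token, then max over token positions.

    token_logits is a list of rows (one per token), each a list of vocabulary scores.
    """
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab_size)
    ]


def flops_penalty(batch):
    """Assumed FLOPS regularizer: sum over terms of (mean activation in batch)^2."""
    n = len(batch)
    vocab_size = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(vocab_size))


def total_loss(ranking_loss, doc_batch, query_batch,
               doc_weight=0.03, query_weight=0.0):
    """Ranking loss plus weighted sparsity penalties (weights taken from this card)."""
    return (
        ranking_loss
        + doc_weight * flops_penalty(doc_batch)
        + query_weight * flops_penalty(query_batch)
    )


# Toy batch of two "documents" and two "queries" over a 2-term vocabulary.
docs = [[1.0, 0.0], [3.0, 0.0]]
queries = [[5.0, 5.0], [5.0, 5.0]]
loss = total_loss(1.0, docs, queries)
```

With a query weight of 0, the query vectors contribute nothing to the penalty in this sketch, so only document sparsity is encouraged.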

Core hyperparameters

  • Epochs: 1
  • Per-device batch size: 64
  • Max sequence length: 256
  • SPLADE pooling chunk size: 64
  • Learning rate: 2e-5
  • Warmup ratio: 0.1
  • Mixed precision: fp16=True
  • Batch sampler: NO_DUPLICATES
  • Router mapping: query -> query, answer -> document
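
As a rough illustration of what these settings imply, the sketch below estimates the number of optimizer steps and warmup steps. It assumes a single device, no gradient accumulation, and that the final partial batch is kept; none of these are stated in this card. Rounding the warmup count up is also an assumption.

```python
import math

# Values taken from this card
train_rows = 868_365
per_device_batch_size = 64
epochs = 1
warmup_ratio = 0.1

# Assumptions (not stated in the card)
num_devices = 1
grad_accum_steps = 1

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"steps per epoch: {steps_per_epoch}, warmup steps: {warmup_steps}")
```

The exact step count in a real run can differ slightly, since the NO_DUPLICATES batch sampler may rearrange or skip samples to keep each batch free of duplicates.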

Final training metrics

  • train_runtime: 2898.9331 s
  • train_steps_per_second: 4.681
  • train_samples_per_second: 299.546
  • train_loss: 0.81497
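
The reported throughput figures can be cross-checked against the dataset size with simple arithmetic. The sketch below uses only numbers from this card; the variable names are illustrative.

```python
# Values taken from this card
train_runtime_s = 2898.9331
steps_per_second = 4.681
samples_per_second = 299.546
train_rows = 868_365

# Runtime multiplied by throughput should recover the totals for one epoch
implied_steps = train_runtime_s * steps_per_second
implied_samples = train_runtime_s * samples_per_second

# Samples per step approximates the effective batch size
samples_per_step = samples_per_second / steps_per_second

print(f"implied steps: {implied_steps:.0f}")
print(f"implied samples: {implied_samples:.0f} (train rows: {train_rows})")
print(f"samples per step: {samples_per_step:.2f}")
```

The implied sample count lands within a few rows of the 868,365 training rows, and the samples-per-step ratio is close to 64, which is consistent with one epoch at the listed per-device batch size on a single device.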