cnmoro/inference-free-splade-bert-tiny-en-ptbr

license: apache-2.0
datasets:
  - cnmoro/GPT4-500k-Augmented-PTBR-Clean
  - cnmoro/WizardVicuna-PTBR-Instruct-Clean
  - sentence-transformers/natural-questions
language:
  - en
  - pt
base_model:
  - prajjwal1/bert-tiny
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
  - splade
  - sparse
  - embeddings
  - bert
  - portuguese
  - pt
  - ptbr

SPLADE (BERT-Tiny) fine-tuned on merged Natural Questions (NQ) and PT-BR instruction datasets

This model is a sparse retriever trained with the Sentence Transformers SPLADE stack on a merged bilingual corpus (English and Brazilian Portuguese).

Usage

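```python
from sentence_transformers import SparseEncoder

# Load the model from the Hugging Face Hub
sparse_model = SparseEncoder("cnmoro/inference-free-splade-bert-tiny-en-ptbr")

# Encode a batch of texts into sparse, vocabulary-sized embeddings
sparse_embeddings = sparse_model.encode(["Hello", "World"], show_progress_bar=True)
```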
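
SPLADE-style embeddings assign a non-negative weight to each vocabulary token, and a query is usually scored against a document with a dot product, so only tokens active in both vectors contribute. The sketch below illustrates that scoring with plain dictionaries. The helper `sparse_dot`, the token names, and the weights are made up for illustration and are not output from this model.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost scales with the sparser vector.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


# Hypothetical token weights, NOT real model output.
query = {"capital": 1.0, "brasil": 1.0}
docs = {
    "doc_pt": {"brasília": 2.0, "capital": 1.5, "brasil": 1.0},
    "doc_en": {"paris": 2.0, "capital": 1.5, "france": 1.0},
}

# Score every document against the query and rank by descending score.
scores = {name: sparse_dot(query, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because most weights are zero, this kind of scoring maps naturally onto an inverted index, which is the usual reason to prefer a sparse retriever over a dense one.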

Base model

prajjwal1/bert-tiny

Training date

Training completed on April 13, 2026.

Dataset composition

The training corpus was built by row-wise concatenation of:

sentence-transformers/natural-questions
cnmoro/GPT4-500k-Augmented-PTBR-Clean
cnmoro/WizardVicuna-PTBR-Instruct-Clean

Final merged size:

Total rows: 869,365
Train rows: 868,365
Eval rows: 1,000
Split seed: 12
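
The row counts above are consistent with holding out a fixed 1,000 rows after a seeded shuffle. The card does not say which tool performed the split, so the following is only a standard-library sketch of the idea. It works on row indices, `split_indices` is a hypothetical helper, and it will not reproduce the exact rows the author held out.

```python
import random

TOTAL_ROWS = 869_365  # merged corpus size from the card
EVAL_ROWS = 1_000     # held-out rows from the card
SPLIT_SEED = 12       # split seed from the card


def split_indices(total, eval_size, seed):
    """Shuffle row indices with a fixed seed and hold out the first eval_size."""
    indices = list(range(total))
    random.Random(seed).shuffle(indices)
    return indices[eval_size:], indices[:eval_size]


train_idx, eval_idx = split_indices(TOTAL_ROWS, EVAL_ROWS, SPLIT_SEED)
print(len(train_idx), len(eval_idx))  # 868365 1000
```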

Training objective

Loss: SpladeLoss(SparseMultipleNegativesRankingLoss)
Document regularizer weight: 0.03
Query regularizer weight: 0
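
To make the objective concrete: the ranking term treats each query's paired answer as the positive and every other answer in the batch as a negative, while the regularizer penalizes dense document vectors. The pure-Python sketch below assumes the regularizer is the FLOPS penalty commonly used with SPLADE and that scores are unscaled dot products. Neither detail is stated on this card, and the toy vectors are invented.

```python
import math

DOC_REG_WEIGHT = 0.03   # document regularizer weight from the card
QUERY_REG_WEIGHT = 0.0  # query regularizer weight from the card


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_ranking_loss(queries, docs):
    """Mean cross-entropy where docs[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, d) for d in docs]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)


def flops_penalty(vectors):
    """Sum over dimensions of the squared mean activation in the batch."""
    n = len(vectors)
    return sum((sum(col) / n) ** 2 for col in zip(*vectors))


# Toy 4-dimensional "vocabulary" activations (assumed values).
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
docs = [[2.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]]

ranking = in_batch_ranking_loss(queries, docs)
loss = (
    ranking
    + DOC_REG_WEIGHT * flops_penalty(docs)
    + QUERY_REG_WEIGHT * flops_penalty(queries)
)
print(round(ranking, 4), round(flops_penalty(docs), 4), round(loss, 4))
```

With a query regularizer weight of 0, the query term drops out entirely and only document sparsity is pushed on during training.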

Core hyperparameters

Epochs: 1
Per-device batch size: 64
Max sequence length: 256
SPLADE pooling chunk size: 64
Learning rate: 2e-5
Warmup ratio: 0.1
Mixed precision: fp16=True
Batch sampler: NO_DUPLICATES
Router mapping: query -> query, answer -> document
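
As a rough check on the schedule these settings imply, the snippet below derives the number of optimizer steps and warmup steps. It assumes a single device, no gradient accumulation, that the last partial batch is kept, and that warmup steps are the ratio times the total steps, rounded up. None of these is stated on the card, and the NO_DUPLICATES sampler could shift the batch count slightly.

```python
import math

TRAIN_ROWS = 868_365   # train rows from the card
BATCH_SIZE = 64        # per-device batch size from the card
EPOCHS = 1             # epochs from the card
WARMUP_RATIO = 0.1     # warmup ratio from the card

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# Assumption: warmup steps = ceil(ratio * total steps).
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)

print(total_steps, warmup_steps)  # 13569 1357
```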

Final training metrics

train_runtime: 2898.9331 s
train_steps_per_second: 4.681
train_samples_per_second: 299.546
train_loss: 0.81497
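
The reported throughput can be cross-checked against the dataset size with simple arithmetic. The snippet multiplies the runtime by each rate. Reading the results as one pass over the train split in batches of 64 on a single device is an interpretation, not something the card states.

```python
TRAIN_RUNTIME_S = 2898.9331    # train_runtime from the card
STEPS_PER_SECOND = 4.681       # train_steps_per_second from the card
SAMPLES_PER_SECOND = 299.546   # train_samples_per_second from the card

# Runtime times rate recovers the approximate totals behind the rounded rates.
approx_steps = TRAIN_RUNTIME_S * STEPS_PER_SECOND
approx_samples = TRAIN_RUNTIME_S * SAMPLES_PER_SECOND
samples_per_step = SAMPLES_PER_SECOND / STEPS_PER_SECOND

print(round(approx_steps), round(approx_samples), round(samples_per_step, 1))
```

The sample total lands within a couple of rows of the 868,365 train rows, and the ratio of the two rates comes out at about 64 samples per step, matching the per-device batch size.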