---
license: apache-2.0
datasets:
- cnmoro/GPT4-500k-Augmented-PTBR-Clean
- cnmoro/WizardVicuna-PTBR-Instruct-Clean
- sentence-transformers/natural-questions
language:
- en
- pt
base_model:
- prajjwal1/bert-tiny
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- splade
- sparse
- embeddings
- bert
- portuguese
- pt
- ptbr
---

# SPLADE (BERT-Tiny) finetuned on merged NQ + PT-BR instruction datasets

This model is a sparse retriever trained with Sentence Transformers' SPLADE stack, using a merged multilingual corpus (English + Portuguese).

## Usage

```python
from sentence_transformers import SparseEncoder

sparse_model = SparseEncoder("cnmoro/inference-free-splade-bert-tiny-en-ptbr")
sparse_embeddings = sparse_model.encode(["Hello", "World"], show_progress_bar=True)
```

## Base model

- `prajjwal1/bert-tiny`

## Training date

- Training completed on **April 13, 2026**.

## Dataset composition

The training corpus was built by row-wise concatenation of:

1. `sentence-transformers/natural-questions`
2. `cnmoro/GPT4-500k-Augmented-PTBR-Clean`
3. `cnmoro/WizardVicuna-PTBR-Instruct-Clean`

Final merged size:

- Total rows: **869,365**
- Train rows: **868,365**
- Eval rows: **1,000**
- Split seed: `12`

## Training objective

- Loss: `SpladeLoss(SparseMultipleNegativesRankingLoss)`
- Document regularizer weight: `0.03`
- Query regularizer weight: `0`

## Core hyperparameters

- Epochs: **1**
- Per-device batch size: **64**
- Max sequence length: **256**
- SPLADE pooling chunk size: **64**
- Learning rate: `2e-5`
- Warmup ratio: `0.1`
- Mixed precision: `fp16=True`
- Batch sampler: `NO_DUPLICATES`
- Router mapping: `query -> query`, `answer -> document`

## Final training metrics

- `train_runtime`: **2898.9331 s**
- `train_steps_per_second`: **4.681**
- `train_samples_per_second`: **299.546**
- `train_loss`: **0.81497**
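
As a quick sanity check, the split sizes, batch size, and throughput figures reported in this card can be cross-checked with plain arithmetic. The sketch below is illustrative only: the variable names are made up for this example, and it assumes a single device with no gradient accumulation and the final partial batch kept, none of which is stated in the card.

```python
import math

# Figures copied from this model card.
total_rows = 869_365
eval_rows = 1_000
train_rows = 868_365
batch_size = 64            # per-device batch size
train_runtime_s = 2898.9331
steps_per_second = 4.681
samples_per_second = 299.546

# Assumption (not stated in the card): one device, no gradient accumulation,
# and the last partial batch is kept, so one epoch takes ceil(rows / batch) steps.
expected_steps = math.ceil(train_rows / batch_size)

# The eval split should account for the difference between total and train rows.
split_is_consistent = (total_rows - eval_rows) == train_rows

# Throughput multiplied by runtime should recover the step and sample counts,
# up to the rounding of the reported metrics.
implied_steps = steps_per_second * train_runtime_s
implied_samples = samples_per_second * train_runtime_s

print(f"split consistent: {split_is_consistent}")
print(f"expected steps: {expected_steps}, implied by throughput: {implied_steps:.1f}")
print(f"train rows: {train_rows}, implied by throughput: {implied_samples:.1f}")
```

Under those assumptions, the implied step and sample counts should agree with one epoch over the training split to within the rounding of the reported metrics. A larger gap would suggest a different effective batch size, such as multiple devices or gradient accumulation.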