BERT Base Spoiler Detection
Model Description
This model is a fine-tuned version of bert-base-uncased for detecting spoilers in movie and TV show reviews. It performs binary classification, labeling each review as either containing spoilers or spoiler-free.
- Developed by: Tyler Jordan
- Model type: Text Classification
- Language: English
- License: MIT
- Base model: bert-base-uncased
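
A minimal inference sketch. The Hub model ID below is a placeholder, and the label order (1 = spoiler) is an assumption; check the model's config before relying on it:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder model ID -- substitute the actual Hub repository name.
model_id = "your-username/bert-base-spoiler-detection"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

review = "I can't believe the main character dies at the end!"
inputs = tokenizer(review, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Assumes label 1 = spoiler and label 0 = non-spoiler; verify via model.config.id2label.
probs = torch.softmax(logits, dim=-1)
print(f"P(spoiler) = {probs[0, 1]:.3f}")
```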
Intended Use
Primary Use Case
Automatically detect spoilers in user-generated movie and TV show reviews to warn readers before they encounter plot-revealing content.
Intended Users
- Movie review platforms
- Content moderation systems
- Personal projects for filtering spoilers
Out-of-Scope Uses
- Reviews in languages other than English
- Non-entertainment content (news, academic papers, etc.)
- Legal or medical content requiring high accuracy
Training Data
Dataset: IMDB Review Dataset by Enam Biswas (2021)
Preprocessing (sketched in code after this list):
- Sampled 200,000 balanced reviews (100k spoilers, 100k non-spoilers) from 5.5M total reviews
- Train/Validation/Test split: 140k/30k/30k (70%/15%/15%)
- Text cleaning: HTML tag removal, whitespace normalization
- Minimum review length: 30 characters
- Maximum sequence length: 512 tokens
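
A minimal sketch of the preprocessing steps above. The file name and the `review_text`/`is_spoiler` column names are assumptions; the actual dataset schema may differ:

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Strip HTML tags and normalize whitespace, as described above."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

# Hypothetical file name and schema; adjust to the actual dataset layout.
df = pd.read_json("IMDB_reviews.json", lines=True)
df["review_text"] = df["review_text"].map(clean_text)
df = df[df["review_text"].str.len() >= 30]    # minimum review length: 30 chars

# Balanced sample: 100k spoilers + 100k non-spoilers, then shuffle.
spoilers = df[df["is_spoiler"]].sample(100_000, random_state=42)
non_spoilers = df[~df["is_spoiler"]].sample(100_000, random_state=42)
balanced = pd.concat([spoilers, non_spoilers]).sample(frac=1, random_state=42)
```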
Class Distribution:
- Spoiler: 50%
- Non-spoiler: 50%
Training Procedure
Training Hyperparameters
- Optimizer: AdamW
- Learning rate: 1e-5
- Batch size: 32
- Epochs: 5
- Max sequence length: 512
- Dropout: 0.3
- Weight decay: 0.01
- Warmup steps: 10% of total steps
- Learning rate schedule: Linear warmup with linear decay (see the sketch below)
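
A minimal sketch of the optimizer and scheduler setup with these hyperparameters, using torch.optim.AdamW and transformers' get_linear_schedule_with_warmup. Whether the 0.3 dropout applies to BERT's internal layers (as shown) or only to a classifier head is an assumption:

```python
import torch
from torch.optim import AdamW
from transformers import (AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

# Assumption: dropout 0.3 is applied to BERT's hidden and attention layers.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    hidden_dropout_prob=0.3,
    attention_probs_dropout_prob=0.3,
)

optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

# 140k training examples / batch size 32, for 5 epochs.
steps_per_epoch = 140_000 // 32
total_steps = steps_per_epoch * 5
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup: 10% of total steps
    num_training_steps=total_steps,
)
```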
Training Hardware
- GPU: NVIDIA T4 (Google Colab)
- Training time: ~2-3 hours
Framework
- PyTorch 2.5.1
- Transformers 4.x
- CUDA 12.1
Evaluation
Metrics
| Metric | Value |
|---|---|
| Test Accuracy | 76.0% |
| Validation Accuracy | 76.3% |
Evaluation Data
- 30,000 held-out reviews from the IMDB dataset
- Balanced split (50% spoilers, 50% non-spoilers); a generic evaluation loop is sketched below
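
A generic sketch of how the reported accuracy could be computed, assuming a DataLoader that yields tokenized batches containing a `labels` key:

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def evaluate(model, dataloader: DataLoader, device: str = "cuda") -> float:
    """Compute accuracy over a tokenized, batched held-out set."""
    model.eval().to(device)
    correct = total = 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        labels = batch.pop("labels")
        preds = model(**batch).logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total
```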