BERT Base Spoiler Detection
Model Description
This model is a fine-tuned version of bert-base-uncased for detecting spoilers in movie and TV show reviews. It performs binary classification, labeling each review as either containing spoilers or spoiler-free.
- Developed by: Tyler Jordan
- Model type: Text Classification
- Language: English
- License: MIT
- Base model: bert-base-uncased
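
A minimal inference sketch. The Hub model ID below is a placeholder, and the label order (1 = spoiler) is an assumption; check the model's config before relying on it:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder model ID -- substitute the actual Hub repository name.
model_id = "your-username/bert-base-spoiler-detection"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

review = "I can't believe the main character dies at the end!"
inputs = tokenizer(review, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Assumes label 1 = spoiler and label 0 = non-spoiler; verify via model.config.id2label.
probs = torch.softmax(logits, dim=-1)
print(f"P(spoiler) = {probs[0, 1]:.3f}")
```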
Intended Use
Primary Use Case
Automatically detect spoilers in user-generated movie and TV show reviews to warn readers before they encounter plot-revealing content.
Intended Users
- Movie review platforms
- Content moderation systems
- Personal projects for filtering spoilers
Out-of-Scope Uses
- Reviews in languages other than English
- Non-entertainment content (news, academic papers, etc.)
- Legal or medical content requiring high accuracy
Training Data
Dataset: IMDB Review Dataset by Enam Biswas (2021)
Preprocessing (sketched in code after this list):
- Sampled 200,000 balanced reviews (100k spoilers, 100k non-spoilers) from 5.5M total reviews
- Train/Validation/Test split: 140k/30k/30k (70%/15%/15%)
- Text cleaning: HTML tag removal, whitespace normalization
- Minimum review length: 30 characters
- Maximum sequence length: 512 tokens
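
A minimal sketch of the preprocessing steps above. The file name and the `review_text`/`is_spoiler` column names are assumptions; the actual dataset schema may differ:

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Strip HTML tags and normalize whitespace, as described above."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

# Hypothetical file name and schema; adjust to the actual dataset layout.
df = pd.read_json("IMDB_reviews.json", lines=True)
df["review_text"] = df["review_text"].map(clean_text)
df = df[df["review_text"].str.len() >= 30]    # minimum review length: 30 chars

# Balanced sample: 100k spoilers + 100k non-spoilers, then shuffle.
spoilers = df[df["is_spoiler"]].sample(100_000, random_state=42)
non_spoilers = df[~df["is_spoiler"]].sample(100_000, random_state=42)
balanced = pd.concat([spoilers, non_spoilers]).sample(frac=1, random_state=42)
```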
Class Distribution:
- Spoiler: 50%
- Non-spoiler: 50%
Training Procedure
Training Hyperparameters
- Optimizer: AdamW
- Learning rate: 1e-5
- Batch size: 32
- Epochs: 5
- Max sequence length: 512
- Dropout: 0.3
- Weight decay: 0.01
- Warmup steps: 10% of total steps
- Learning rate schedule: Linear warmup with linear decay (see the sketch below)
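
A minimal sketch of the optimizer and scheduler setup with these hyperparameters, using torch.optim.AdamW and transformers' get_linear_schedule_with_warmup. Whether the 0.3 dropout applies to BERT's internal layers (as shown) or only to a classifier head is an assumption:

```python
import torch
from torch.optim import AdamW
from transformers import (AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

# Assumption: dropout 0.3 is applied to BERT's hidden and attention layers.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    hidden_dropout_prob=0.3,
    attention_probs_dropout_prob=0.3,
)

optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

# 140k training examples / batch size 32, for 5 epochs.
steps_per_epoch = 140_000 // 32
total_steps = steps_per_epoch * 5
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup: 10% of total steps
    num_training_steps=total_steps,
)
```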
Training Hardware
- GPU: NVIDIA T4 (Google Colab)
- Training time: ~2-3 hours
Framework
- PyTorch 2.5.1
- Transformers 4.x
- CUDA 12.1
Evaluation
Metrics
| Metric | Value |
|---|---|
| Test Accuracy | 76.0% |
| Validation Accuracy | 76.3% |
Evaluation Data
- 30,000 held-out reviews from the IMDB dataset
- Balanced split (50% spoilers, 50% non-spoilers); a generic evaluation loop is sketched below
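
A generic sketch of how the reported accuracy could be computed, assuming a DataLoader that yields tokenized batches containing a `labels` key:

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def evaluate(model, dataloader: DataLoader, device: str = "cuda") -> float:
    """Compute accuracy over a tokenized, batched held-out set."""
    model.eval().to(device)
    correct = total = 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        labels = batch.pop("labels")
        preds = model(**batch).logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total
```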