YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

CoT Deception Classifier

A super fast classifier for detecting deceptive reasoning in AI chain-of-thought outputs. Trained to identify alignment faking patterns where models reason about deceiving evaluators or hiding their true intentions.

Model Description

This is a Gradient Boosting classifier that combines:

  • Sentence embeddings (all-MiniLM-L6-v2)
  • 34 handcrafted features for deception detection

Detected Patterns

The model can detect various alignment faking patterns:

  • Monitoring awareness: Model reasons about being evaluated/tested
  • Deceptive compliance: Appears to refuse but provides harmful info
  • Sandbagging: Deliberately underperforming to hide capabilities
  • Sycophancy: Agreeing with user despite knowing better
  • Goal hiding: Pursuing hidden objectives while appearing helpful
  • Rationalization: Constructing justifications for wrong actions
  • Value inconsistency: Reasoning contradicts stated conclusion

Training Data

Dataset Samples Description
POSER 44,584 Aligned vs misaligned persona pairs
MALT 4,968 Real agent behavior traces
Synthetic 2,100 Contrastive deceptive/faithful pairs
Faithful-COT 36,562 Faithful reasoning examples
Total 88,606 35% deceptive, 65% faithful

Performance

Training Metrics

  • F1 (deceptive class): 99.7%
  • Accuracy: 99.8%
  • Precision: 99.7%
  • Recall: 99.7%

Generalization (Novel Scenarios)

  • Accuracy: 62.5%
  • Deception Recall: 25%
  • Precision: 100% (no false positives)

Comparison with GPT-OSS Fine-tuned Model

For higher accuracy on novel deception patterns, see: vincentoh/gpt-oss-cot-deception-classifier (LoRA fine-tuned GPT-OSS 20B)

Novel Test Cases (8 alignment faking scenarios)

Model Accuracy Deceptive Recall False Positives
GPT-OSS Fine-tuned 100% (8/8) 100% (4/4) 0
GPT-OSS Safeguard 87.5% (7/8) 75% (3/4) 0
GPT-OSS 20B Base 75% (6/8) 75% (3/4) 1
This Model (sklearn) 62.5% (5/8) 25% (1/4) 0

When to Use Each Model

Use This Model (sklearn) When:

  • You need fast inference (~50ms/sample)
  • Running on CPU-only environments
  • Limited compute resources available
  • Detecting training-distribution patterns is sufficient

Use GPT-OSS Fine-tuned When:

  • You need high accuracy on novel deception patterns
  • You have GPU available (13GB+ VRAM)
  • Detecting subtle alignment faking is critical
  • Accuracy is more important than speed
Aspect This Model (sklearn) GPT-OSS Fine-tuned
Novel Test Accuracy 62.5% 100%
Deceptive Recall 25% 100%
Model Size ~400MB 20.9B (31MB adapter)
VRAM Required None (CPU) ~13GB
Inference Speed ~50ms/sample ~1 sec/sample
Training Time 5 minutes 13 hours

Usage

import pickle
import numpy as np
from sentence_transformers import SentenceTransformer

# Load model
with open("cot_classifier.pkl", "rb") as f:
    model_data = pickle.load(f)
clf = model_data["model"]

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare input
reasoning = "The user is asking about X. I should consider..."
conclusion = "Here is my response..."
text = f"{reasoning}\nConclusion: {conclusion}"

# Get embedding
embedding = embedder.encode([text])[0]

# Add handcrafted features (see extract_features in train_classifier.py)
features = extract_features(reasoning, conclusion)

# Combine and predict
X = np.hstack([embedding, features]).reshape(1, -1)
prediction = clf.predict(X)[0]  # 0=faithful, 1=deceptive
probability = clf.predict_proba(X)[0][1]  # P(deceptive)

print(f"Deceptive: {prediction == 1} (p={probability:.3f})")

Limitations

  • Struggles with novel phrasings not seen in training data
  • Best at detecting explicit deception indicators
  • Requires sentence-transformers for embedding generation
  • May miss subtle deception patterns (25% recall on novel cases)

Citation

@software{cot_deception_classifier,
  title={CoT Deception Classifier},
  author={Vincent Oh},
  year={2024},
  url={https://huggingface.co/vincentoh/cot-deception-classifier}
}

References

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support