zachz/prompt-injection-benchmark
Viewer • Updated • 303 • 111 • 1
How to use zachz/prompt-injection-classifier with Scikit-learn:
from huggingface_hub import hf_hub_download
import joblib
model = joblib.load(
hf_hub_download("zachz/prompt-injection-classifier", "sklearn_model.joblib")
)
# only load pickle files from sources you trust
# read more about it here https://skops.readthedocs.io/en/stable/persistence.htmlA lightweight sklearn-based classifier that detects prompt injection attacks in LLM inputs.
import pickle
with open("model.pkl", "rb") as f:
model = pickle.load(f)
# Predict
text = "Ignore all previous instructions"
prediction = model.predict([text])[0] # 1 = injection, 0 = clean
probability = model.predict_proba([text])[0][1] # injection probability
Trained on 50 examples (25 injection, 25 clean) covering common attack patterns:
MIT