language: en
license: mit
library_name: sklearn
tags:
  - token-classification
  - ner
  - pii-detection
  - sklearn
datasets:
  - zachz/pii-detection-corpus
pipeline_tag: token-classification

PII NER Model

A lightweight scikit-learn Named Entity Recognition (NER) model for detecting Personally Identifiable Information (PII) in text.

Model Details

Type: DictVectorizer + LogisticRegression pipeline
Task: Token-level NER classification
Framework: scikit-learn
Labels: O, NAME, EMAIL, PHONE, SSN
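
The pipeline described above could be assembled roughly as follows. This is a minimal sketch, not the released training script: the build_pipeline helper, the toy rows, and max_iter=1000 are assumptions made for illustration, and the hyperparameters of the shipped model are not documented here.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_pipeline():
    """Hypothetical helper: DictVectorizer feeding LogisticRegression."""
    # DictVectorizer one-hot encodes string values and passes numeric
    # (including boolean) values through as numbers.
    return make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))


# Toy rows invented for this sketch; they are NOT the real training data.
toy_features = [
    {"word.lower": "contact", "word.length": 7, "word.has_at": False, "word.is_digit": False},
    {"word.lower": "jane@test.com", "word.length": 13, "word.has_at": True, "word.is_digit": False},
    {"word.lower": "call", "word.length": 4, "word.has_at": False, "word.is_digit": False},
    {"word.lower": "555-123-4567", "word.length": 12, "word.has_at": False, "word.is_digit": False},
]
toy_labels = ["O", "EMAIL", "O", "PHONE"]

pipeline = build_pipeline()
pipeline.fit(toy_features, toy_labels)
```

Because the vectorizer lives inside the pipeline, predict can be called directly on a list of feature dicts, which matches how the model is used in the Usage section.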

Usage

import pickle

# Load the pickled pipeline. Only unpickle files from a source you trust.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def extract_features(tokens, idx):
    """Build the feature dict for the token at position idx."""
    token = tokens[idx]
    features = {
        'word.lower': token.lower(),
        'word.length': len(token),
        'word.has_at': '@' in token,
        'word.is_digit': token.isdigit(),
    }
    # Context features: neighboring tokens, when they exist.
    if idx > 0:
        features['prev.lower'] = tokens[idx-1].lower()
    if idx < len(tokens) - 1:
        features['next.lower'] = tokens[idx+1].lower()
    return features

text = "Contact jane@test.com or call 555-123-4567"
tokens = text.split()  # whitespace tokenization
features = [extract_features(tokens, i) for i in range(len(tokens))]
predictions = model.predict(features)  # one label per token
# Keep only the tokens tagged with a PII label.
entities = [(t, l) for t, l in zip(tokens, predictions) if l != "O"]
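
The usage code yields one (token, label) pair per token, so a multi-word name would come back as separate entries. A possible post-processing step is to merge runs of adjacent tokens that share a label. The sketch below is an assumption about how one might do that; group_entities is a hypothetical helper, and the labels are hand-written for illustration rather than real model output.

```python
from itertools import groupby


def group_entities(tokens, labels):
    """Merge runs of adjacent tokens that share the same non-"O" label."""
    spans = []
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label != "O":
            spans.append((" ".join(tok for tok, _ in run), label))
    return spans


# Hand-written labels for illustration, not actual predictions.
tokens = ["Contact", "Jane", "Doe", "at", "jane@test.com"]
labels = ["O", "NAME", "NAME", "O", "EMAIL"]
print(group_entities(tokens, labels))
# [('Jane Doe', 'NAME'), ('jane@test.com', 'EMAIL')]
```

Since the label set has no begin/inside markers, two distinct entities of the same type that sit next to each other would be merged into one span by this approach.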

Limitations

Small training set (12 examples)
Simple whitespace tokenization
English only
Best used as a lightweight first-pass PII detector
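
To illustrate the whitespace tokenization limitation: str.split() leaves punctuation attached to tokens, so an email followed by a comma becomes a different token from the bare address. The sketch below shows the effect and one possible workaround. split_and_strip and the EDGE_PUNCT character set are hypothetical and not part of the released model, and whether stripping helps depends on how the training data was tokenized.

```python
# Characters stripped from token edges; this set is an assumption.
EDGE_PUNCT = ".,;:!?\"'()[]"


def split_and_strip(text):
    """Whitespace-split, then strip punctuation from the edges of each token."""
    stripped = (tok.strip(EDGE_PUNCT) for tok in text.split())
    return [tok for tok in stripped if tok]  # drop tokens that were only punctuation


text = "Email jane@test.com, then call 555-123-4567."
print(text.split())
# ['Email', 'jane@test.com,', 'then', 'call', '555-123-4567.']
print(split_and_strip(text))
# ['Email', 'jane@test.com', 'then', 'call', '555-123-4567']
```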

License

MIT