---
language: en
license: mit
library_name: sklearn
tags:
- token-classification
- ner
- pii-detection
- sklearn
datasets:
- zachz/pii-detection-corpus
pipeline_tag: token-classification
---

# PII NER Model

A lightweight sklearn-based Named Entity Recognition model for detecting Personally Identifiable Information in text.

## Model Details

- **Type:** Dict Vectorizer + Logistic Regression pipeline
- **Task:** Token-level NER classification
- **Framework:** scikit-learn
- **Labels:** O, NAME, EMAIL, PHONE, SSN

## Usage

```python
import pickle

with open("model.pkl", "rb") as f:
    model = pickle.load(f)


def extract_features(tokens, idx):
    token = tokens[idx]
    features = {
        'word.lower': token.lower(),
        'word.length': len(token),
        'word.has_at': '@' in token,
        'word.is_digit': token.isdigit(),
    }
    if idx > 0:
        features['prev.lower'] = tokens[idx-1].lower()
    if idx < len(tokens)-1:
        features['next.lower'] = tokens[idx+1].lower()
    return features


text = "Contact jane@test.com or call 555-123-4567"
tokens = text.split()
features = [extract_features(tokens, i) for i in range(len(tokens))]
predictions = model.predict(features)
entities = [(t, l) for t, l in zip(tokens, predictions) if l != "O"]
```

## Limitations

- Small training set (12 examples)
- Simple whitespace tokenization
- English only
- Best used as a lightweight first-pass PII detector

## License

MIT
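
## Grouping Token Predictions

The usage snippet yields one `(token, label)` pair per token, so a multi-token entity such as a full name comes back as separate pieces. The following is a minimal sketch of one way to merge adjacent tokens that share a label. It assumes the labels carry no `B-`/`I-` prefixes, as the label list (O, NAME, EMAIL, PHONE, SSN) suggests. `group_entities` is a hypothetical helper that is not part of the model, and the `labels` list is made up for illustration rather than real model output.

```python
from itertools import groupby


def group_entities(tokens, labels):
    """Merge runs of adjacent tokens that share the same non-"O" label.

    Hypothetical helper, not part of the published model.
    """
    entities = []
    pairs = zip(tokens, labels)
    # groupby yields one group per run of consecutive identical labels.
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label == "O":
            continue  # skip tokens outside any entity
        text = " ".join(tok for tok, _ in run)
        entities.append((text, label))
    return entities


tokens = "Contact Jane Doe at jane@test.com".split()
# Illustrative labels only; in practice these would come from model.predict(...).
labels = ["O", "NAME", "NAME", "O", "EMAIL"]

print(group_entities(tokens, labels))
```

Because the labels do not mark entity boundaries, two distinct entities of the same type that sit next to each other (for example, two names in a row) would be merged into one span under this approach.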