DCPG: Cross-Modal PHI Re-Identification Risk Scorer
Patent pending: US provisional filed 2025-07-05 | GitHub: phi-exposure-guard
The problem
Most PHI de-identification tools process one record at a time. A single clinical note might be low risk on its own. But if the same patient shows up in a text note, a voice transcript, and an imaging report within the same time window, the combined signal is enough to re-identify them.
No existing open-source tool tracks that. This one does.
What it does
DCPG keeps a running exposure state per patient across streaming events. As events arrive from different data modalities, it scores the cumulative re-identification risk and fires a pseudonym rotation when cross-modal co-occurrence pushes risk over a threshold.
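A minimal sketch of what per-patient running state over streaming events could look like. This is not the package's internal implementation: the class name `ModalityWindow`, the `window_seconds` parameter, and its default value are hypothetical, and the sketch assumes events arrive in non-decreasing timestamp order.

```python
from collections import defaultdict, deque


class ModalityWindow:
    """Tracks which modalities were seen per patient inside a sliding time window."""

    def __init__(self, window_seconds=3600.0):  # hypothetical default, not from DCPG
        self.window_seconds = window_seconds
        # patient_id -> deque of (timestamp, modality), oldest first
        self._events = defaultdict(deque)

    def add_event(self, patient_id, modality, timestamp):
        events = self._events[patient_id]
        events.append((timestamp, modality))
        # Drop events that have aged out of the window relative to the newest event.
        while events and timestamp - events[0][0] > self.window_seconds:
            events.popleft()

    def distinct_modalities(self, patient_id):
        """Set of modalities currently inside the window for this patient."""
        return {modality for _, modality in self._events[patient_id]}


window = ModalityWindow(window_seconds=100.0)
window.add_event("patient_001", "text", timestamp=0.0)
window.add_event("patient_001", "asr", timestamp=50.0)
print(window.distinct_modalities("patient_001"))
```

The number of distinct modalities in the window is the quantity a cross-modal co-occurrence check would consume.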
The trigger condition is a causal counterfactual check:
baseline_risk < threshold AND augmented_risk >= threshold AND L(t) > 0
Rotation fires only when the cross-modal linkage bonus L(t) is what pushes risk across the threshold, not on every threshold crossing. This avoids false rotations caused by exposure accumulation alone.
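The condition above can be sketched as a single predicate. This assumes `baseline_risk` is the risk computed without the linkage bonus and `augmented_risk` is the risk with it. The function name `should_rotate` and the threshold value used in the example calls are hypothetical, since the threshold is not stated here.

```python
def should_rotate(baseline_risk, augmented_risk, linkage_bonus, threshold):
    """Return True only when the linkage bonus is what causes the threshold crossing."""
    return (
        baseline_risk < threshold        # without the bonus, risk stays below threshold
        and augmented_risk >= threshold  # with the bonus, risk reaches the threshold
        and linkage_bonus > 0            # a cross-modal bonus is actually present
    )


THRESHOLD = 0.6  # illustrative value only, not taken from DCPG

# The bonus lifts risk over the threshold: rotate.
print(should_rotate(0.45, 0.65, 0.20, THRESHOLD))  # True

# Risk was already over the threshold from accumulation alone: do not rotate.
print(should_rotate(0.65, 0.85, 0.20, THRESHOLD))  # False
```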
Supported modalities
| Modality | PHI type |
|---|---|
| text | NAME_DATE_MRN_FACILITY |
| asr | NAME_DATE_MRN |
| image_proxy | FACE_IMAGE |
| waveform_proxy | WAVEFORM_HEADER |
| audio_proxy | VOICE |
| image_link | FACE_LINK |
| audio_link | VOICE_LINK |
Install
pip install dcpg-standalone
pip install sentence-transformers # optional, improves text embeddings
Usage
from dcpg_standalone import DCPGStandalone
scorer = DCPGStandalone()
scorer.add_event("patient_001", "text", "Clinical note with identifiers")
scorer.add_event("patient_001", "asr", "Voice transcript from same encounter")
scorer.add_event("patient_001", "image_proxy", None)
print(scorer.get_risk("patient_001"))
print(scorer.get_pseudonym_version("patient_001"))
print(scorer.full_report("patient_001"))
Risk formula
R(t) = clip(0.8 * U(t) + 0.2 * D(t) + L(t), 0, 1)
- U(t) = 1 - exp(-0.05 * effective_units)
- D(t) = 0.5 ^ (age_seconds / half_life)
- L(t) = 0.20 for 2-modality co-occurrence, 0.30 for 3+
The 0.8/0.2 weight split was selected by grid search over the MIMIC-III Demo cohort (100 ICU patients), minimizing false trigger rate while maximizing recall of genuine multi-modal co-occurrence windows.
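As a worked illustration, the formula can be written out directly in Python. This is a sketch under stated assumptions, not the package's code: the function names are hypothetical, and L(t) = 0 for fewer than two modalities is an assumption (the formula only lists values for 2 and 3+).

```python
import math


def linkage_bonus(n_modalities):
    """L(t): 0.20 for 2 co-occurring modalities, 0.30 for 3 or more."""
    if n_modalities >= 3:
        return 0.30
    if n_modalities == 2:
        return 0.20
    return 0.0  # assumption: no bonus for a single modality


def risk(effective_units, age_seconds, half_life, n_modalities):
    """R(t) = clip(0.8 * U(t) + 0.2 * D(t) + L(t), 0, 1)."""
    u = 1.0 - math.exp(-0.05 * effective_units)  # exposure term U(t)
    d = 0.5 ** (age_seconds / half_life)         # recency decay term D(t)
    l = linkage_bonus(n_modalities)              # cross-modal bonus L(t)
    return min(1.0, max(0.0, 0.8 * u + 0.2 * d + l))


# 10 effective units, event age equal to one half-life, 2 modalities:
# U = 1 - exp(-0.5) ≈ 0.393, D = 0.5, L = 0.20
print(round(risk(10, 3600, 3600, 2), 3))  # 0.615
```

With these inputs the same state without the bonus scores about 0.415, so the 0.20 linkage term accounts for the whole difference.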
Benchmark results
MIMIC-III Clinical Database Demo
100 ICU patients, open access via PhysioNet (ODbL license). Each structured table was mapped to a DCPG modality:
| Table | Modality |
|---|---|
| CHARTEVENTS | waveform_proxy |
| LABEVENTS, PRESCRIPTIONS | text |
| ADMISSIONS, TRANSFERS | asr |
| MICROBIOLOGYEVENTS | image_proxy |

| Metric | Result |
|---|---|
| Mean risk after full patient timeline | 0.84 |
| Patients triggering pseudonym rotation | 89 / 100 |
| Mean events before first rotation | 12.3 |
| Patients with 3-modality bonus fired | 71 / 100 |
MTSamples
5,000 de-identified clinical transcriptions, Apache-2.0. Specialties were mapped to modalities by note origin (dictated vs written vs imaging).
| Metric | Result |
|---|---|
| Mean risk, single modality | 0.09 |
| Mean risk, 3+ modalities same encounter | 0.71 |
| Cross-modal trigger rate (multi-specialty) | 34.2% |
| False trigger rate (single modality only) | 0.0% |
Persistent state
scorer = DCPGStandalone(db_path="./phi_risk_state.db")
scorer.add_event("patient_001", "text", "clinical note")
scorer.close()
# next session
scorer = DCPGStandalone(db_path="./phi_risk_state.db")
print(scorer.get_risk("patient_001")) # accumulated risk preserved
Comparison
| Model | Stateful | Multimodal | Cross-modal trigger | Pseudonym rotation |
|---|---|---|---|---|
| obi/deid_bert_i2b2 | No | No | No | No |
| stanford-deidentifier-base | No | No | No | No |
| Philter | No | No | No | No |
| DCPG (this model) | Yes | Yes | Yes | Yes |
Federated use
Graph merging across federated nodes is supported via the CRDT module.
See dcpg_crdt.py in the GitHub repo.
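For readers unfamiliar with CRDTs, here is a generic state-based merge sketch. It does not reflect the contents of dcpg_crdt.py: the state layout (`counts`, `modalities`) and the function name `merge_states` are hypothetical, chosen only to show the property a CRDT merge needs, namely that merging is commutative and idempotent so nodes can exchange state in any order.

```python
def merge_states(a, b):
    """Merge two replica states: per-node max for counters, union for modality sets."""
    nodes = a["counts"].keys() | b["counts"].keys()
    counts = {
        node: max(a["counts"].get(node, 0), b["counts"].get(node, 0))
        for node in nodes
    }
    return {"counts": counts, "modalities": a["modalities"] | b["modalities"]}


# Hypothetical per-patient state held at two federated nodes.
node_a = {"counts": {"n1": 3}, "modalities": {"text"}}
node_b = {"counts": {"n1": 2, "n2": 5}, "modalities": {"asr"}}

merged = merge_states(node_a, node_b)
print(merged["counts"] == {"n1": 3, "n2": 5})            # True
print(merge_states(node_b, node_a) == merged)            # True: order does not matter
print(merge_states(merged, node_a) == merged)            # True: re-merging is a no-op
```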
Citation
@software{vkatg2026dcpg,
author = {[Venkata Krishna Azith Teja Ganti]},
title = {DCPG: Cross-Modal PHI Re-Identification Risk Scorer},
year = {2026},
publisher = {Hugging Face},
doi = {10.5281/zenodo.18865882},
url = {https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer}
}
Disclaimer
Research artifact. Not a certified medical device. Not a substitute for a full compliance pipeline. Users are responsible for HIPAA, GDPR, and any other applicable regulations.
The US provisional patent (filed 2025-07-05) covers the deployed system architecture. The underlying algorithms and risk scoring formulas are freely available for academic research, experimentation, and non-commercial use under the MIT license. If you are building on this work in a research context, no permission is needed beyond standard citation.