DCPG: Cross-Modal PHI Re-Identification Risk Scorer

Patent pending: US provisional filed 2025-07-05
GitHub: phi-exposure-guard


The problem

Most PHI de-identification tools process one record at a time. A single clinical note might be low risk on its own. But if the same patient shows up in a text note, a voice transcript, and an imaging report within the same time window, the combined signal is enough to re-identify them.

No existing open-source tool tracks this cumulative, cross-modal exposure. DCPG does.


What it does

DCPG keeps a running exposure state per patient across streaming events. As events arrive from different data modalities, it scores the cumulative re-identification risk and fires a pseudonym rotation when cross-modal co-occurrence pushes risk over a threshold.
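The per-patient running state described above can be pictured with a minimal sketch. This is not the DCPG implementation: the class name `ExposureState`, the `record` method, and the one-hour window are all hypothetical choices made for illustration, since the actual window length is not stated here.

```python
from collections import defaultdict

# Hypothetical window length; the real value used by DCPG is not stated.
WINDOW_SECONDS = 3600.0


class ExposureState:
    """Illustrative per-patient event log with a sliding time window."""

    def __init__(self, window_seconds: float = WINDOW_SECONDS) -> None:
        self.window_seconds = window_seconds
        # patient_id -> list of (timestamp_seconds, modality)
        self._events = defaultdict(list)

    def record(self, patient_id: str, modality: str, timestamp: float) -> None:
        """Append one streaming event to the patient's log."""
        self._events[patient_id].append((timestamp, modality))

    def modalities_in_window(self, patient_id: str, now: float) -> set:
        """Return the distinct modalities seen in [now - window, now]."""
        cutoff = now - self.window_seconds
        events = self._events.get(patient_id, [])
        return {m for t, m in events if cutoff <= t <= now}
```

The number of distinct modalities returned for a patient is the kind of signal a cross-modal co-occurrence check would consume.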

The trigger condition is a causal counterfactual check:

baseline_risk < threshold  AND  augmented_risk >= threshold  AND  L(t) > 0

In other words, the rotation fires only when the risk without the cross-modal linkage bonus L(t) is below the threshold and adding L(t) pushes it over. A crossing caused by exposure accumulation alone does not trigger a rotation, which avoids false rotations.
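As a rough sketch, the trigger condition can be written as a single predicate. This assumes that augmented_risk is baseline_risk plus L(t), clipped to 1, which matches the additive role of L(t) in the risk formula but is not confirmed as the library's internal logic. The function name `should_rotate` and the threshold value used in any call are hypothetical.

```python
def should_rotate(baseline_risk: float, linkage_bonus: float, threshold: float) -> bool:
    """Return True only if the linkage bonus causes the threshold crossing.

    Assumption: augmented risk = baseline risk + linkage bonus, clipped to 1.
    """
    augmented_risk = min(1.0, baseline_risk + linkage_bonus)
    return (
        baseline_risk < threshold        # not already over without the bonus
        and augmented_risk >= threshold  # over once the bonus is added
        and linkage_bonus > 0            # a cross-modal bonus is actually present
    )
```

With an illustrative threshold of 0.7, a baseline of 0.6 with a 0.2 bonus would qualify, while a baseline of 0.75 would not, because the threshold was already crossed without the bonus.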


Supported modalities

| Modality | PHI type |
|---|---|
| text | NAME_DATE_MRN_FACILITY |
| asr | NAME_DATE_MRN |
| image_proxy | FACE_IMAGE |
| waveform_proxy | WAVEFORM_HEADER |
| audio_proxy | VOICE |
| image_link | FACE_LINK |
| audio_link | VOICE_LINK |

Install

pip install dcpg-standalone
pip install sentence-transformers  # optional, improves text embeddings

Usage

from dcpg_standalone import DCPGStandalone

scorer = DCPGStandalone()

scorer.add_event("patient_001", "text", "Clinical note with identifiers")
scorer.add_event("patient_001", "asr", "Voice transcript from same encounter")
scorer.add_event("patient_001", "image_proxy", None)

print(scorer.get_risk("patient_001"))
print(scorer.get_pseudonym_version("patient_001"))
print(scorer.full_report("patient_001"))

Risk formula

R(t) = clip(0.8 * U(t) + 0.2 * D(t) + L(t), 0, 1)
  • U(t) = 1 - exp(-0.05 * effective_units)
  • D(t) = 0.5 ^ (age_seconds / half_life)
  • L(t) = 0.20 for 2-modality co-occurrence, 0.30 for 3+

The 0.8/0.2 weight split was selected by grid search over the MIMIC-III Demo cohort (100 ICU patients), minimizing false trigger rate while maximizing recall of genuine multi-modal co-occurrence windows.
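The formula above can be evaluated directly. The sketch below is a plain transcription of it, not the library's code: `effective_units`, `age_seconds`, and `half_life` are taken as inputs because their computation and default values are not given here, and the function names are hypothetical.

```python
import math


def linkage_bonus(n_modalities: int) -> float:
    """L(t): 0.20 for 2 co-occurring modalities, 0.30 for 3 or more, else 0."""
    if n_modalities >= 3:
        return 0.30
    if n_modalities == 2:
        return 0.20
    return 0.0


def risk(effective_units: float, age_seconds: float,
         half_life: float, n_modalities: int) -> float:
    """R(t) = clip(0.8 * U(t) + 0.2 * D(t) + L(t), 0, 1)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    u = 1.0 - math.exp(-0.05 * effective_units)  # saturating exposure term
    d = 0.5 ** (age_seconds / half_life)         # recency decay term
    raw = 0.8 * u + 0.2 * d + linkage_bonus(n_modalities)
    return min(1.0, max(0.0, raw))
```

Because L(t) is added outside the 0.8/0.2 weighting, a 3-modality co-occurrence raises the score by a flat 0.30 before clipping, regardless of how much exposure has accumulated.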


Benchmark results

MIMIC-III Clinical Database Demo

100 ICU patients, open access via PhysioNet (ODbL license). Each structured table was mapped to a DCPG modality:

| Table | Modality |
|---|---|
| CHARTEVENTS | waveform_proxy |
| LABEVENTS, PRESCRIPTIONS | text |
| ADMISSIONS, TRANSFERS | asr |
| MICROBIOLOGYEVENTS | image_proxy |

| Metric | Result |
|---|---|
| Mean risk after full patient timeline | 0.84 |
| Patients triggering pseudonym rotation | 89 / 100 |
| Mean events before first rotation | 12.3 |
| Patients with 3-modality bonus fired | 71 / 100 |
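For anyone reproducing the MIMIC-III mapping, the table-to-modality assignment listed above could be expressed as a simple lookup. The dictionary contents come from the mapping table; the helper name `modality_for_table` and its case-insensitive behavior are assumptions made for illustration.

```python
# Table-to-modality assignment as listed in the mapping table.
TABLE_TO_MODALITY = {
    "CHARTEVENTS": "waveform_proxy",
    "LABEVENTS": "text",
    "PRESCRIPTIONS": "text",
    "ADMISSIONS": "asr",
    "TRANSFERS": "asr",
    "MICROBIOLOGYEVENTS": "image_proxy",
}


def modality_for_table(table_name: str) -> str:
    """Look up the DCPG modality for a MIMIC-III table name (case-insensitive)."""
    key = table_name.strip().upper()
    if key not in TABLE_TO_MODALITY:
        raise ValueError(f"No modality mapping for table {table_name!r}")
    return TABLE_TO_MODALITY[key]
```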

MTSamples

5,000 de-identified clinical transcriptions, Apache-2.0. Specialties mapped to modalities by note origin (dictated vs written vs imaging).

| Metric | Result |
|---|---|
| Mean risk, single modality | 0.09 |
| Mean risk, 3+ modalities same encounter | 0.71 |
| Cross-modal trigger rate (multi-specialty) | 34.2% |
| False trigger rate (single modality only) | 0.0% |

Persistent state

scorer = DCPGStandalone(db_path="./phi_risk_state.db")
scorer.add_event("patient_001", "text", "clinical note")
scorer.close()

# next session
scorer = DCPGStandalone(db_path="./phi_risk_state.db")
print(scorer.get_risk("patient_001"))  # accumulated risk preserved

Comparison

| Model | Stateful | Multimodal | Cross-modal trigger | Pseudonym rotation |
|---|---|---|---|---|
| obi/deid_bert_i2b2 | No | No | No | No |
| stanford-deidentifier-base | No | No | No | No |
| Philter | No | No | No | No |
| DCPG (this model) | Yes | Yes | Yes | Yes |

Federated use

Graph merging across federated nodes is supported via the CRDT module. See dcpg_crdt.py in the GitHub repo.
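To give a feel for what CRDT-style graph merging means, here is a minimal grow-only graph sketch. It is not the contents of dcpg_crdt.py, whose design is not described here. The class `GrowOnlyGraph`, its fields, and the choice of a grow-only (add-only) set CRDT are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrowOnlyGraph:
    """Illustrative state-based CRDT: node and edge sets that only grow."""

    nodes: frozenset = frozenset()
    edges: frozenset = frozenset()

    def add_edge(self, a: str, b: str) -> "GrowOnlyGraph":
        """Return a new graph with an undirected edge between a and b."""
        edge = frozenset((a, b))
        return GrowOnlyGraph(self.nodes | {a, b}, self.edges | {edge})

    def merge(self, other: "GrowOnlyGraph") -> "GrowOnlyGraph":
        """Merge by set union, which is commutative, associative, idempotent."""
        return GrowOnlyGraph(self.nodes | other.nodes, self.edges | other.edges)
```

Because the merge is a set union, two federated nodes can exchange state in any order, or repeatedly, and still converge to the same graph.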


Citation

@software{vkatg2026dcpg,
  author    = {Venkata Krishna Azith Teja Ganti},
  title     = {DCPG: Cross-Modal PHI Re-Identification Risk Scorer},
  year      = {2026},
  publisher = {Hugging Face},
  doi       = {10.5281/zenodo.18865882},
  url       = {https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer}
}

Disclaimer

Research artifact. Not a certified medical device. Not a substitute for a full compliance pipeline. Users are responsible for complying with HIPAA, GDPR, and any other applicable regulations.

The US provisional patent (filed 2025-07-05) covers the deployed system architecture. The underlying algorithms and risk scoring formulas are freely available for academic research, experimentation, and non-commercial use under the MIT license. If you are building on this work in a research context, no permission is needed beyond standard citation.
