DCPG: Cross-Modal PHI Re-Identification Risk Scorer

Patent pending: US provisional filed 2025-07-05
GitHub: phi-exposure-guard

The problem

Most PHI de-identification tools process one record at a time. A single clinical note might be low risk on its own. But if the same patient shows up in a text note, a voice transcript, and an imaging report within the same time window, the combined signal is enough to re-identify them.

No existing open-source tool tracks that. This one does.

What it does

DCPG keeps a running exposure state per patient across streaming events. As events arrive from different data modalities, it scores the cumulative re-identification risk and fires a pseudonym rotation when cross-modal co-occurrence pushes risk over a threshold.
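The per-patient running state described above can be pictured with a small sketch. This is not the library's internal implementation: the class name `WindowedExposureState`, its method signatures, and the one-hour default window are all hypothetical choices made for illustration, since the window length is not stated here. The sketch also assumes events arrive with non-decreasing timestamps.

```python
from collections import defaultdict, deque


class WindowedExposureState:
    """Hypothetical per-patient state: recent (timestamp, modality) events."""

    def __init__(self, window_seconds: float = 3600.0):
        # Assumed default; the actual co-occurrence window is not documented here.
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def add_event(self, patient_id: str, modality: str, timestamp: float) -> None:
        events = self._events[patient_id]
        events.append((timestamp, modality))
        # Drop events older than the window (assumes non-decreasing timestamps).
        while events and timestamp - events[0][0] > self.window_seconds:
            events.popleft()

    def modalities_in_window(self, patient_id: str) -> set:
        # Distinct modalities currently co-occurring for this patient.
        return {modality for _, modality in self._events.get(patient_id, ())}


state = WindowedExposureState(window_seconds=100)
state.add_event("patient_001", "text", 0)
state.add_event("patient_001", "asr", 50)
state.add_event("patient_001", "image_proxy", 120)  # the "text" event at t=0 expires
print(state.modalities_in_window("patient_001"))
```

The number of distinct modalities in the window is what a cross-modal bonus would be keyed on in a design like this.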

The trigger condition is a causal counterfactual check:

baseline_risk < threshold  AND  augmented_risk >= threshold  AND  L(t) > 0

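Here baseline_risk is the score without the cross-modal linkage bonus L(t), and augmented_risk is the score with it. The rotation fires only when L(t) is what pushes the score over the threshold. If accumulated exposure alone has already crossed the threshold, the condition fails, which avoids false rotations from exposure accumulation alone.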
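A minimal sketch of the trigger condition, under two assumptions: the augmented risk is the baseline risk plus L(t), clipped to [0, 1], and the threshold is supplied by the caller. The function name `should_rotate` and the example threshold of 0.6 are hypothetical and not taken from the library.

```python
def should_rotate(baseline_risk: float, linkage_bonus: float, threshold: float) -> bool:
    """Return True only when the linkage bonus causes the threshold crossing."""
    # Assumption: augmented risk = baseline + L(t), clipped to [0, 1].
    augmented_risk = min(1.0, max(0.0, baseline_risk + linkage_bonus))
    return (
        baseline_risk < threshold
        and augmented_risk >= threshold
        and linkage_bonus > 0
    )


THRESHOLD = 0.6  # illustrative value only

print(should_rotate(0.50, 0.20, THRESHOLD))  # bonus pushes risk over the threshold
print(should_rotate(0.65, 0.20, THRESHOLD))  # already over the threshold without the bonus
print(should_rotate(0.30, 0.20, THRESHOLD))  # still under the threshold with the bonus
```

Only the first call satisfies all three conditions. The second is the case the counterfactual check is designed to exclude.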

Supported modalities

Modality	PHI type
`text`	NAME_DATE_MRN_FACILITY
`asr`	NAME_DATE_MRN
`image_proxy`	FACE_IMAGE
`waveform_proxy`	WAVEFORM_HEADER
`audio_proxy`	VOICE
`image_link`	FACE_LINK
`audio_link`	VOICE_LINK

Install

pip install dcpg-standalone
pip install sentence-transformers  # optional, improves text embeddings

Usage

from dcpg_standalone import DCPGStandalone

scorer = DCPGStandalone()

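# Feed events for the same patient from three different modalities.
scorer.add_event("patient_001", "text", "Clinical note with identifiers")
scorer.add_event("patient_001", "asr", "Voice transcript from same encounter")
scorer.add_event("patient_001", "image_proxy", None)  # no text payload passed for this event

print(scorer.get_risk("patient_001"))               # cumulative risk for the patient
print(scorer.get_pseudonym_version("patient_001"))  # current pseudonym version
print(scorer.full_report("patient_001"))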

Risk formula

R(t) = clip(0.8 * U(t) + 0.2 * D(t) + L(t), 0, 1)

U(t) = 1 - exp(-0.05 * effective_units)
D(t) = 0.5 ^ (age_seconds / half_life)
L(t) = 0.20 for 2-modality co-occurrence, 0.30 for 3+, 0 otherwise

The 0.8/0.2 weight split was selected by grid search over the MIMIC-III Demo cohort (100 ICU patients), minimizing false trigger rate while maximizing recall of genuine multi-modal co-occurrence windows.
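The risk formula above can be evaluated directly. The sketch below is a plain transcription of the stated equations, not the library's code. The function names are hypothetical, `half_life` is left as a parameter because its value is not given here, and the 3600-second figure in the example is purely illustrative.

```python
import math


def linkage_bonus(n_modalities: int) -> float:
    """L(t): 0.20 for two co-occurring modalities, 0.30 for three or more."""
    if n_modalities >= 3:
        return 0.30
    if n_modalities == 2:
        return 0.20
    return 0.0


def risk(effective_units: float, age_seconds: float,
         half_life: float, n_modalities: int) -> float:
    """R(t) = clip(0.8 * U(t) + 0.2 * D(t) + L(t), 0, 1)."""
    u = 1.0 - math.exp(-0.05 * effective_units)  # U(t), saturates toward 1
    d = 0.5 ** (age_seconds / half_life)         # D(t), halves every half_life
    raw = 0.8 * u + 0.2 * d + linkage_bonus(n_modalities)
    return min(1.0, max(0.0, raw))


# Same exposure, with and without the cross-modal bonus (half_life is illustrative).
print(risk(effective_units=10, age_seconds=0, half_life=3600, n_modalities=1))
print(risk(effective_units=10, age_seconds=0, half_life=3600, n_modalities=3))
```

Because U(t) and D(t) are each bounded by 1 and their weights sum to 1, the score without L(t) never exceeds 1. The clip only matters once the linkage bonus is added.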

Benchmark results

MIMIC-III Clinical Database Demo

100 ICU patients, open access via PhysioNet (ODbL license). Each structured table was mapped to a DCPG modality:

Table	Modality
CHARTEVENTS	`waveform_proxy`
LABEVENTS, PRESCRIPTIONS	`text`
ADMISSIONS, TRANSFERS	`asr`
MICROBIOLOGYEVENTS	`image_proxy`
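The table-to-modality mapping above can be written as a lookup. This is a sketch of how a benchmark loader might apply it, not the actual benchmark script. The names `TABLE_TO_MODALITY` and `distinct_modalities` are hypothetical.

```python
# Mapping copied from the table above.
TABLE_TO_MODALITY = {
    "CHARTEVENTS": "waveform_proxy",
    "LABEVENTS": "text",
    "PRESCRIPTIONS": "text",
    "ADMISSIONS": "asr",
    "TRANSFERS": "asr",
    "MICROBIOLOGYEVENTS": "image_proxy",
}


def distinct_modalities(table_names):
    """Return the set of DCPG modalities covered by a patient's source tables."""
    return {TABLE_TO_MODALITY[name.upper()] for name in table_names}


# Two tables that map to the same modality count once.
print(distinct_modalities(["LABEVENTS", "PRESCRIPTIONS"]))
print(len(distinct_modalities(["CHARTEVENTS", "LABEVENTS", "ADMISSIONS"])))
```

Several tables collapse to one modality, so a patient with rows in many tables does not necessarily reach the 3-modality bonus.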

Metric	Result
Mean risk after full patient timeline	0.84
Patients triggering pseudonym rotation	89 / 100
Mean events before first rotation	12.3
Patients with 3-modality bonus fired	71 / 100

MTSamples

5,000 de-identified clinical transcriptions, Apache-2.0. Specialties mapped to modalities by note origin (dictated vs written vs imaging).

Metric	Result
Mean risk, single modality	0.09
Mean risk, 3+ modalities same encounter	0.71
Cross-modal trigger rate (multi-specialty)	34.2%
False trigger rate (single modality only)	0.0%

Persistent state

scorer = DCPGStandalone(db_path="./phi_risk_state.db")
scorer.add_event("patient_001", "text", "clinical note")
scorer.close()

# next session
scorer = DCPGStandalone(db_path="./phi_risk_state.db")
print(scorer.get_risk("patient_001"))  # accumulated risk preserved

Comparison

Model	Stateful	Multimodal	Cross-modal trigger	Pseudonym rotation
obi/deid_bert_i2b2	No	No	No	No
stanford-deidentifier-base	No	No	No	No
Philter	No	No	No	No
DCPG (this model)	Yes	Yes	Yes	Yes

Federated use

Graph merging across federated nodes is supported via the CRDT module. See dcpg_crdt.py in the GitHub repo.
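For readers unfamiliar with CRDTs, the sketch below shows the general idea of a conflict-free merge using a grow-only counter keyed by node, patient, and modality. It is not the contents of dcpg_crdt.py and does not model the graph structure. The key layout and function names are assumptions chosen to illustrate why such merges are safe to apply in any order.

```python
def merge(a: dict, b: dict) -> dict:
    """State-based G-Counter merge: element-wise max over the union of keys."""
    return {key: max(a.get(key, 0), b.get(key, 0)) for key in a.keys() | b.keys()}


def total_events(state: dict, patient_id: str) -> int:
    """Sum per-node counts for one patient across all modalities."""
    return sum(
        count
        for (_node, patient, _modality), count in state.items()
        if patient == patient_id
    )


# Hypothetical key layout: (node_id, patient_id, modality) -> event count seen by that node.
node_a = {("node_a", "patient_001", "text"): 3}
node_b = {("node_b", "patient_001", "asr"): 2, ("node_a", "patient_001", "text"): 1}

merged = merge(node_a, node_b)
print(merged == merge(node_b, node_a))  # merge order does not matter
print(merge(merged, node_a) == merged)  # re-applying a merge changes nothing
print(total_events(merged, "patient_001"))
```

Each node only increments its own keys, so taking the maximum never loses an update, and repeated or reordered synchronization converges to the same state.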

Citation

@software{vkatg2026dcpg,
  author    = {Venkata Krishna Azith Teja Ganti},
  title     = {DCPG: Cross-Modal PHI Re-Identification Risk Scorer},
  year      = {2026},
  publisher = {Hugging Face},
  doi       = {10.5281/zenodo.18865882},
  url       = {https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer}
}

Disclaimer

Research artifact. Not a certified medical device. Not a substitute for a full compliance pipeline. Users are responsible for HIPAA, GDPR, and any other applicable regulations.

The US provisional patent (filed 2025-07-05) covers the deployed system architecture. The underlying algorithms and risk scoring formulas are freely available for academic research, experimentation, and non-commercial use under the MIT license. If you are building on this work in a research context, no permission is needed beyond standard citation.
