---
tags:
- gliner2
- ner
- data-mention-extraction
- lora
- two-pass-hybrid
base_model: fastino/gliner2-large-v1
library_name: gliner2
license: apache-2.0
---

# GLiNER2 Data Mention Extractor — datause-extraction

Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from development economics and humanitarian research documents.

## Architecture: Two-Pass Hybrid

- **Pass 1** (`extract_entities`): Finds all data-mention spans using three entity types (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
- **Pass 2** (`extract_json`): Classifies each span individually (count=1).

## Entity Types

- `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
- `descriptive_mention`: Described data with identifying detail but no formal name
- `vague_mention`: Generic data references with minimal identifying detail

## Classification Fields

- `typology_tag`: survey / census / administrative / database / indicator / geospatial / microdata / report / other
- `is_used`: True / False
- `usage_context`: primary / supporting / background

## Installation

```bash
pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
```

## Usage

```python
from gliner2 import GLiNER2
import re

extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
extractor.load_adapter("ai4data/datause-extraction")

ENTITY_SCHEMA = {
    "entities": ["named_mention", "descriptive_mention", "vague_mention"],
    "entity_descriptions": {
        "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
        "descriptive_mention": "A described data reference with identifying detail but no formal name.",
        "vague_mention": "A generic or loosely specified reference to data.",
    },
}

def extract_sentence_context(text, char_start, char_end, margin=1):
    """Return the sentence containing the span plus `margin` neighboring sentences on each side."""
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

json_schema = (
    extractor.create_schema()
    .structure("data_mention")
    .field("mention_name", dtype="str")
    .field("typology_tag", dtype="str", choices=["survey", "census", "administrative", "database", "indicator", "geospatial", "microdata", "report", "other"])
    .field("is_used", dtype="str", choices=["True", "False"])
    .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
)

text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."

# Pass 1 — span detection
pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
entities = pass1.get("entities", {})

# Pass 2 — classification per span
results = []
for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
    for span in entities.get(etype, []):
        # Spans may come back as dicts (with offsets) or bare strings.
        if isinstance(span, dict):
            mention_text = span.get("text", "")
            char_start = span.get("start", text.find(mention_text))
            char_end = span.get("end", char_start + len(mention_text))
        else:
            mention_text = span
            char_start = text.find(mention_text)
            char_end = char_start + len(mention_text)
        context = extract_sentence_context(text, char_start, char_end)
        tags = extractor.extract(context, json_schema)
        tag = (tags.get("data_mention") or [{}])[0]
        results.append({
            "mention_name": mention_text,
            "specificity": etype.replace("_mention", ""),
            "typology": tag.get("typology_tag"),
            "is_used": tag.get("is_used"),
            "usage_context": tag.get("usage_context"),
        })

for r in results:
    print(r)
```
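The sentence-window helper in the usage example is plain Python and can be sanity-checked without loading the model. A minimal standalone sketch (the function is repeated here so the snippet runs on its own; the sample `text` is illustrative, not from the model's training data):

```python
import re

def extract_sentence_context(text, char_start, char_end, margin=1):
    """Return the sentence containing [char_start, char_end), plus `margin`
    neighboring sentences on each side, using simple punctuation splitting."""
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    # Fallback: span not located in any sentence window
    return text

text = "First sentence here. The DHS 2018 is cited. Final sentence follows."
start = text.find("DHS 2018")

# margin=0 keeps only the sentence containing the span
print(extract_sentence_context(text, start, start + len("DHS 2018"), margin=0))
# → "The DHS 2018 is cited."

# margin=1 (the default used in Pass 2) widens to one sentence on each side
print(extract_sentence_context(text, start, start + len("DHS 2018"), margin=1))
```

Note the splitter is deliberately lightweight: it treats `.!?` followed by whitespace as sentence boundaries, so abbreviations like "U.S. data" will over-split; for Pass 2 classification this is usually acceptable since the window only needs enough surrounding context for the tagger.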