---
tags:
- gliner2
- ner
- data-mention-extraction
- lora
- two-pass-hybrid
base_model: fastino/gliner2-large-v1
library_name: gliner2
license: apache-2.0
---

# GLiNER2 Data Mention Extractor — datause-extraction

A fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from development economics and humanitarian research documents.

## Architecture: Two-Pass Hybrid

- **Pass 1** (`extract_entities`): finds ALL data mention spans using 3 entity types (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses `count_pred` entirely.
- **Pass 2** (`extract_json`): classifies each span individually (count=1).

## Entity Types

- `named_mention`: proper names and acronyms (DHS, LSMS, FAOSTAT)
- `descriptive_mention`: described data with identifying detail but no formal name
- `vague_mention`: generic data references with minimal identifying detail
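As a quick illustration (these snippets are invented for this card, not drawn from the training data), mentions of each type might look like:

```python
# Invented example snippets mapped to the entity type each would fall
# under, following the definitions above.
EXAMPLES = {
    "named_mention": "the 2018 DHS",
    "descriptive_mention": "a 2021 household survey of 4,000 smallholder farmers",
    "vague_mention": "recent data on food insecurity",
}

for etype, snippet in EXAMPLES.items():
    print(f"{etype}: {snippet}")
```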

## Classification Fields

- `typology_tag`: survey / census / administrative / database / indicator / geospatial / microdata / report / other
- `is_used`: True / False
- `usage_context`: primary / supporting / background
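Downstream code can validate Pass 2 output against these label sets. A minimal sketch (field names and allowed values are from this card; the example record and the `validate` helper are invented):

```python
# Allowed values for each classification field, as listed above.
ALLOWED = {
    "typology_tag": {"survey", "census", "administrative", "database",
                     "indicator", "geospatial", "microdata", "report", "other"},
    "is_used": {"True", "False"},
    "usage_context": {"primary", "supporting", "background"},
}

def validate(record: dict) -> bool:
    """Return True if every classification field holds an allowed value."""
    return all(record.get(field) in allowed for field, allowed in ALLOWED.items())

# Hypothetical Pass 2 output for one span:
print(validate({"typology_tag": "survey", "is_used": "True", "usage_context": "primary"}))  # True
```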

## Installation

```bash
pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
```

## Usage

```python
import re

from gliner2 import GLiNER2

# Load the base model, then the LoRA adapter.
extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
extractor.load_adapter("ai4data/datause-extraction")

ENTITY_SCHEMA = {
    "entities": ["named_mention", "descriptive_mention", "vague_mention"],
    "entity_descriptions": {
        "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
        "descriptive_mention": "A described data reference with identifying detail but no formal name.",
        "vague_mention": "A generic or loosely specified reference to data.",
    },
}

def extract_sentence_context(text, char_start, char_end, margin=1):
    """Return the sentence containing the span plus `margin` sentences on each side."""
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

json_schema = (
    extractor.create_schema()
    .structure("data_mention")
    .field("mention_name", dtype="str")
    .field("typology_tag", dtype="str", choices=["survey", "census", "administrative", "database", "indicator", "geospatial", "microdata", "report", "other"])
    .field("is_used", dtype="str", choices=["True", "False"])
    .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
)

text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."

# Pass 1 — span detection
pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
entities = pass1.get("entities", {})

# Pass 2 — classify each detected span against its sentence-level context
results = []
for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
    for span in entities.get(etype, []):
        # Spans may come back as dicts (with offsets) or bare strings.
        if isinstance(span, dict):
            mention_text = span.get("text", "")
            char_start = span.get("start", text.find(mention_text))
            char_end = span.get("end", char_start + len(mention_text))
        else:
            mention_text = span
            char_start = text.find(mention_text)
            char_end = char_start + len(mention_text)
        context = extract_sentence_context(text, char_start, char_end)
        tags = extractor.extract(context, json_schema)
        tag = (tags.get("data_mention") or [{}])[0]
        results.append({
            "mention_name": mention_text,
            "specificity": etype.replace("_mention", ""),
            "typology": tag.get("typology_tag"),
            "is_used": tag.get("is_used"),
            "usage_context": tag.get("usage_context"),
        })

for r in results:
    print(r)
```
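Since `extract_sentence_context` is plain Python, it can be sanity-checked without loading the model. A standalone sketch (the helper is copied from the usage example above; the sample text is invented, and `margin=0` restricts the context to the single containing sentence):

```python
import re

def extract_sentence_context(text, char_start, char_end, margin=1):
    # Same helper as in the usage example: split on sentence-ending
    # punctuation and return the containing sentence +/- `margin` neighbors.
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

text = "First sentence. The DHS 2018 is cited here. Last sentence."
start = text.find("DHS 2018")
ctx = extract_sentence_context(text, start, start + len("DHS 2018"), margin=0)
print(ctx)  # The DHS 2018 is cited here.
```

With the default `margin=1`, the same call would also include the neighboring sentences on either side.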