rafmacalaba committed on
Commit 96b4e8b · verified · 1 Parent(s): a622a01

Mirror rafmacalaba/gliner2-datause-large-v1-hybrid-entities → ai4data/datause-extraction (two-pass hybrid adapter)

Files changed (3):
  1. README.md +57 -146
  2. adapter_config.json +2 -5
  3. adapter_weights.safetensors +2 -2
README.md CHANGED
@@ -1,175 +1,86 @@
 ---
-language:
-- en
-license: apache-2.0
 tags:
 - gliner2
 - ner
-- dataset-extraction
 - lora
-- world-bank
 base_model: fastino/gliner2-large-v1
 library_name: gliner2
-pipeline_tag: token-classification
-datasets:
-- rafmacalaba/datause-v8
-model-index:
-- name: datause-extraction
-  results:
-  - task:
-      type: token-classification
-      name: Dataset Mention Extraction
-    metrics:
-    - type: f1
-      value: 84.8
-      name: F1 (max_tokens=512)
-    - type: precision
-      value: 90.0
-      name: Precision
-    - type: recall
-      value: 80.2
-      name: Recall
 ---
 
-# Dataset Use Extraction
-
-A fine-tuned [GLiNER2](https://huggingface.co/fastino/gliner2-large-v1) adapter for extracting structured dataset mentions from research documents and policy papers.
-
-Developed as part of the **AI for Data—Data for AI** program, a collaboration between the **World Bank** and **UNHCR**, to monitor and measure data use across development research.
-
-## Overview
-
-This model identifies and extracts structured information about datasets mentioned in text, including formal survey names, descriptive data references, and vague data allusions. It extracts rich metadata for each mention including the dataset name, acronym, producer, geography, data type, and usage context.
-
-## Performance
-
-Evaluated on a held-out test set of 199 annotated text passages:
-
-| Metric | Score |
-|---|---|
-| **F1** | **84.8%** |
-| Precision | 90.0% |
-| Recall | 80.2% |
-
-### Performance by mention type
-
-| Tag | Total | Found | Recall |
-|---|---|---|---|
-| Named | 394 | 317 | 80.5% |
-| Descriptive | 135 | 108 | 80.0% |
-| Vague | 87 | 70 | 80.5% |
-
-## Extracted Fields
-
-For each dataset mention, the model extracts up to 13 structured fields:
-
-| Field | Type | Description |
-|---|---|---|
-| `dataset_name` | string | Name or description of the dataset |
-| `acronym` | string | Abbreviation (e.g., "DHS", "LSMS") |
-| `author` | string | Individual author(s) |
-| `producer` | string | Organization that created the dataset |
-| `publication_year` | string | Year published |
-| `reference_year` | string | Year data was collected |
-| `reference_population` | string | Target population |
-| `geography` | string | Geographic coverage |
-| `description` | string | Content description |
-| `data_type` | choice | survey, census, database, administrative, indicator, geospatial, microdata, report, other |
-| `dataset_tag` | choice | named, descriptive, vague |
-| `usage_context` | choice | primary, supporting, background |
-| `is_used` | choice | True, False |
 
 ## Usage
 
-### With `ai4data` library (recommended)
-
-```bash
-pip install git+https://github.com/rafmacalaba/monitoring_of_datause.git
-```
-
-```python
-from ai4data import extract_from_text, extract_from_document
-
-# Extract from text
-text = """We use the Demographic and Health Survey (DHS) from 2020 as our
-primary data source to analyze outcomes in Ghana. For robustness checks,
-we also reference the Ghana Living Standard Survey (GLSS) from 2012."""
-
-results = extract_from_text(text)
-for ds in results["datasets"]:
-    print(f"  {ds['dataset_name']} [{ds['dataset_tag']}]")
-
-# Extract from PDF (URL or local file)
-url = "https://documents1.worldbank.org/curated/en/.../report.pdf"
-results = extract_from_document(url)
-```
-
-### With GLiNER2 directly
-
 ```python
 from gliner2 import GLiNER2
-from huggingface_hub import snapshot_download
-
-# Load base model + adapter
-model = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
-adapter_path = snapshot_download("ai4data/datause-extraction")
-model.load_adapter(adapter_path)
-
-# Define extraction schema
-schema = (
-    model.create_schema()
-    .structure("dataset_mention")
-    .field("dataset_name", dtype="str")
-    .field("acronym", dtype="str")
-    .field("producer", dtype="str")
-    .field("geography", dtype="str")
-    .field("description", dtype="str")
-    .field("data_type", dtype="str",
-           choices=["survey", "census", "database", "administrative",
-                    "indicator", "geospatial", "microdata", "report", "other"])
-    .field("dataset_tag", dtype="str",
-           choices=["named", "descriptive", "vague"])
-    .field("usage_context", dtype="str",
-           choices=["primary", "supporting", "background"])
-    .field("is_used", dtype="str", choices=["True", "False"])
-)
-
-results = model.extract(text, schema)
-for mention in results["dataset_mention"]:
-    print(mention)
-```
-
-## Training Details
-
-- **Base model**: [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa-v3-large encoder)
-- **Method**: LoRA (r=16, alpha=32)
-- **Training data**: ~3,400 synthetic examples (v8 dataset) generated with GPT-4o and Gemini 2.5 Flash
-- **Max context**: 512 tokens (aligned with DeBERTa-v3 position embeddings)
-- **Data format**: Context-aware passages with markdown formatting, footnotes, and structured annotations
-
-## Limitations
-
-- Optimized for English-language research documents and policy papers
-- Best suited for World Bank-style development research documents
-- May not generalize well to non-research text (news articles, social media, etc.)
-- Requires the `fastino/gliner2-large-v1` base model
-
-## Citation
-
-If you use this model, please cite:
-
-```bibtex
-@misc{ai4data-datause-extraction,
-  title={Dataset Use Extraction Model},
-  author={AI for Data—Data for AI},
-  year={2025},
-  publisher={Hugging Face},
-  url={https://huggingface.co/ai4data/datause-extraction}
 }
 ```
-
-## Links
-
-- **Library**: [ai4data](https://github.com/rafmacalaba/monitoring_of_datause)
-- **Base model**: [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1)
-- **Program**: [AI for Data—Data for AI](https://www.worldbank.org/en/programs/ai4data) (World Bank & UNHCR)
 ---
 tags:
 - gliner2
 - ner
+- data-mention-extraction
 - lora
+- two-pass-hybrid
 base_model: fastino/gliner2-large-v1
 library_name: gliner2
 ---
 
+# GLiNER2 Data Mention Extractor (v1-hybrid-entities)
+
+A fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
+development economics and humanitarian research documents.
+
+## Architecture: Two-Pass Hybrid
+
+This adapter uses a **two-pass** inference strategy to bypass the count_pred/count_embed
+mode collapse that limits native `extract_json` to one mention per chunk:
+
+- **Pass 1** (`extract_entities`): finds all data-mention spans using three entity types
+  (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
+- **Pass 2** (`extract_json`): classifies each span individually using sentence-level context.
+  count=1 is always correct, since each call contains exactly one mention.
+
+See `finetuning/ARCHITECTURE.md` for the full rationale.
+
+## Task
+
+Given a document passage, the model extracts structured information about each dataset mentioned:
+
+- **Entity types** (Pass 1 — span detection):
+  - `named_mention`: proper names and acronyms (DHS, LSMS, FAOSTAT)
+  - `descriptive_mention`: described data with identifying detail but no formal name
+  - `vague_mention`: generic data references with minimal identifying detail
+- **Classification fields** (Pass 2 — fixed choices):
+  - `typology_tag`: survey / census / database / administrative / indicator / geospatial / microdata / report / other
+  - `is_used`: True / False
+  - `usage_context`: primary / supporting / background
+
+## Training
+
+- **Base model**: `fastino/gliner2-large-v1`
+- **Method**: LoRA (r=16, alpha=32.0)
+- **Target modules**: `['encoder', 'span_rep']`
+- **Training examples**: 8087
+- **Val examples**: 563
+- **Best val loss**: None
 
 ## Usage
 
 ```python
 from gliner2 import GLiNER2
 
+# Install the patched library first:
+# pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
+
+extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
+extractor.load_adapter("rafmacalaba/gliner2-datause-large-v1-hybrid-entities")
+
+# Pass 1: extract all mention spans
+entity_schema = {
+    "entities": ["named_mention", "descriptive_mention", "vague_mention"],
+    "entity_descriptions": {
+        "named_mention": "A proper name or well-known acronym for a data source...",
+        "descriptive_mention": "A described data reference with enough detail...",
+        "vague_mention": "A generic or loosely specified reference to data...",
+    },
+}
+spans = extractor.extract(text, entity_schema, threshold=0.3)
+
+# Pass 2: classify each span
+json_schema = {
+    "data_mention": {
+        "mention_name": "",
+        "typology_tag": {"choices": ["survey", "census", "administrative", "database",
+                                     "indicator", "geospatial", "microdata", "report", "other"]},
+        "is_used": {"choices": ["True", "False"]},
+        "usage_context": {"choices": ["primary", "supporting", "background"]},
+    },
 }
+for span in spans.get("named_mention", []):
+    context = extract_sentence_context(text, span)
+    tags = extractor.extract(context, json_schema)
 ```
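The new usage snippet calls `extract_sentence_context` and reads a `text` variable, neither of which the README defines. A minimal sketch of such a helper, assuming Pass 1 returns spans carrying character offsets (a `"start"` key on a dict, or the first element of a tuple — both are assumptions, not a documented GLiNER2 return shape), might look like:

```python
import re

def extract_sentence_context(text, span):
    """Return the sentence containing a detected span.

    Assumes `span` exposes a character start offset, either as
    span["start"] (dict) or span[0] (tuple). Sentence boundaries are
    found with a naive regex split on ., !, ? followed by whitespace.
    """
    start = span["start"] if isinstance(span, dict) else span[0]
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for left, right in zip(boundaries, boundaries[1:]):
        if left <= start < right:
            return text[left:right].strip()
    return text
```

A production version would use a proper sentence segmenter; the point is only that Pass 2 receives one mention's local context per call, so its count prediction is trivially correct.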
adapter_config.json CHANGED
@@ -2,14 +2,11 @@
   "adapter_type": "lora",
   "adapter_version": "1.0",
   "lora_r": 16,
-  "lora_alpha": 32,
+  "lora_alpha": 32.0,
   "lora_dropout": 0.1,
   "target_modules": [
-    "classifier",
-    "count_embed",
-    "count_pred",
     "encoder",
     "span_rep"
   ],
-  "created_at": "2026-02-26T03:10:27.735839Z"
+  "created_at": "2026-04-06T22:28:30.225894Z"
 }
adapter_weights.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c15dab49980c290057d22b78977d3fb03cfc8437314ef23b6a9a75175a49904f
-size 31758920
+oid sha256:651065028cbf29f7aa1cdb7dc3b85990189808be3c849e9e357030dbfa64c5d0
+size 30380176