---
tags:
- gliner2
- ner
- data-mention-extraction
- lora
- two-pass-hybrid
base_model: fastino/gliner2-large-v1
library_name: gliner2
license: apache-2.0
---
# GLiNER2 Data Mention Extractor — datause-extraction
Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.
## Architecture: Two-Pass Hybrid
- **Pass 1** (`extract_entities`): finds all data-mention spans using three entity types
  (`named_mention`, `descriptive_mention`, `vague_mention`), bypassing `count_pred` entirely.
- **Pass 2** (`extract_json`): classifies each detected span individually (`count=1`).
## Entity Types
- `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
- `descriptive_mention`: Described data with identifying detail but no formal name
- `vague_mention`: Generic data references with minimal identifying detail
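To make the three specificity levels concrete, here is an illustrative mapping (hand-labeled, not model output) of phrases one might find in a research passage:

```python
# Illustrative only: how the three specificity levels apply to
# example phrases. These labels are hand-assigned, not model output.
examples = {
    "DHS 2018": "named_mention",  # formal acronym plus year
    "administrative records from the National Statistics Office": "descriptive_mention",
    "some data on household income": "vague_mention",  # minimal identifying detail
}

for phrase, label in examples.items():
    print(f"{label}: {phrase}")
```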
## Classification Fields
- `typology_tag`: survey / census / administrative / database / indicator / geospatial / microdata / report / other
- `is_used`: True / False
- `usage_context`: primary / supporting / background
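Each extracted mention ends up as one record combining the Pass 1 entity type with the Pass 2 fields. The values below are hypothetical; the field names follow the Usage example in this card:

```python
# Hypothetical shape of one classified record. Values are illustrative;
# field names match the Usage example in this README.
record = {
    "mention_name": "DHS 2018",
    "specificity": "named",       # entity type with the "_mention" suffix stripped
    "typology": "survey",         # one of the typology_tag choices
    "is_used": "True",            # a string choice, not a Python bool
    "usage_context": "primary",
}

print(record)
```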
## Installation
```bash
pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
```
## Usage
```python
from gliner2 import GLiNER2
import re

extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
extractor.load_adapter("ai4data/datause-extraction")

ENTITY_SCHEMA = {
    "entities": ["named_mention", "descriptive_mention", "vague_mention"],
    "entity_descriptions": {
        "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
        "descriptive_mention": "A described data reference with identifying detail but no formal name.",
        "vague_mention": "A generic or loosely specified reference to data.",
    },
}

def extract_sentence_context(text, char_start, char_end, margin=1):
    """Return the sentence containing the span, plus `margin` sentences on each side."""
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

json_schema = (
    extractor.create_schema()
    .structure("data_mention")
    .field("mention_name", dtype="str")
    .field(
        "typology_tag",
        dtype="str",
        choices=["survey", "census", "administrative", "database", "indicator",
                 "geospatial", "microdata", "report", "other"],
    )
    .field("is_used", dtype="str", choices=["True", "False"])
    .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
)

text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."

# Pass 1 — detect all candidate spans.
pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
entities = pass1.get("entities", {})

# Pass 2 — classify each span individually, using its local sentence context.
results = []
for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
    for span in entities.get(etype, []):
        # Spans may come back as dicts (with offsets) or as plain strings.
        if isinstance(span, dict):
            mention_text = span.get("text", "")
            char_start = span.get("start", text.find(mention_text))
            char_end = span.get("end", char_start + len(mention_text))
        else:
            mention_text = span
            char_start = text.find(mention_text)
            char_end = char_start + len(mention_text)
        context = extract_sentence_context(text, char_start, char_end)
        tags = extractor.extract(context, json_schema)
        tag = (tags.get("data_mention") or [{}])[0]
        results.append({
            "mention_name": mention_text,
            "specificity": etype.replace("_mention", ""),
            "typology": tag.get("typology_tag"),
            "is_used": tag.get("is_used"),
            "usage_context": tag.get("usage_context"),
        })

for r in results:
    print(r)
```
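The `extract_sentence_context` helper is plain Python, so it can be sanity-checked without loading the model. The text and margin below are illustrative:

```python
import re

def extract_sentence_context(text, char_start, char_end, margin=1):
    # Sentence boundaries: start of text, after each ./!/? followed by
    # whitespace, and end of text.
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

text = "First sentence. Second sentence mentions DHS. Third sentence."
start = text.find("DHS")

# margin=0 returns just the sentence containing the span.
ctx = extract_sentence_context(text, start, start + 3, margin=0)
print(ctx)  # → Second sentence mentions DHS.
```

With the default `margin=1`, the same call would also include the neighboring sentences on either side.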