rafmacalaba committed
Commit 18addf6 · verified · 1 Parent(s): 45098c1

Mirror rafmacalaba/gliner2-datause-large-v1-deval-synth-v2 -> production

Files changed (3)
  1. README.md +120 -77
  2. adapter_config.json +4 -1
  3. adapter_weights.safetensors +2 -2
README.md CHANGED
@@ -1,104 +1,147 @@
---
tags:
- - gliner2
- ner
- data-mention-extraction
- lora
- - two-pass-hybrid
- base_model: fastino/gliner2-large-v1
- library_name: gliner2
- license: apache-2.0
---

- # GLiNER2 Data Mention Extractor — datause-extraction

Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.

- ## Architecture: Two-Pass Hybrid

- - **Pass 1** (`extract_entities`): Finds ALL data mention spans using 3 entity types
-   (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
- - **Pass 2** (`extract_json`): Classifies each span individually (count=1).

- ## Entity Types

- - `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
- - `descriptive_mention`: Described data with identifying detail but no formal name
- - `vague_mention`: Generic data references with minimal identifying detail

- ## Classification Fields

- - `typology_tag`: survey / census / administrative / database / indicator / geospatial / microdata / report / other
- - `is_used`: True / False
- - `usage_context`: primary / supporting / background

- ## Installation

- ```bash
- pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
```

- ## Usage
```python
- from gliner2 import GLiNER2
- import re
-
- extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
- extractor.load_adapter("ai4data/datause-extraction")
-
- ENTITY_SCHEMA = {
-     "entities": ["named_mention", "descriptive_mention", "vague_mention"],
-     "entity_descriptions": {
-         "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
-         "descriptive_mention": "A described data reference with identifying detail but no formal name.",
-         "vague_mention": "A generic or loosely specified reference to data.",
-     },
- }
-
- def extract_sentence_context(text, char_start, char_end, margin=1):
-     boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
-     for i in range(len(boundaries) - 1):
-         if boundaries[i] <= char_start < boundaries[i + 1]:
-             s = max(0, i - margin)
-             e = min(len(boundaries) - 1, i + margin + 1)
-             return text[boundaries[s]:boundaries[e]].strip()
-     return text
-
- json_schema = (
-     extractor.create_schema()
-     .structure("data_mention")
-     .field("mention_name", dtype="str")
-     .field("typology_tag", dtype="str", choices=["survey","census","administrative","database","indicator","geospatial","microdata","report","other"])
-     .field("is_used", dtype="str", choices=["True", "False"])
-     .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
)

- text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."

- # Pass 1 span detection
- pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
- entities = pass1.get("entities", {})

- # Pass 2 — classification per span
- results = []
- for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
-     for span in entities.get(etype, []):
-         mention_text = span.get("text", span) if isinstance(span, dict) else span
-         char_start = span.get("start", text.find(mention_text)) if isinstance(span, dict) else text.find(mention_text)
-         char_end = span.get("end", char_start + len(mention_text)) if isinstance(span, dict) else char_start + len(mention_text)
-         context = extract_sentence_context(text, char_start, char_end)
-         tags = extractor.extract(context, json_schema)
-         tag = (tags.get("data_mention") or [{}])[0]
-         results.append({
-             "mention_name": mention_text,
-             "specificity": etype.replace("_mention", ""),
-             "typology": tag.get("typology_tag"),
-             "is_used": tag.get("is_used"),
-             "usage_context": tag.get("usage_context"),
-         })
-
- for r in results:
-     print(r)
- ```
 
---
+ library_name: gliner2
+ license: mit
+ base_model: fastino/gliner2-large-v1
+ datasets:
+ - ai4data/datause-train
tags:
- ner
- data-mention-extraction
- lora
+ - gliner2
+ - development-economics
---

+ # datause-extraction

Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.

+ This is the production release of
+ [rafmacalaba/gliner2-datause-large-v1-deval-synth-v2](https://huggingface.co/rafmacalaba/gliner2-datause-large-v1-deval-synth-v2).

+ ## Task

+ Given a passage of text, the model identifies every data source mentioned and
+ classifies it across four dimensions:

+ | Field | Type | Values |
+ |---|---|---|
+ | `mention_name` | Extractive span | Verbatim text from the passage |
+ | `specificity_tag` | Classification | `named` / `descriptive` / `vague` |
+ | `typology_tag` | Classification | `survey` / `census` / `administrative` / `database` / `indicator` / `geospatial` / `microdata` / `report` / `other` |
+ | `is_used` | Classification | `True` / `False` |
+ | `usage_context` | Classification | `primary` / `supporting` / `background` |
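
For example, the sentence used in the snippet below ("We use the Demographic and Health Survey (DHS) 2020 as our primary data source.") would ideally yield a record like this; the values shown are illustrative, not actual model output:

```python
# Illustrative target record; values are hypothetical, not model output.
{
    "mention_name": "Demographic and Health Survey (DHS) 2020",
    "specificity_tag": "named",
    "typology_tag": "survey",
    "is_used": "True",
    "usage_context": "primary",
}
```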

+ ## Inference: Two-Pass Hybrid

+ This model uses a **two-pass** architecture; a single-pass structured
+ extraction will not produce correct results.

+ ```python
+ from gliner2 import GLiNER2
+ from huggingface_hub import snapshot_download
+
+ # Install the patched GLiNER2 library first:
+ # pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
+
+ BASE_MODEL = "fastino/gliner2-large-v1"
+ ADAPTER_ID = "ai4data/datause-extraction"
+
+ extractor = GLiNER2.from_pretrained(BASE_MODEL)
+ extractor.load_adapter(snapshot_download(ADAPTER_ID))
+ extractor.eval()
+
+ CLASSIFICATION_TASKS = {
+     "specificity_tag": ["named", "descriptive", "vague"],
+     "typology_tag": [
+         "survey", "census", "administrative", "database",
+         "indicator", "geospatial", "microdata", "report", "other",
+     ],
+     "is_used": ["True", "False"],
+     "usage_context": ["primary", "supporting", "background"],
+ }

+ text = "We use the Demographic and Health Survey (DHS) 2020 as our primary data source."

+ # Pass 1 — extract entity spans
+ entity_result = extractor.extract_entities(
+     text, ["data_mention"], threshold=0.3, include_confidence=True
+ )
+ spans = entity_result.get("entities", {}).get("data_mention", [])
+
+ # Pass 2 — classify each span using its context window
+ CONTEXT = 150
+ results = []
+ for span in spans:
+     mention = span.get("text", "")
+     start = text.find(mention)
+     ctx = text[max(0, start - CONTEXT) : start + len(mention) + CONTEXT]
+     context_str = f"Mention: {mention} | Context: {ctx}"
+
+     classes = extractor.classify_text(context_str, CLASSIFICATION_TASKS, threshold=0.3)
+     results.append({
+         "mention_name": mention,
+         "confidence": span.get("confidence", 0),
+         "specificity_tag": classes.get("specificity_tag", ("", 0))[0],
+         "typology_tag": classes.get("typology_tag", ("", 0))[0],
+         "is_used": classes.get("is_used", ("", 0))[0],
+         "usage_context": classes.get("usage_context", ("", 0))[0],
+     })
+
+ print(results)
```
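
The split matters: as the pre-mirror README put it, Pass 1 (`extract_entities`) finds all mention spans while bypassing the model's count prediction head (`count_pred`) entirely, and Pass 2 classifies each span individually against a local context window, so no single structured call is asked to both count and label mentions.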

+ ### Batch inference (recommended for documents)
+
```python
+ # Pass 1 — batched; `texts` is assumed to be a list of document strings
+ all_res_ent = extractor.batch_extract_entities(
+     texts, ["data_mention"], threshold=0.3, batch_size=8, include_confidence=True
+ )
+
+ # Build context strings for every extracted span, then Pass 2 — batched
+ classification_queue = []
+ for idx, (res_ent, text) in enumerate(zip(all_res_ent, texts)):
+     for span in res_ent.get("entities", {}).get("data_mention", []):
+         mention = span.get("text", "")
+         start = text.find(mention)
+         ctx = text[max(0, start - 150) : start + len(mention) + 150]
+         classification_queue.append((idx, mention, span.get("confidence", 0),
+                                      f"Mention: {mention} | Context: {ctx}"))
+
+ all_classes = extractor.batch_classify_text(
+     [q[3] for q in classification_queue],
+     CLASSIFICATION_TASKS,
+     threshold=0.3,
+     batch_size=8,
)
+ ```
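
The batched classify call returns one result per queued context string. A minimal sketch of folding those back into per-document records, assuming `batch_classify_text` preserves input order (which the queue construction above relies on):

```python
# Hypothetical merge step: align batched classifications with their source
# documents via the `idx` stored in classification_queue.
doc_results = [[] for _ in texts]
for (idx, mention, conf, _), classes in zip(classification_queue, all_classes):
    doc_results[idx].append({
        "mention_name": mention,
        "confidence": conf,
        **{task: classes.get(task, ("", 0))[0] for task in CLASSIFICATION_TASKS},
    })
```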

+ ## Training Details

+ | Property | Value |
+ |---|---|
+ | Base model | `fastino/gliner2-large-v1` |
+ | Method | LoRA (r=16, alpha=32.0) |
+ | Target modules | `encoder`, `span_rep`, `classifier`, `count_embed`, `count_pred` |
+ | Training examples | 8,791 |
+ | Validation examples | 651 |
+ | Best val loss | 439.45 |
+ | GLiNER2 branch | `rafmacalaba/GLiNER2@feat/main-mirror` |
+ | Training dataset | [ai4data/datause-train](https://huggingface.co/datasets/ai4data/datause-train) |

+ ## Evaluation

+ Evaluated on a 630-chunk human-annotated holdout set using Jaccard similarity
+ matching (threshold 0.5) at confidence threshold 0.30. Precision, recall, and
+ F1 scores are published in the
+ [DataUse Evaluation Hub](https://github.com/rafmacalaba/monitoring_of_datause).
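
A minimal sketch of what Jaccard-matched span scoring can look like, under the assumption of token-set Jaccard and greedy one-to-one matching (the authoritative scoring lives in the Evaluation Hub repo; the function names here are hypothetical):

```python
# Hypothetical sketch of Jaccard span matching; not the official eval script.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def span_prf(pred: list[str], gold: list[str], thresh: float = 0.5):
    used, tp = set(), 0
    for p in pred:                      # greedily match each prediction
        for i, g in enumerate(gold):    # to the first unused gold span
            if i not in used and jaccard(p, g) >= thresh:
                used.add(i)
                tp += 1
                break
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1
```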

+ ## Citation

+ If you use this model, please cite the monitoring_of_datause project.
 
adapter_config.json CHANGED
@@ -5,8 +5,11 @@
  "lora_alpha": 32.0,
  "lora_dropout": 0.1,
  "target_modules": [
    "encoder",
    "span_rep"
  ],
- "created_at": "2026-04-06T22:28:30.225894Z"
}

  "lora_alpha": 32.0,
  "lora_dropout": 0.1,
  "target_modules": [
+   "classifier",
+   "count_embed",
+   "count_pred",
    "encoder",
    "span_rep"
  ],
+ "created_at": "2026-04-06T13:46:19.060075Z"
}
adapter_weights.safetensors CHANGED
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
- oid sha256:651065028cbf29f7aa1cdb7dc3b85990189808be3c849e9e357030dbfa64c5d0
- size 30380176

version https://git-lfs.github.com/spec/v1
+ oid sha256:3f789f443becc3ec63f509d050e5a9e79072f25c25172b52e0d13e86cb496372
+ size 31758920