rafmacalaba committed
Commit 18addf6 · verified · 1 Parent(s): 45098c1

Mirror rafmacalaba/gliner2-datause-large-v1-deval-synth-v2 -> production

Files changed (3)
  1. README.md +120 -77
  2. adapter_config.json +4 -1
  3. adapter_weights.safetensors +2 -2
README.md CHANGED
@@ -1,104 +1,147 @@
---
tags:
- - gliner2
- ner
- data-mention-extraction
- lora
- - two-pass-hybrid
- base_model: fastino/gliner2-large-v1
- library_name: gliner2
- license: apache-2.0
---

- # GLiNER2 Data Mention Extractor — datause-extraction

Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.

- ## Architecture: Two-Pass Hybrid

- - **Pass 1** (`extract_entities`): Finds ALL data mention spans using 3 entity types
-   (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
- - **Pass 2** (`extract_json`): Classifies each span individually (count=1).

- ## Entity Types

- - `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
- - `descriptive_mention`: Described data with identifying detail but no formal name
- - `vague_mention`: Generic data references with minimal identifying detail

- ## Classification Fields

- - `typology_tag`: survey / census / administrative / database / indicator / geospatial / microdata / report / other
- - `is_used`: True / False
- - `usage_context`: primary / supporting / background

- ## Installation

- ```bash
- pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
```

- ## Usage
```python
- from gliner2 import GLiNER2
- import re
-
- extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
- extractor.load_adapter("ai4data/datause-extraction")
-
- ENTITY_SCHEMA = {
-     "entities": ["named_mention", "descriptive_mention", "vague_mention"],
-     "entity_descriptions": {
-         "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
-         "descriptive_mention": "A described data reference with identifying detail but no formal name.",
-         "vague_mention": "A generic or loosely specified reference to data.",
-     },
- }
-
- def extract_sentence_context(text, char_start, char_end, margin=1):
-     boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
-     for i in range(len(boundaries) - 1):
-         if boundaries[i] <= char_start < boundaries[i + 1]:
-             s = max(0, i - margin)
-             e = min(len(boundaries) - 1, i + margin + 1)
-             return text[boundaries[s]:boundaries[e]].strip()
-     return text
-
- json_schema = (
-     extractor.create_schema()
-     .structure("data_mention")
-     .field("mention_name", dtype="str")
-     .field("typology_tag", dtype="str", choices=["survey","census","administrative","database","indicator","geospatial","microdata","report","other"])
-     .field("is_used", dtype="str", choices=["True", "False"])
-     .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
)

- text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."

- # Pass 1 span detection
- pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
- entities = pass1.get("entities", {})

- # Pass 2 — classification per span
- results = []
- for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
-     for span in entities.get(etype, []):
-         mention_text = span.get("text", span) if isinstance(span, dict) else span
-         char_start = span.get("start", text.find(mention_text)) if isinstance(span, dict) else text.find(mention_text)
-         char_end = span.get("end", char_start + len(mention_text)) if isinstance(span, dict) else char_start + len(mention_text)
-         context = extract_sentence_context(text, char_start, char_end)
-         tags = extractor.extract(context, json_schema)
-         tag = (tags.get("data_mention") or [{}])[0]
-         results.append({
-             "mention_name": mention_text,
-             "specificity": etype.replace("_mention", ""),
-             "typology": tag.get("typology_tag"),
-             "is_used": tag.get("is_used"),
-             "usage_context": tag.get("usage_context"),
-         })
-
- for r in results:
-     print(r)
- ```
 
---
+ library_name: gliner2
+ license: mit
+ base_model: fastino/gliner2-large-v1
+ datasets:
+ - ai4data/datause-train
tags:
- ner
- data-mention-extraction
- lora
+ - gliner2
+ - development-economics
---

+ # datause-extraction

Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.

+ This is the production release of
+ [rafmacalaba/gliner2-datause-large-v1-deval-synth-v2](https://huggingface.co/rafmacalaba/gliner2-datause-large-v1-deval-synth-v2).

+ ## Task

+ Given a passage of text, the model identifies every data source mentioned and
+ classifies it across four dimensions:

+ | Field | Type | Values |
+ |---|---|---|
+ | `mention_name` | Extractive span | Verbatim text from the passage |
+ | `specificity_tag` | Classification | `named` / `descriptive` / `vague` |
+ | `typology_tag` | Classification | `survey` / `census` / `administrative` / `database` / `indicator` / `geospatial` / `microdata` / `report` / `other` |
+ | `is_used` | Classification | `True` / `False` |
+ | `usage_context` | Classification | `primary` / `supporting` / `background` |
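
For example, the sentence used in the snippet below ("We use the Demographic and Health Survey (DHS) 2020 as our primary data source.") would ideally yield a record like this; the values shown are illustrative, not actual model output:

```python
# Illustrative target record; values are hypothetical, not model output.
{
    "mention_name": "Demographic and Health Survey (DHS) 2020",
    "specificity_tag": "named",
    "typology_tag": "survey",
    "is_used": "True",
    "usage_context": "primary",
}
```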

+ ## Inference: Two-Pass Hybrid

+ This model uses a **two-pass** architecture; a single-pass structured
+ extraction will not produce correct results.

+ ```python
+ from gliner2 import GLiNER2
+ from huggingface_hub import snapshot_download
+
+ # Install the patched GLiNER2 library first:
+ # pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
+
+ BASE_MODEL = "fastino/gliner2-large-v1"
+ ADAPTER_ID = "ai4data/datause-extraction"
+
+ extractor = GLiNER2.from_pretrained(BASE_MODEL)
+ extractor.load_adapter(snapshot_download(ADAPTER_ID))
+ extractor.eval()
+
+ CLASSIFICATION_TASKS = {
+     "specificity_tag": ["named", "descriptive", "vague"],
+     "typology_tag": [
+         "survey", "census", "administrative", "database",
+         "indicator", "geospatial", "microdata", "report", "other",
+     ],
+     "is_used": ["True", "False"],
+     "usage_context": ["primary", "supporting", "background"],
+ }

+ text = "We use the Demographic and Health Survey (DHS) 2020 as our primary data source."

+ # Pass 1 — extract entity spans
+ entity_result = extractor.extract_entities(
+     text, ["data_mention"], threshold=0.3, include_confidence=True
+ )
+ spans = entity_result.get("entities", {}).get("data_mention", [])
+
+ # Pass 2 — classify each span using its context window
+ CONTEXT = 150
+ results = []
+ for span in spans:
+     mention = span.get("text", "")
+     start = text.find(mention)
+     ctx = text[max(0, start - CONTEXT) : start + len(mention) + CONTEXT]
+     context_str = f"Mention: {mention} | Context: {ctx}"
+
+     classes = extractor.classify_text(context_str, CLASSIFICATION_TASKS, threshold=0.3)
+     results.append({
+         "mention_name": mention,
+         "confidence": span.get("confidence", 0),
+         "specificity_tag": classes.get("specificity_tag", ("", 0))[0],
+         "typology_tag": classes.get("typology_tag", ("", 0))[0],
+         "is_used": classes.get("is_used", ("", 0))[0],
+         "usage_context": classes.get("usage_context", ("", 0))[0],
+     })
+
+ print(results)
```
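
The split matters: as the pre-mirror README put it, Pass 1 (`extract_entities`) finds all mention spans while bypassing the model's count prediction head (`count_pred`) entirely, and Pass 2 classifies each span individually against a local context window, so no single structured call is asked to both count and label mentions.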

+ ### Batch inference (recommended for documents)
+
```python
+ # Pass 1 — batched; `texts` is assumed to be a list of document strings
+ all_res_ent = extractor.batch_extract_entities(
+     texts, ["data_mention"], threshold=0.3, batch_size=8, include_confidence=True
+ )
+
+ # Build context strings for every extracted span, then Pass 2 — batched
+ classification_queue = []
+ for idx, (res_ent, text) in enumerate(zip(all_res_ent, texts)):
+     for span in res_ent.get("entities", {}).get("data_mention", []):
+         mention = span.get("text", "")
+         start = text.find(mention)
+         ctx = text[max(0, start - 150) : start + len(mention) + 150]
+         classification_queue.append((idx, mention, span.get("confidence", 0),
+                                      f"Mention: {mention} | Context: {ctx}"))
+
+ all_classes = extractor.batch_classify_text(
+     [q[3] for q in classification_queue],
+     CLASSIFICATION_TASKS,
+     threshold=0.3,
+     batch_size=8,
)
+ ```
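
The batched classify call returns one result per queued context string. A minimal sketch of folding those back into per-document records, assuming `batch_classify_text` preserves input order (which the queue construction above relies on):

```python
# Hypothetical merge step: align batched classifications with their source
# documents via the `idx` stored in classification_queue.
doc_results = [[] for _ in texts]
for (idx, mention, conf, _), classes in zip(classification_queue, all_classes):
    doc_results[idx].append({
        "mention_name": mention,
        "confidence": conf,
        **{task: classes.get(task, ("", 0))[0] for task in CLASSIFICATION_TASKS},
    })
```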

+ ## Training Details

+ | Property | Value |
+ |---|---|
+ | Base model | `fastino/gliner2-large-v1` |
+ | Method | LoRA (r=16, alpha=32.0) |
+ | Target modules | `encoder`, `span_rep`, `classifier`, `count_embed`, `count_pred` |
+ | Training examples | 8,791 |
+ | Validation examples | 651 |
+ | Best val loss | 439.45 |
+ | GLiNER2 branch | `rafmacalaba/GLiNER2@feat/main-mirror` |
+ | Training dataset | [ai4data/datause-train](https://huggingface.co/datasets/ai4data/datause-train) |

+ ## Evaluation

+ Evaluated on a 630-chunk human-annotated holdout set using Jaccard similarity
+ matching (threshold 0.5) at confidence threshold 0.30. Precision, recall, and
+ F1 scores are published in the
+ [DataUse Evaluation Hub](https://github.com/rafmacalaba/monitoring_of_datause).
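
A minimal sketch of what Jaccard-matched span scoring can look like, under the assumption of token-set Jaccard and greedy one-to-one matching (the authoritative scoring lives in the Evaluation Hub repo; the function names here are hypothetical):

```python
# Hypothetical sketch of Jaccard span matching; not the official eval script.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def span_prf(pred: list[str], gold: list[str], thresh: float = 0.5):
    used, tp = set(), 0
    for p in pred:                      # greedily match each prediction
        for i, g in enumerate(gold):    # to the first unused gold span
            if i not in used and jaccard(p, g) >= thresh:
                used.add(i)
                tp += 1
                break
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1
```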

+ ## Citation

+ If you use this model, please cite the monitoring_of_datause project.
 
adapter_config.json CHANGED
@@ -5,8 +5,11 @@
  "lora_alpha": 32.0,
  "lora_dropout": 0.1,
  "target_modules": [
    "encoder",
    "span_rep"
  ],
- "created_at": "2026-04-06T22:28:30.225894Z"
}

  "lora_alpha": 32.0,
  "lora_dropout": 0.1,
  "target_modules": [
+   "classifier",
+   "count_embed",
+   "count_pred",
    "encoder",
    "span_rep"
  ],
+ "created_at": "2026-04-06T13:46:19.060075Z"
}
adapter_weights.safetensors CHANGED
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
- oid sha256:651065028cbf29f7aa1cdb7dc3b85990189808be3c849e9e357030dbfa64c5d0
- size 30380176

version https://git-lfs.github.com/spec/v1
+ oid sha256:3f789f443becc3ec63f509d050e5a9e79072f25c25172b52e0d13e86cb496372
+ size 31758920