GLiNER2 Data Mention Extractor (v1-hybrid-entities)

Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from development economics and humanitarian research documents.

Architecture: Two-Pass Hybrid

This adapter uses a two-pass inference strategy to work around the count_pred/count_embed mode collapse, which limits native extract_json to one mention per chunk:

  • Pass 1 (extract_entities): Finds ALL data mention spans using 3 entity types (named_mention, descriptive_mention, vague_mention). Bypasses count_pred entirely.
  • Pass 2 (extract_json): Classifies each span individually using sentence-level context. count=1 is always correct since each call contains exactly 1 mention.

See finetuning/ARCHITECTURE.md for the full rationale.
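
The control flow of the two passes can be sketched independently of the model. This is a minimal illustration, not the adapter's actual implementation: `detect_spans` and `classify_span` are hypothetical callables standing in for the Pass 1 and Pass 2 model calls, and the `(start, end, entity_type)` span format is an assumption made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Assumed span format for this sketch: character offsets plus the Pass 1 label.
Span = Tuple[int, int, str]


def two_pass_extract(
    text: str,
    detect_spans: Callable[[str], List[Span]],
    classify_span: Callable[[str, Span], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Detect all spans once, then classify each span in its own call."""
    results = []
    # Pass 1: a single call returns every mention span in the passage.
    for span in detect_spans(text):
        start, end, entity_type = span
        record = {"mention_name": text[start:end], "entity_type": entity_type}
        # Pass 2: one call per span, so each call covers exactly one mention.
        record.update(classify_span(text, span))
        results.append(record)
    return results
```

Because Pass 2 is invoked once per span, the number of returned records always equals the number of spans found in Pass 1.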

Task

Given a document passage, the model extracts structured information about each dataset mentioned:

  • Entity types (Pass 1: span detection):
    • named_mention: Proper names and acronyms (DHS, LSMS, FAOSTAT)
    • descriptive_mention: Described data with identifying detail but no formal name
    • vague_mention: Generic data references with minimal identifying detail
  • Classification fields (Pass 2: fixed choices):
    • typology_tag: survey / census / database / administrative / indicator / geospatial / microdata / report / other
    • is_used: True / False
    • usage_context: primary / supporting / background
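
Since the Pass 2 fields are restricted to fixed choices, downstream code can check each classified record against them. The following is a small sketch of such a check; the `ALLOWED_CHOICES` table and the `invalid_fields` helper are illustrative names, not part of GLiNER2, and the record is assumed to be a flat dict of string values.

```python
# Allowed values per classification field, copied from the list above.
# is_used is kept as the strings "True"/"False" for this sketch.
ALLOWED_CHOICES = {
    "typology_tag": {
        "survey", "census", "database", "administrative", "indicator",
        "geospatial", "microdata", "report", "other",
    },
    "is_used": {"True", "False"},
    "usage_context": {"primary", "supporting", "background"},
}


def invalid_fields(record: dict) -> list:
    """Return the classification fields that are missing or hold a value outside the fixed choices."""
    bad = []
    for field, choices in ALLOWED_CHOICES.items():
        value = record.get(field)
        if not isinstance(value, str) or value not in choices:
            bad.append(field)
    return bad
```

An empty list means the record is consistent with the schema; any field names returned can be logged or sent for review.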

Training

  • Base model: fastino/gliner2-large-v1
  • Method: LoRA (r=16, alpha=32.0)
  • Target modules: ['encoder', 'span_rep']
  • Training examples: 8087
  • Val examples: 563
  • Best val loss: not recorded
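
In the standard LoRA formulation, the low-rank update is scaled by alpha / r, so the settings above correspond to a scaling factor of 2.0. The sketch below works through that arithmetic and the trainable parameter count for one adapted weight matrix. The 1024 x 1024 shape is purely illustrative and is not the actual layer size of the base model; that the GLiNER2 LoRA implementation applies exactly this scaling is also an assumption.

```python
def lora_scaling(r: int, alpha: float) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for one adapted matrix: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r


scaling = lora_scaling(r=16, alpha=32.0)
# Illustrative shape only; not taken from the base model.
adapter_params = lora_param_count(d_out=1024, d_in=1024, r=16)
full_params = 1024 * 1024
print(scaling, adapter_params, adapter_params / full_params)
```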

Usage

# Install the patched library first:
# pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror

from gliner2 import GLiNER2

extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
extractor.load_adapter("rafmacalaba/gliner2-datause-large-v1-hybrid-entities")

# Pass 1: Extract all mention spans
entity_schema = {
    "entities": ["named_mention", "descriptive_mention", "vague_mention"],
    "entity_descriptions": {
        "named_mention": "A proper name or well-known acronym for a data source...",
        "descriptive_mention": "A described data reference with enough detail...",
        "vague_mention": "A generic or loosely specified reference to data...",
    },
}
# `text` is the document passage to analyze (a plain string supplied by the caller).
spans = extractor.extract(text, entity_schema, threshold=0.3)

# Pass 2: Classify each span
json_schema = {
    "data_mention": {
        "mention_name": "",
        "typology_tag": {"choices": ["survey", "census", "administrative", "database",
                                     "indicator", "geospatial", "microdata", "report", "other"]},
        "is_used": {"choices": ["True", "False"]},
        "usage_context": {"choices": ["primary", "supporting", "background"]},
    },
}
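# extract_sentence_context is not defined in this snippet; it must return the
# sentence-level context around the span. Only named mentions are classified here;
# repeat the loop for descriptive_mention and vague_mention.
for span in spans.get("named_mention", []):
    context = extract_sentence_context(text, span)
    tags = extractor.extract(context, json_schema)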
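
The usage snippet calls extract_sentence_context without defining it. One possible stand-in is sketched below, under two assumptions that the source does not confirm: each span is the mention's surface string, and a naive punctuation-based sentence split is good enough. The `window` parameter is a hypothetical addition for including neighboring sentences.

```python
import re


def extract_sentence_context(text: str, span: str, window: int = 0) -> str:
    """Return the sentence containing the first occurrence of `span`,
    plus `window` sentences on each side."""
    # Naive split after ., ! or ? followed by whitespace; abbreviations
    # such as "e.g." will be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for i, sentence in enumerate(sentences):
        if span in sentence:
            lo = max(0, i - window)
            hi = min(len(sentences), i + window + 1)
            return " ".join(sentences[lo:hi])
    # Fall back to the whole passage if the span is not inside one sentence.
    return text


passage = "Poverty fell in 2019. We use the DHS for health outcomes. Results follow."
print(extract_sentence_context(passage, "DHS"))
```

A production pipeline would likely use a proper sentence segmenter and character offsets rather than string matching, so that repeated mentions resolve to the correct sentence.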