Arabic NER Model - Qwen2.5-0.5B Fine-tuned on Wojood Dataset

Model Description

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct for Arabic Named Entity Recognition (NER). It was trained on a sample of the Wojood dataset provided by SinaLab.

Dataset

Original Source: SinaLab/ArabicNER
Important: This dataset represents only a sample of the full Wojood dataset, as SinaLab has not released the complete dataset publicly.

Processed Dataset: AhmedNabil1/wojood-arabic-ner
The data has been processed and converted into JSON format, structured for fine-tuning on the NER task with consistent formatting and tokenization.

Supported Entity Types

PERS (Person), ORG (Organization), GPE (Geopolitical entities: countries, cities), LOC (Locations), DATE, TIME, CARDINAL, ORDINAL, PERCENT, MONEY, QUANTITY, EVENT, FAC (Facilities), NORP (Nationalities, religious/political groups), OCC (Occupations), LANGUAGE, WEBSITE, UNIT (Units of measurement), LAW (Legal documents), PRODUCT, CURR (Currencies)
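For programmatic use, the 21 labels above can be kept as a constant and used to validate model output. This is a small convenience sketch, not part of the released model:

```python
# The 21 entity labels supported by the model (Wojood tag set, as listed above).
WOJOOD_LABELS = frozenset({
    "PERS", "NORP", "OCC", "ORG", "GPE", "LOC", "FAC", "EVENT",
    "DATE", "TIME", "CARDINAL", "ORDINAL", "PERCENT", "LANGUAGE",
    "QUANTITY", "WEBSITE", "UNIT", "LAW", "MONEY", "PRODUCT", "CURR",
})

def is_valid_label(label: str) -> bool:
    """Check that a predicted entity type is one of the supported labels."""
    return label in WOJOOD_LABELS
```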

Training Details

Base Model: Qwen/Qwen2.5-0.5B-Instruct
Fine-tuned using Unsloth with QLoRA.
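A typical Unsloth QLoRA setup looks like the configuration sketch below. The rank, alpha, dropout, and target modules shown are illustrative assumptions, not the exact hyperparameters used to train this model:

```python
from unsloth import FastLanguageModel

# Illustrative QLoRA adapter configuration (hyperparameters are assumptions,
# not the values actually used for this checkpoint).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # LoRA rank
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```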

Usage

Installation

pip install torch transformers unsloth

Loading the Model

from unsloth import FastLanguageModel

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="AhmedNabil1/arabic_ner_qwen_model",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode
model = FastLanguageModel.for_inference(model)

Entity Extraction Function

# Define entity types and schema
import json  # needed below to serialize the Pydantic schema into the prompt
from typing import List, Literal

from pydantic import BaseModel, Field

EntityType = Literal[
    "PERS", "NORP", "OCC", "ORG", "GPE", "LOC", "FAC", "EVENT",
    "DATE", "TIME", "CARDINAL", "ORDINAL", "PERCENT", "LANGUAGE",
    "QUANTITY", "WEBSITE", "UNIT", "LAW", "MONEY", "PRODUCT", "CURR"
]

class NEREntity(BaseModel):
    entity_value: str = Field(..., description="The actual named entity found in the text.")
    entity_type: EntityType = Field(..., description="The entity type")

class NERData(BaseModel):
    story_entities: List[NEREntity] = Field(..., description="A list of entities found in the text.")

def extract_entities_from_story(story, model, tokenizer):
    """
    Extract named entities from Arabic text.
    This function demonstrates the recommended approach for optimal results.
    """
    entities_extraction_messages = [
        {
            "role": "system",
            "content": "\n".join([
                "You are an advanced NLP entity extraction assistant.",
                "Your task is to extract named entities from Arabic text according to a given Pydantic schema.",
                "Ensure that the extracted entities exactly match how they appear in the text, without modifications.",
                "Follow the schema strictly, maintaining the correct entity types and structure.",
                "Output the extracted entities in JSON format, structured according to the provided Pydantic schema.",
                "Do not add explanations, introductions, or extra text, Only return the formatted JSON output."
            ])
        },
        {
            "role": "user",
            "content": "\n".join([
                "## Text:",
                story.strip(),
                "",
                "## Pydantic Schema:",
                json.dumps(NERData.model_json_schema(), ensure_ascii=False, indent=2),
                "",
                "## Text Entities:",
                "```json"
            ])
        }
    ]

    # Apply chat template
    text = tokenizer.apply_chat_template(
        entities_extraction_messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Generate response
    # Move inputs to the same device as the model (e.g., CUDA)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=1024,
        do_sample=False,
    )
    
    # Decode response
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    return response
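Because the user prompt ends with an opening ```json fence, the model may emit a closing fence after the JSON payload. A small stdlib-only helper (an addition for convenience, not part of the original card) makes parsing robust to either form:

```python
import json

def parse_ner_response(response: str) -> dict:
    """Strip any surrounding Markdown code fences, then parse the JSON payload."""
    cleaned = response.strip()
    if cleaned.startswith("```json"):
        cleaned = cleaned[len("```json"):]
    elif cleaned.startswith("```"):
        cleaned = cleaned[len("```"):]
    if cleaned.endswith("```"):
        cleaned = cleaned[:-3]
    return json.loads(cleaned)
```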

Example Usage

# Example Arabic text
story = """
مضابط بلدية نابلس عام ( 1308 ) هجري مضبط رقم 435 .
"""

# Extract entities
response = extract_entities_from_story(story, model, tokenizer)
print(response)

# Parse the JSON response (strip a trailing ``` fence if the model emits one)
import json
entities = json.loads(response.strip().removesuffix("```"))
print(entities)

Output:

{
  "story_entities": [
    {"entity_value": "بلدية نابلس", "entity_type": "ORG"},
    {"entity_value": "نابلس", "entity_type": "GPE"},
    {"entity_value": "عام ( 1308 ) هجري", "entity_type": "DATE"},
    {"entity_value": "435", "entity_type": "ORDINAL"}
  ]
}
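Once parsed, the entity list is straightforward to post-process. For example, grouping the entities from the output above by type (a usage sketch, not part of the original card):

```python
from collections import defaultdict

# Parsed output from the example above.
entities = {
    "story_entities": [
        {"entity_value": "بلدية نابلس", "entity_type": "ORG"},
        {"entity_value": "نابلس", "entity_type": "GPE"},
        {"entity_value": "عام ( 1308 ) هجري", "entity_type": "DATE"},
        {"entity_value": "435", "entity_type": "ORDINAL"},
    ]
}

# Group entity values by their predicted type.
by_type = defaultdict(list)
for ent in entities["story_entities"]:
    by_type[ent["entity_type"]].append(ent["entity_value"])

print(dict(by_type))
```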

Model Performance

The model performs well on Arabic NER within the scope of the available training data. Because it was trained on a limited sample of the Wojood dataset, and that sample exhibits class imbalance across entity types, recognition accuracy may vary from one entity type to another.
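To quantify such per-type differences, a minimal exact-match evaluation can be sketched as follows, treating gold and predicted entities as (value, type) pairs. This is an illustration with stdlib only, not the evaluation protocol used by the Wojood authors:

```python
from collections import Counter

def per_type_precision_recall(gold, pred):
    """Exact-match precision/recall per entity type.

    gold, pred: iterables of (entity_value, entity_type) pairs.
    """
    gold, pred = list(gold), list(pred)
    gold_c, pred_c = Counter(gold), Counter(pred)
    tp = gold_c & pred_c  # multiset intersection = true positives
    types = {t for _, t in gold + pred}
    scores = {}
    for t in types:
        tp_t = sum(n for (_, ty), n in tp.items() if ty == t)
        p_t = sum(n for (_, ty), n in pred_c.items() if ty == t)
        g_t = sum(n for (_, ty), n in gold_c.items() if ty == t)
        scores[t] = {
            "precision": tp_t / p_t if p_t else 0.0,
            "recall": tp_t / g_t if g_t else 0.0,
        }
    return scores
```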

License

This model follows the license terms of the base Qwen2.5 model and the Wojood dataset.
