Arabic NER Model - Qwen2.5-0.5B Fine-tuned on Wojood Dataset
Model Description
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct for Arabic Named Entity Recognition (NER). It was trained on a sample of the Wojood dataset provided by SinaLab.
Dataset
Original Source: SinaLab/ArabicNER
Important: This dataset represents only a sample of the full Wojood dataset, as SinaLab has not released the complete dataset publicly.
Processed Dataset: AhmedNabil1/wojood-arabic-ner
The data has been processed and converted into a JSON format structured for NER fine-tuning, with consistent formatting and tokenization.
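To inspect the processed data, it can be loaded directly from the Hub. The sketch below assumes the Hugging Face datasets library and a "train" split; check the dataset card if the split names differ.
from datasets import load_dataset

# Load the processed Wojood sample (assumed "train" split; adjust if needed)
dataset = load_dataset("AhmedNabil1/wojood-arabic-ner", split="train")

# Inspect one record to see the JSON structure used for fine-tuning
print(dataset[0])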
Supported Entity Types
PERS (Person), ORG (Organization), GPE (Geopolitical entity: countries, cities), LOC (Location), DATE, TIME, CARDINAL, ORDINAL, PERCENT, MONEY, QUANTITY, EVENT, FAC (Facility), NORP (Nationality, religious/political group), OCC (Occupation), LANGUAGE, WEBSITE, UNIT (Unit of measurement), LAW (Legal document), PRODUCT, CURR (Currency)
Training Details
Base Model: Qwen/Qwen2.5-0.5B-Instruct
Fine-tuned using Unsloth with QLoRA.
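The exact training hyperparameters are not listed here; the snippet below is only a minimal sketch of a typical Unsloth + QLoRA setup, where the LoRA rank, alpha, and target modules are illustrative assumptions rather than the values actually used.
from unsloth import FastLanguageModel

# Load the base model in 4-bit for QLoRA fine-tuning
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Attach LoRA adapters (rank, alpha, and target modules are assumed, not the exact training config)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing=True,
)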
Usage
Installation
pip install torch transformers unsloth pydantic
Loading the Model
from unsloth import FastLanguageModel
# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="AhmedNabil1/arabic_ner_qwen_model",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
# Enable inference mode
model = FastLanguageModel.for_inference(model)
Entity Extraction Function
# Define entity types and schema
import json

from pydantic import BaseModel, Field
from typing import List, Literal

EntityType = Literal[
    "PERS", "NORP", "OCC", "ORG", "GPE", "LOC", "FAC", "EVENT",
    "DATE", "TIME", "CARDINAL", "ORDINAL", "PERCENT", "LANGUAGE",
    "QUANTITY", "WEBSITE", "UNIT", "LAW", "MONEY", "PRODUCT", "CURR"
]

class NEREntity(BaseModel):
    entity_value: str = Field(..., description="The actual named entity found in the text.")
    entity_type: EntityType = Field(..., description="The entity type.")

class NERData(BaseModel):
    story_entities: List[NEREntity] = Field(..., description="A list of entities found in the text.")
def extract_entities_from_story(story, model, tokenizer):
    """
    Extract named entities from Arabic text.
    This function demonstrates the recommended approach for optimal results.
    """
    entities_extraction_messages = [
        {
            "role": "system",
            "content": "\n".join([
                "You are an advanced NLP entity extraction assistant.",
                "Your task is to extract named entities from Arabic text according to a given Pydantic schema.",
                "Ensure that the extracted entities exactly match how they appear in the text, without modifications.",
                "Follow the schema strictly, maintaining the correct entity types and structure.",
                "Output the extracted entities in JSON format, structured according to the provided Pydantic schema.",
                "Do not add explanations, introductions, or extra text, Only return the formatted JSON output."
            ])
        },
        {
            "role": "user",
            "content": "\n".join([
                "## Text:",
                story.strip(),
                "",
                "## Pydantic Schema:",
                json.dumps(NERData.model_json_schema(), ensure_ascii=False, indent=2),
                "",
                "## Text Entities:",
                "```json"
            ])
        }
    ]

    # Apply chat template
    text = tokenizer.apply_chat_template(
        entities_extraction_messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Generate response
    model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=1024,
        do_sample=False,
    )

    # Decode response
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
Example Usage
# Example Arabic text
story = """
مضابط بلدية نابلس عام ( 1308 ) هجري مضبط رقم 435 .
"""
# Extract entities
response = extract_entities_from_story(story, model, tokenizer)
print(response)
# Parse JSON response
import json
entities = json.loads(response)
print(entities)
Output:
{
  "story_entities": [
    {"entity_value": "بلدية نابلس", "entity_type": "ORG"},
    {"entity_value": "نابلس", "entity_type": "GPE"},
    {"entity_value": "عام ( 1308 ) هجري", "entity_type": "DATE"},
    {"entity_value": "435", "entity_type": "ORDINAL"}
  ]
}
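Because the user prompt primes the model with an opening ```json fence, the raw response may occasionally include a closing fence or stray text around the JSON. The helper below is not part of the model; it is just an illustrative sketch that strips fences and validates the output against the NERData schema (Pydantic v2, Python 3.9+).
def parse_entities(response: str) -> NERData:
    # Strip any Markdown code fences the model may emit around the JSON
    cleaned = response.strip().removeprefix("```json").removesuffix("```").strip()
    # Validate structure and entity types against the Pydantic schema
    return NERData.model_validate_json(cleaned)

parsed = parse_entities(response)
for entity in parsed.story_entities:
    print(entity.entity_type, "->", entity.entity_value)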
Model Performance
The model performs reasonably well on Arabic NER within the scope of its training data, which is a limited sample of the Wojood dataset. Because that sample exhibits class imbalance across entity types, recognition accuracy may vary from one entity type to another.
Citation
- Wojood dataset: SinaLab/ArabicNER
- Base Qwen2.5 model: Qwen/Qwen2.5-0.5B-Instruct
License
This model follows the license terms of the base Qwen2.5 model and the Wojood dataset.