Ottoman Turkish NLP Suite 📜🤖

Bridging the gap between Ottoman Archives and Large Language Models.

🌍 Mission

Ottoman Turkish NLP Suite is an open-science initiative dedicated to developing state-of-the-art resources for Low-Resource Historical Languages, with a specific focus on Ottoman Turkish.

Despite the vast richness of Ottoman archives, open-source datasets and models capable of handling Rika (handwritten) scripts, Matbu (printed) texts, and accurate Transliteration (Arabic to Latin script) are scarce. This organization aims to fill this gap by providing:

  1. Curated Datasets: High-quality, training-ready instruction-tuning datasets extracted from archives and theses.
  2. Fine-Tuned LLMs: Specialized models (based on Qwen, DeepSeek, Llama) for historical text understanding.
  3. OCR/HTR Tools: Post-processing and recognition tools for digital humanities.

🚀 Key Focus Areas

  • Text Transliteration: Converting Ottoman script (Arabic alphabet) to Modern Turkish (Latin alphabet) with high accuracy.
  • Historical NER (Named Entity Recognition): Extracting entities such as place names and person names from 19th- and 20th-century texts.
  • Semantic Search & RAG: Building Knowledge Graphs and retrieval systems for library archives.
  • OCR Correction: Post-processing raw OCR outputs using context-aware LLMs.
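
Transliteration and OCR-correction quality is commonly reported as character error rate (CER): the edit distance between a system output and a reference, divided by the reference length. The sketch below is one minimal way to compute it and is not taken from this organization's tooling; the function names are illustrative, and the NFC normalization step is an assumption added so that composed and decomposed forms of Turkish letters such as "ğ" compare as equal.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after NFC normalization."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# One wrong character out of six in the reference.
cer = character_error_rate("Bağdad", "Bagdad")
```

A lower CER is better; 0.0 means the output matches the reference exactly after normalization.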

📚 Featured Datasets

  • ottoman-place-names-gazetteer: A dataset of 20,000+ unique Ottoman-Turkish place name pairs derived from official state archives. Ideal for vocabulary expansion and transliteration tasks.
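
As a rough illustration of how place name pairs could feed a transliteration fine-tune, the sketch below turns (Ottoman script, Latin script) pairs into instruction-style JSONL records. The dataset's actual column names and file layout are not documented here, so the `instruction` / `input` / `output` fields, the prompt wording, and the two sample pairs are all assumptions made for the example rather than rows from the gazetteer.

```python
import json

# Hypothetical prompt wording; adjust to the template the target model expects.
PROMPT = "Transliterate the following Ottoman Turkish place name into Latin script."


def to_instruction_record(ottoman: str, latin: str) -> dict:
    """Build one instruction-tuning record from a place name pair."""
    return {"instruction": PROMPT, "input": ottoman, "output": latin}


def to_jsonl(pairs) -> str:
    """Serialize pairs as JSON Lines, keeping non-ASCII characters readable."""
    lines = [
        json.dumps(to_instruction_record(ottoman, latin), ensure_ascii=False)
        for ottoman, latin in pairs
    ]
    return "\n".join(lines)


# Illustrative pairs only, not taken from the dataset.
sample_pairs = [("استانبول", "İstanbul"), ("بغداد", "Bağdad")]
jsonl_text = to_jsonl(sample_pairs)
```

Passing `ensure_ascii=False` keeps the Arabic-script and Turkish characters human-readable in the output instead of escaping them as `\uXXXX` sequences.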

🛠 Tech Stack

  • Base Models: Qwen 2.5, DeepSeek-V3, DeepSeek-OCR
  • Frameworks: Hugging Face Transformers, Unsloth, vLLM
  • Domain: Digital Humanities, Library and Information Science (LIS)

👥 Lead Researcher

This initiative is led by Dr. Gökhan Usta (Istanbul Technical University, Library and Information Science).


For collaborations, please open a discussion in the Community tab.
