Ottoman Turkish NLP Suite 📜🤖
Bridging the gap between Ottoman Archives and Large Language Models.
🌍 Mission
Ottoman Turkish NLP Suite is an open-science initiative that develops state-of-the-art resources for low-resource historical languages, with a specific focus on Ottoman Turkish.
Despite the richness of the Ottoman archives, open-source datasets and models that can handle Rika (handwritten) script, Matbu (printed) text, and accurate transliteration (Arabic to Latin script) remain scarce. This organization aims to fill that gap by providing:
- Curated Datasets: High-quality, training-ready instruction-tuning datasets extracted from archives and theses.
- Fine-Tuned LLMs: Specialized models (based on Qwen, DeepSeek, Llama) for historical text understanding.
- OCR/HTR Tools: Recognition and post-processing tools for digital humanities.
🚀 Key Focus Areas
- Text Transliteration: Converting Ottoman script (Arabic alphabet) to Modern Turkish (Latin alphabet) with high accuracy.
- Historical NER (Named Entity Recognition): Extracting entities such as place names and person names from 19th- and 20th-century texts.
- Semantic Search & RAG: Building knowledge graphs and retrieval systems for library archives.
- OCR Correction: Post-processing raw OCR outputs using context-aware LLMs.
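One common way to quantify transliteration or OCR-correction accuracy is the character error rate (CER): the edit distance between a system output and a reference string, divided by the reference length. The sketch below is an illustrative standard-library implementation, not the suite's own evaluation code; the function names `levenshtein` and `character_error_rate` are assumptions made for this example.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after NFC normalization."""
    # NFC makes composed and decomposed forms of letters such as "İ" compare equal.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character out of eight: a dotless "I" instead of "İ".
    print(character_error_rate("İstanbul", "Istanbul"))
```

A lower CER means the output is closer to the reference; word-level metrics can be built the same way by comparing token lists instead of characters.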
📚 Featured Datasets
ottoman-place-names-gazetteer: A dataset of 20,000+ unique Ottoman-Turkish place name pairs derived from official state archives. Ideal for vocabulary expansion and transliteration tasks.
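As a rough illustration of how place-name pairs can support transliteration, the sketch below builds an exact-match lookup table after light normalization of the Arabic-script form. It makes no assumption about the dataset's actual schema or loading path: the four sample pairs are hand-picked for this example and are not taken from the dataset, and the helper names (`normalize_ottoman`, `build_lookup`, `transliterate_place`) are hypothetical.

```python
import re
import unicodedata

# Hand-picked illustrative pairs; NOT rows from the gazetteer dataset.
SAMPLE_PAIRS = [
    ("استانبول", "İstanbul"),
    ("بروسه", "Bursa"),
    ("ادرنه", "Edirne"),
    ("قونيه", "Konya"),
]

# Short-vowel marks (U+064B to U+0652) and tatweel (U+0640) are optional in writing.
_MARKS = re.compile("[\u064B-\u0652\u0640]")

# Map Arabic yeh/kaf to the Persian-style code points so both encodings match.
_VARIANTS = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def normalize_ottoman(text: str) -> str:
    """Normalize an Arabic-script string so spelling variants share one key."""
    text = unicodedata.normalize("NFC", text)
    text = _MARKS.sub("", text)
    return text.translate(_VARIANTS).strip()


def build_lookup(pairs):
    """Build a dict from normalized Ottoman spelling to Latin-script name."""
    return {normalize_ottoman(ottoman): latin for ottoman, latin in pairs}


def transliterate_place(name: str, lookup: dict):
    """Return the Latin-script name, or None when the name is not in the table."""
    return lookup.get(normalize_ottoman(name))


if __name__ == "__main__":
    lookup = build_lookup(SAMPLE_PAIRS)
    print(transliterate_place(SAMPLE_PAIRS[1][0], lookup))
```

Exact lookup only covers names already in the table; in practice it would more plausibly serve as a high-precision first pass or as a source of training pairs, with a model handling unseen names.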
🛠 Tech Stack
- Base Models: Qwen 2.5, DeepSeek-V3, DeepSeek-OCR
- Frameworks: Hugging Face Transformers, Unsloth, vLLM
- Domain: Digital Humanities, Library and Information Science (LIS)
👥 Lead Researcher
This initiative is led by Dr. Gökhan Usta (Istanbul Technical University, Library and Information Science).
- Website: Classifyes
- Institution: Istanbul Technical University (ITU)
For collaborations, please open a discussion in the Community tab.