Ottoman Turkish NLP Suite


Ottoman Turkish NLP Suite 📜🤖

Bridging the gap between Ottoman Archives and Large Language Models.

🌍 Mission

Ottoman Turkish NLP Suite is an open-science initiative dedicated to developing state-of-the-art resources for low-resource historical languages, with a specific focus on Ottoman Turkish.

Despite the richness of the Ottoman archives, few open-source datasets and models can handle Rika (handwritten) script, Matbu (printed) text, and accurate transliteration (Arabic to Latin script). This organization aims to fill that gap by providing:

Curated Datasets: High-quality instruction-tuning datasets extracted from archives.
Fine-Tuned LLMs: Specialized models (based on Qwen, DeepSeek, Llama) for historical text understanding.
OCR/HTR Tools: Post-processing and recognition tools for digital humanities.

Ottoman Turkish NLP Suite is an open-science initiative that aims to develop state-of-the-art AI resources for Ottoman Turkish and other low-resource historical languages.

Despite the richness of the Ottoman archives, the lack of effective open-source datasets and models for Rika (handwritten) script, Matbu texts, and transcription (from Arabic letters to Latin letters) is a major problem. This organization aims to fill that gap by producing work in the following areas:

Curated Datasets: Training-ready datasets extracted from archives and theses.
Fine-Tuned LLMs: Models specialized for understanding historical texts (Qwen, DeepSeek, etc.).
OCR/HTR Tools: Text recognition and correction tools for digital humanities.

🚀 Key Focus Areas

Text Transliteration: Converting Ottoman script (Arabic alphabet) to Modern Turkish (Latin alphabet) with high accuracy.
Historical NER (Named Entity Recognition): Extracting entities such as place names and person names from 19th- and 20th-century texts.
Semantic Search & RAG: Building knowledge graphs and retrieval systems for library archives.
OCR Correction: Post-processing raw OCR outputs using context-aware LLMs.
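
As a rough illustration of the OCR correction step, the sketch below snaps noisy OCR tokens to the closest entry in a reference vocabulary. It uses plain string similarity from Python's standard `difflib` module as a simple stand-in, not the context-aware LLM approach named above. The vocabulary, the `correct_token` and `correct_line` helpers, and the 0.7 cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical reference vocabulary; a real one would come from a curated lexicon.
VOCAB = ["İstanbul", "Edirne", "Konya"]


def correct_token(token, vocab=VOCAB, cutoff=0.7):
    """Return the closest vocabulary entry, or the token unchanged if none is close enough."""
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line, vocab=VOCAB):
    """Correct each whitespace-separated token of an OCR line independently."""
    return " ".join(correct_token(tok, vocab) for tok in line.split())


if __name__ == "__main__":
    # "1" for "l" and "c" for "e" are typical character-level OCR confusions.
    print(correct_line("Istanbu1 Edirnc Konya"))
```

Because Turkish is agglutinative, a lookup like this would also pull inflected forms toward their bare stems, which is one reason a context-aware model is the better fit for real archive text.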

📚 Featured Datasets

ottoman-place-names-gazetteer: A dataset of 20,000+ unique Ottoman-Turkish place name pairs derived from official state archives. Ideal for vocabulary expansion and transliteration tasks.
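
A minimal sketch of how such place-name pairs could support transliteration and place-name tagging is shown here. The three sample pairs are illustrative entries typed in for this example, not rows copied from the dataset, and the dataset's actual column names are not assumed. The `normalize` and `find_place_names` helpers are hypothetical names.

```python
# Map Arabic-style yeh and kaf to the Persian-style code points often used
# for Ottoman text, so that both spellings hit the same gazetteer key.
_NORMALIZE = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def normalize(text):
    """Unify yeh/kaf variants before any dictionary lookup."""
    return text.translate(_NORMALIZE)


# Illustrative (Ottoman script, Latin script) pairs; assumptions for this sketch only.
SAMPLE_PAIRS = [
    ("استانبول", "İstanbul"),
    ("ادرنه", "Edirne"),
    ("قونیه", "Konya"),
]

GAZETTEER = {normalize(ottoman): latin for ottoman, latin in SAMPLE_PAIRS}


def find_place_names(text):
    """Return (token, latin_form) for every whitespace token found in the gazetteer."""
    hits = []
    for token in normalize(text).split():
        latin = GAZETTEER.get(token)
        if latin is not None:
            hits.append((token, latin))
    return hits


if __name__ == "__main__":
    for token, latin in find_place_names("استانبول و ادرنه"):
        print(token, "->", latin)
```

Exact token lookup is the simplest possible design; multi-word names and suffixed forms would need longest-match or fuzzy matching on top of it.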