NLLB-200-600M for Ancient Greek to Modern Greek (LoRA)

This model is a fine-tuned version of facebook/nllb-200-distilled-600M for translating Ancient Greek to Modern Greek.

It was fine-tuned using LoRA (Low-Rank Adaptation) on the sentence-level AG-MG Parallel Corpus.

Crucially, the tokenizer has been expanded with 148 polytonic Ancient Greek characters that were missing from the original NLLB-200 vocabulary, which significantly reduces hallucinations and <unk> tokens.
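As a rough illustration of how candidate characters for such an expansion could be found, the sketch below lists the distinct characters of a text that fall in the Unicode Greek Extended block (U+1F00 to U+1FFF), where most precomposed polytonic letters live. This is not the authors' procedure, and the function name `polytonic_chars` is hypothetical. It makes no attempt to reproduce the 148 characters mentioned above.

```python
import unicodedata

# Greek Extended block, which holds most precomposed polytonic letters.
GREEK_EXTENDED = range(0x1F00, 0x2000)


def polytonic_chars(text: str) -> list[str]:
    """Return the distinct Greek Extended characters in `text`, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) in GREEK_EXTENDED})


if __name__ == "__main__":
    # "Ὦ ξεῖν" written with explicit code points to avoid encoding ambiguity.
    sample = "\u1f6e \u03be\u03b5\u1fd6\u03bd"
    for ch in polytonic_chars(sample):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}")
```

Characters reported this way would still have to be checked against the base tokenizer's vocabulary before deciding which ones to add.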

This model was trained by Spyridon Mavromatis at the Institute for Language and Speech Processing (ILSP), "Athena" RC, and the National and Kapodistrian University of Athens (NKUA) as part of an M.Sc. thesis.


Model Details

  • Base Model: facebook/nllb-200-distilled-600M

  • Method: LoRA (Rank=16, Alpha=32, Dropout=0.05)

  • Vocabulary: Expanded with 148 Polytonic Greek characters.

  • Training Data: ~130k sentence pairs from the AG-MG Corpus.
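To give a feel for what Rank=16 and Alpha=32 mean in practice, here is a small worked calculation. It assumes the standard LoRA formulation, in which a weight of shape d_out x d_in receives a low-rank update B @ A scaled by alpha / rank. The hidden size of 1024 is an assumption for illustration and is not stated in this card; the card also does not say which layers were adapted.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one d_out x d_in weight: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


RANK, ALPHA = 16, 32   # values from the model card
D_MODEL = 1024         # assumed hidden size, not stated in the card

scaling = ALPHA / RANK                              # factor applied to B @ A
added = lora_param_count(D_MODEL, D_MODEL, RANK)    # adapter parameters
full = D_MODEL * D_MODEL                            # frozen weight parameters

print(f"scaling = {scaling}")
print(f"LoRA params per projection = {added} ({added / full:.2%} of {full})")
```

Under these assumptions, each adapted square projection gains only a small fraction of the parameters of the frozen weight it modifies.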


Usage

Load the tokenizer from this repository, load the base model, resize its embeddings to match the expanded vocabulary, and then load the PEFT adapter.


import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import PeftModel

# 1. Load Tokenizer (from THIS repo to get the added tokens)
adapter_repo = "ilsp/nllb-200-600M-ag-mg-lora"
tokenizer = AutoTokenizer.from_pretrained(adapter_repo, src_lang="ell_Grek")

# 2. Load Base Model
base_model_id = "facebook/nllb-200-distilled-600M"
model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id, device_map="auto")

# 3. Resize Embeddings (CRITICAL: prevents size mismatch error)
model.resize_token_embeddings(len(tokenizer))

# 4. Load LoRA Adapter
model = PeftModel.from_pretrained(model, adapter_repo)
model.eval()

# 5. Inference
text = "Ὦ ξεῖν', ἀγγέλλειν Λακεδαιμονίοις ὅτι τῇδε κείμεθα."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# We force the target language to be Modern Greek ("ell_Grek")
target_lang_id = tokenizer.convert_tokens_to_ids("ell_Grek")

translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=target_lang_id,
    max_length=100
)

print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
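One preprocessing step that may be worth considering before tokenization is Unicode normalization. Polytonic text can arrive in decomposed form, with a base letter followed by combining diacritics, and such sequences would not match single precomposed characters in a vocabulary. The sketch below composes the text to NFC using only the standard library. Whether this step is needed depends on how the added characters were stored, which this card does not state, so treat it as an assumption; the helper name `to_nfc` is hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Compose base letters and combining diacritics into precomposed code points."""
    return unicodedata.normalize("NFC", text)


# Alpha followed by a combining smooth breathing (decomposed form of "ἀ").
decomposed = "\u03b1\u0313"
composed = to_nfc(decomposed)

# Two code points become one after composition.
print(len(decomposed), len(composed))
```

The normalized string could then be passed to the tokenizer in place of the raw input.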

Performance

Main Test Set Results

Evaluated on the main Test Set of 2,000 sentence pairs (Attic and Koine Hellenistic dialects).

| Model | Method | BLEU ↑ | chrF++ ↑ | TER ↓ | BERTScore F1 ↑ | COMET ↑ | ΔBLEU |
|---|---|---|---|---|---|---|---|
| NLLB-600M | Base | 1.55 | 16.86 | 106.80 | 0.880 | 0.539 | - |
| | 👉 LoRA | 7.43 | 29.31 | 88.32 | 0.903 | 0.667 | +5.88 |
| NLLB-1.3B | Base | 2.15 | 17.78 | 106.41 | 0.885 | 0.573 | - |
| | LoRA | 8.01 | 30.02 | 87.74 | 0.905 | 0.687 | +5.86 |
| M2M100-1.2B | Base | 0.62 | 10.70 | 100.50 | 0.858 | 0.475 | - |
| | QLoRA | 10.96 | 33.09 | 82.99 | 0.911 | 0.710 | +10.34 |
| | Full FT | 9.60 | 31.16 | 83.43 | 0.908 | 0.692 | +8.98 |
| Krikri-8B-Instruct | Base | 8.29 | 29.87 | 88.13 | 0.895 | 0.695 | - |
| | QLoRA | 11.90 | 34.07 | 84.16 | 0.906 | 0.713 | +3.60 |
| | Full FT | 13.16 | 34.71 | 83.68 | 0.848 | 0.702 | +4.45 |
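The ΔBLEU column appears to be the BLEU of a fine-tuned model minus the BLEU of its base model. That reading is an assumption, checked here only against the NLLB rows; the helper name `delta_bleu` is hypothetical.

```python
def delta_bleu(finetuned: float, base: float) -> float:
    """BLEU gain of a fine-tuned model over its base, rounded to two decimals."""
    return round(finetuned - base, 2)


# NLLB-600M on the main test set: LoRA 7.43 vs. base 1.55.
print(delta_bleu(7.43, 1.55))

# NLLB-1.3B on the main test set: LoRA 8.01 vs. base 2.15.
print(delta_bleu(8.01, 2.15))
```

Not every row reproduces exactly from the rounded scores shown, so the reported values should be taken as authoritative.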

Stress Set Results (Rare Dialects)

Evaluated on the Stress Set of 250 sentence pairs (Ionic, Doric, and Homeric dialects).

| Model | Method | BLEU ↑ | chrF++ ↑ | TER ↓ | BERTScore F1 ↑ | COMET ↑ | ΔBLEU |
|---|---|---|---|---|---|---|---|
| NLLB-600M | Base | 0.77 | 14.40 | 118.13 | 0.866 | 0.484 | - |
| | 👉 LoRA | 5.65 | 28.74 | 88.01 | 0.900 | 0.638 | +4.89 |
| NLLB-1.3B | Base | 1.25 | 16.15 | 107.03 | 0.873 | 0.525 | - |
| | LoRA | 5.68 | 28.94 | 88.24 | 0.900 | 0.656 | +4.43 |
| M2M100-1.2B | Base | 0.07 | 9.37 | 100.34 | 0.840 | 0.427 | - |
| | QLoRA | 9.52 | 33.30 | 81.95 | 0.911 | 0.691 | +9.45 |
| | Full FT | 8.16 | 31.12 | 83.11 | 0.907 | 0.664 | +8.09 |
| Krikri-8B-Instruct | Base | 6.55 | 28.98 | 87.38 | 0.900 | 0.675 | - |
| | QLoRA | 10.37 | 34.09 | 82.28 | 0.911 | 0.717 | +3.82 |
| | Full FT | 12.80 | 35.90 | 81.40 | 0.884 | 0.716 | +6.11 |

Citation

If you use this model, please cite our LREC 2026 paper:

Mavromatis, S., Sofianopoulos, S., Prokopidis, P., & Giagkou, M. (2026). Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models. In Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) (pp. 8685–8698). European Language Resources Association (ELRA). https://doi.org/10.63317/4cdk64dgm2w9

@inproceedings{mavromatis-etal-2026-ancient,
  title     = {Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models},
  author    = {Mavromatis, Spyridon and Sofianopoulos, Sokratis and Prokopidis, Prokopis and Giagkou, Maria},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month     = {May},
  year      = {2026},
  pages     = {8685--8698},
  address   = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  editor    = {Piperidis, Stelios and Bel, Núria and van den Heuvel, Henk and Ide, Nancy and Krek, Simon and Toral, Antonio},
  doi       = {10.63317/4cdk64dgm2w9}
}

Note on resources: The fine-tuned models are publicly released. The accompanying AG-MG Parallel Corpus is not publicly distributed due to the complex and uncertain copyright status of the source materials.
