DiffutronLM-0.3B-Base

DiffutronLM-0.3B-Base is the foundational Masked Diffusion Language Model (MDLM) of the Diffutron series, tailored specifically for the Turkish language.

This model represents the completion of the Continual Pre-training (CPT) phase. It has successfully adapted the multilingual representations of its backbone to the agglutinative complexity and morphological nuances of Turkish.

⚠️ Note: This is a base foundation model. It has not been instruction-tuned or aligned for chat capabilities. If you are looking for a model that follows prompts and answers questions, please use DiffutronLM-0.3B-Instruct.

📌 Model Details

  • Model Type: Masked Diffusion Language Model (MDLM) Base
  • Base Architecture: jhu-clsp/mmBERT-base (Multilingual Encoder)
  • Language: Turkish
  • Parameter Count: 307M (0.3B)
  • Context Length: 512 tokens
  • Training Libraries: dllm, PyTorch
  • Status: Foundation / Base Model (Post-CPT)

🚀 Architecture & Continual Pre-training (CPT)

Unlike standard autoregressive models, Diffutron models text generation as a discrete diffusion process. To align the base encoder's latent space with the Turkish target distribution while preserving cross-lingual reasoning, this model underwent a specialized CPT pipeline:

  • Data Curation: Trained on a composite dataset of approximately 2 million sequences (max length 512) sourced from:
    • Havadis: Comprehensive Turkish news articles.
    • Temiz-OSCAR: A cleaned, filtered subset of the Common Crawl-based Turkish OSCAR corpus.
    • Turkish Wikipedia: High-quality encyclopedic sequences.
  • Efficient Adaptation via LoRA: Instead of full-parameter fine-tuning, which risks catastrophic forgetting, we applied Low-Rank Adaptation (LoRA) with a high rank ($r=256$, $\alpha=256$), targeting all linear modules (attention Q, K, V, O and the MLP input/output projections). This left ~14.94% of parameters trainable.
  • Objective: Masked Language Modeling (MLM).
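The low-rank update described above can be illustrated with a minimal hand-rolled sketch. This is not the training code (the CPT run used a LoRA library on top of the full backbone); it only shows the mechanics of freezing a base linear layer and adding a trainable rank-256 bypass:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha / r) * B(A(x)).
    The base weights are frozen; only the low-rank factors A and B train."""
    def __init__(self, base: nn.Linear, r: int = 256, alpha: int = 256):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze backbone weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # B = 0, so training starts at W
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(768, 768), r=256, alpha=256)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction of this layer: {trainable / total:.2%}")
```

Note the per-layer trainable fraction differs from the whole-model ~14.94% figure, since embeddings and other non-adapted parameters dilute the ratio at the model level.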

📊 Intrinsic Evaluation

To quantify the improvements gained from the CPT phase, we conducted an intrinsic evaluation using perplexity on the Bilkent Turkish Writings Dataset (evaluated with a masked language modeling probability of 0.15).

The CPT process resulted in a significant reduction in perplexity, indicating a strong alignment with Turkish linguistic structures:

  • jhu-clsp/mmBERT-base (Pre-CPT): 3.42
  • DiffutronLM-0.3B-Base (Post-CPT): 2.75

(Note: Downstream task evaluations on the CETVEL benchmark were conducted on the Instruct-tuned versions of this model.)
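For reference, masked-LM perplexity is the exponential of the cross-entropy loss over masked positions only. A minimal sketch of the metric, assuming the Hugging Face convention of labeling unmasked positions with -100:

```python
import math
import torch
import torch.nn.functional as F

def masked_ppl(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Perplexity over masked positions.
    logits: (batch, seq, vocab); labels: (batch, seq) with -100 at
    positions that were NOT masked (the ignore_index convention)."""
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    return math.exp(loss.item())

# Toy check: a confident, correct prediction on the single masked
# position yields a perplexity close to 1.
logits = torch.full((1, 4, 10), -10.0)
labels = torch.full((1, 4), -100)
labels[0, 2] = 7          # only position 2 was masked
logits[0, 2, 7] = 10.0    # model assigns high score to the right token
print(masked_ppl(logits, labels))
```

In the actual evaluation, logits come from a forward pass over sequences masked at probability 0.15, averaged across the Bilkent Turkish Writings Dataset.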

💻 Usage

As a base masked diffusion model, this checkpoint is ideal for:

  1. Further Fine-tuning: Acting as a starting point for domain-specific continued pre-training or custom instruction tuning.
  2. Masked Token Prediction: Filling in blanks or reconstructing corrupted text.
  3. Unconditional/Conditional Generation: Generating text using a discrete diffusion sampling loop (e.g., via the dllm library).

Because the model is non-autoregressive, standard AutoModelForCausalLM.generate() pipelines will not work. Use discrete diffusion generation strategies instead.
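Discrete diffusion generation proceeds by iterative unmasking rather than left-to-right decoding. The toy loop below sketches one common strategy, confidence-based unmasking; the score function, MASK_ID, and step budget are illustrative stand-ins (a random scorer replaces the real model here, and the dllm library provides the actual samplers):

```python
import torch

MASK_ID = 0   # hypothetical mask-token id
VOCAB = 100   # toy vocabulary size
STEPS = 8
SEQ_LEN = 16

def score(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a forward pass of the masked diffusion model:
    returns (seq_len, vocab) logits. Replace with real model logits."""
    return torch.randn(tokens.size(0), VOCAB)

tokens = torch.full((SEQ_LEN,), MASK_ID)           # start fully masked
for step in range(STEPS):
    masked = (tokens == MASK_ID).nonzero().squeeze(-1)
    if masked.numel() == 0:
        break
    logits = score(tokens)
    logits[:, MASK_ID] = float("-inf")             # never predict the mask
    probs = logits.softmax(-1)
    conf, pred = probs.max(-1)
    # Commit the k most confident masked positions this step.
    k = max(1, masked.numel() // (STEPS - step))
    top = conf[masked].topk(k).indices
    tokens[masked[top]] = pred[masked[top]]

print(tokens.tolist())
```

Each iteration re-scores the full sequence, so later steps can condition on tokens committed earlier, which is what distinguishes this loop from single-shot MLM infilling.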

⚠️ Limitations

  • No Instruction Tuning: Will not respond naturally to QA prompts or instructions.
  • Multilingual Backbone: Although heavily adapted to Turkish, the model retains its multilingual encoder backbone, so residual non-Turkish behavior may occasionally surface.
  • Context Window: Restricted to a 512-token context window during the base phase.

πŸ“ Citation

@misc{diffutron2026,
  author = {Kocabay, Şuayp Talha and Akkuş, Talha Rüzgar},
  title = {Diffutron: A Masked Diffusion Language Model for Turkish Language},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/collections/diffutron/diffutronlm}}
}