Dual-objective Language Model (470M)

This is the model repository for the paper Dual-objective Language Models: Training Efficiency Without Overfitting, published at ICLR 2026.

The dual-objective approach combines autoregressive and masked-diffusion training objectives within a single standard transformer — no architectural changes required. The resulting model can be used as a causal language model, a masked (MNTP) language model, or with prefix attention at inference time.

  • Authors: David Samuel and Lucas Georges Gabriel Charpentier
  • Paper: arXiv:2512.14549
  • Code: github.com/ltgoslo/dual-language-models
  • License: Apache 2.0

Model Overview

  • Parameters: 470M total (360M non-embedding)
  • Layers: 24
  • Hidden size: 1024
  • Attention heads: 16 (head dim = 64)
  • Intermediate size: 3554 (SwiGLU)
  • Vocabulary: 51,200 BPE tokens
  • Context length: 2,048 tokens
  • Training tokens: 32 billion
  • Positional encoding: RoPE (θ = 160,000)
  • Normalization: RMSNorm (pre-norm, ε = 1e-7)
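
As a sanity check, the parameter counts above can be roughly reproduced from the listed hyperparameters. A back-of-the-envelope sketch that ignores normalization scales, biases, and the projection head:

# Rough parameter estimate from the hyperparameters listed above.
hidden, layers, intermediate, vocab = 1024, 24, 3554, 51_200

attn = 4 * hidden * hidden              # Q, K, V and output projections
ffn = 3 * hidden * intermediate         # SwiGLU uses three weight matrices
non_embedding = layers * (attn + ffn)   # ~363M, close to the 360M quoted above
embeddings = 2 * vocab * hidden         # untied input and output embeddings
total = non_embedding + embeddings      # ~468M, close to the 470M quoted above

print(f"non-embedding ≈ {non_embedding / 1e6:.0f}M, total ≈ {total / 1e6:.0f}M")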

Main Branch

The main branch contains the model trained with the recommended configuration for regular data settings (Remark 1 in the paper):

| Setting | Value |
|---|---|
| α (autoregressive weight) | 63/64 (0.984375) |
| Data repetitions | 1× (32B total tokens) |
| Training objective | 98.4% autoregressive + 1.6% masked-diffusion |

This configuration achieves strong autoregressive performance while gaining bidirectional capabilities essentially for free.

All Trained Models

All 50 models from the paper are available as branches. Each branch is named alpha_{α:.3f}_n-reps={R}:


The models span:

  • Repetitions (R): 1, 2, 4, 8, 16, 32, 64, 128, 256
  • Alpha (α): 0 (pure masked-diffusion) to 1 (pure autoregressive), with values at 0, 1/256, 1/64, 1/16, 1/4, 1/2, 3/4, 15/16, 63/64, 255/256, 1

For example:

  • alpha_0.750_n-reps=32 — dual model (α=3/4) trained with 32 data repetitions
  • alpha_0.125_n-reps=128 — dual model (α=1/8) trained with 128 data repetitions
  • alpha_1.000_n-reps=1 — pure autoregressive baseline with 1 repetition
  • alpha_0.000_n-reps=128 — pure masked-diffusion with 128 repetitions
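
Because branch names follow the alpha_{α:.3f}_n-reps={R} pattern, they can also be constructed programmatically. A minimal sketch using the examples above:

# Build a branch (revision) name from alpha and the number of data repetitions.
def branch_name(alpha: float, n_reps: int) -> str:
    return f"alpha_{alpha:.3f}_n-reps={n_reps}"

print(branch_name(3 / 4, 32))    # alpha_0.750_n-reps=32
print(branch_name(1 / 8, 128))   # alpha_0.125_n-reps=128
print(branch_name(1.0, 1))       # alpha_1.000_n-reps=1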

Recommended Configurations

| Regime | Repetitions | Recommended α | Branch |
|---|---|---|---|
| Regular data (≤16 reps) | 1× | 63/64 (0.984375) | main |
| Data-constrained (32 reps) | 32× | 3/4 (0.75) | alpha_0.750_n-reps=32 |
| Data-constrained (128 reps) | 128× | 1/8 (0.125) | alpha_0.125_n-reps=128 |
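
For convenience, the recommendations above can be expressed as a small lookup (a hypothetical helper, not part of the repository):

# Hypothetical mapping from data-repetition count to the recommended branch (revision).
RECOMMENDED_REVISION = {
    1: "main",                      # regular data (≤16 reps), α = 63/64
    32: "alpha_0.750_n-reps=32",    # data-constrained, α = 3/4
    128: "alpha_0.125_n-reps=128",  # data-constrained, α = 1/8
}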

Usage

The model is implemented on top of the LLaMA backbone (LlamaModel from transformers) with a custom GELU projection head. All model classes require trust_remote_code=True.

Loading the Model

from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
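
The checkpoint was trained in bfloat16 (see Training Details below), so it can optionally be cast and moved to an accelerator. A minimal sketch, assuming a CUDA device is available:

import torch

# Optional: run in bfloat16 on GPU (the precision used during training).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device=device, dtype=torch.bfloat16)
model.eval()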

Text Generation (Autoregressive)

import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

prompt = "The history of artificial intelligence begins with"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Masked Language Model (Bidirectional / MNTP)

The model can also be used bidirectionally by loading it as DLMLlamaForMaskedLM. This uses full (non-causal) attention, enabling the model to attend to all positions.

import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForMaskedLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForMaskedLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

# Mask token id
mask_id = tokenizer.convert_tokens_to_ids("<mask>")

# Predict masked tokens with bidirectional context
text = "The capital of <mask> is Paris"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The model uses MNTP: prediction at position i is for token i+1
# Find the mask position and get the prediction from position before it
mask_pos = (inputs["input_ids"] == mask_id).nonzero(as_tuple=True)[1].item()
predicted_id = outputs.logits[0, mask_pos - 1].argmax(dim=-1)
print(f"Predicted: {tokenizer.decode(predicted_id)}")

Perplexity / Log-Likelihood Scoring

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

text = "Language models predict the next token in a sequence."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"Loss (NLL per token): {outputs.loss.item():.4f}")
print(f"Perplexity: {torch.exp(outputs.loss).item():.2f}")

Loading a Specific Branch

from transformers import LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

# Load the data-constrained model (α=3/4, 32 repetitions)
model_id = "ltg/dual-lm-470m"
config = LlamaConfig.from_pretrained(model_id, revision="alpha_0.750_n-reps=32")
model = DLMLlamaForCausalLM.from_pretrained(
    model_id,
    revision="alpha_0.750_n-reps=32",
    config=config,
    trust_remote_code=True,
)

Available Model Classes

| Task | Class |
|---|---|
| Causal LM (generation) | DLMLlamaForCausalLM |
| Masked LM (bidirectional) | DLMLlamaForMaskedLM |
| Sequence Classification | DLMLlamaForSequenceClassification |
| Token Classification | DLMLlamaForTokenClassification |
| Question Answering | DLMLlamaForQuestionAnswering |

All classes are in hf_model_llama.modeling_dlm_llama.
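
As an example of the classification heads, a sequence classifier can be loaded with the same pattern as above. A sketch only: the num_labels value is task-specific, and the classification head is randomly initialized until fine-tuned.

from transformers import LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForSequenceClassification

model_id = "ltg/dual-lm-470m"
config = LlamaConfig.from_pretrained(model_id)
config.num_labels = 2  # e.g. a binary task; set to your task's label count

classifier = DLMLlamaForSequenceClassification.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
)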

Architecture

The model follows the standard modern transformer recipe with no modifications:

  • Pre-normalization with RMSNorm (with learnable scale)
  • Rotary Positional Embeddings (RoPE) with θ = 160,000
  • SwiGLU feed-forward layers
  • Classifier head: RMSNorm → Linear + GELU(tanh) → LM head (with bias)
  • No tied embeddings (tie_weights = false)

The only difference between autoregressive and masked-diffusion modes is the input (original vs. partially masked tokens) and the attention mask (causal vs. bidirectional). Both modes predict the next token at each position.
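
To illustrate that statement, here is a simplified sketch of the two input/mask combinations. This is not the repository's training code; the actual masking schedule and loss weighting follow the paper.

import torch

# Toy token ids; in practice these come from the tokenizer.
input_ids = torch.tensor([[12, 57, 893, 4, 301, 77]])
seq_len = input_ids.shape[1]
mask_id = 3  # placeholder; use tokenizer.convert_tokens_to_ids("<mask>") with the real tokenizer

# Both modes share the same targets: the next token at every position (MNTP).
labels = input_ids[:, 1:]

# Autoregressive mode: original tokens, causal attention.
ar_inputs = input_ids
ar_attention = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Masked-diffusion mode: partially masked tokens, bidirectional attention.
keep = torch.rand(input_ids.shape) >= 0.5  # simplified fixed 50% masking rate
md_inputs = torch.where(keep, input_ids, torch.full_like(input_ids, mask_id))
md_attention = torch.ones(seq_len, seq_len, dtype=torch.bool)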

Training Details

| Hyperparameter | Value |
|---|---|
| Optimizer | Muon (Liu et al., 2025) |
| Learning rate | 0.007 |
| LR schedule | Warmup-Stable-Decay (no warmup, 2,048 decay steps) |
| Total steps | 8,192 |
| Batch size | 4M tokens (global) |
| Sequence length | 2,048 |
| Weight decay | 0.1 |
| Z-loss | 1e-4 |
| Precision | bfloat16 |
| Hardware | 128× AMD MI250X GPUs (256 logical devices) |
| Training corpus | HPLT v2 (English subset) |

The α parameter controls the split of GPUs between objectives: with 256 logical devices, α = 63/64 means 252 devices run autoregressive and 4 run masked-diffusion.
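
In code, the split described above is just a rounding of α times the device count (a tiny sketch of the arithmetic):

n_devices = 256                                      # 128 MI250X GPUs = 256 logical devices
alpha = 63 / 64
n_autoregressive = round(alpha * n_devices)          # 252
n_masked_diffusion = n_devices - n_autoregressive    # 4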

Evaluation Results

Autoregressive (Unidirectional) Performance

Normalized scores where 0% = random baseline, 100% = perfect.

| Model | ARC-C | ARC-E | BLiMP | CSQA | HSwag | MMLU | OBQA | PIQA | SIQA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Dual (α=63/64, 1 rep) | 5.7 | 28.6 | 63.7 | 35.1 | 31.1 | 4.9 | 17.6 | 40.9 | 14.3 | 26.9 |
| AR-only (α=1, 1 rep) | 5.9 | 30.3 | 61.3 | 33.5 | 31.7 | 3.8 | 13.6 | 39.4 | 15.2 | 26.1 |
| Dual (α=3/4, 32 reps) | 3.3 | 28.0 | 57.9 | 31.1 | 26.4 | 3.6 | 14.4 | 36.1 | 14.6 | 23.9 |
| AR-only (α=1, 32 reps) | 5.0 | 24.9 | 53.3 | 28.5 | 25.4 | 3.8 | 9.9 | 33.3 | 14.2 | 22.0 |
| Dual (α=1/8, 128 reps) | 1.7 | 23.6 | 56.1 | 24.8 | 14.2 | 1.6 | 8.5 | 28.1 | 13.3 | 19.1 |
| AR-only (α=1, 128 reps) | -1.0 | 12.3 | 33.2 | 6.8 | 8.1 | 1.1 | -0.5 | 15.8 | 8.9 | 9.4 |

Key Findings

  1. Regular data (≤16 reps): A small amount of masked-diffusion (α ≈ 63/64) improves bidirectional performance without losing any autoregressive quality.
  2. Data-constrained (>32 reps): Choose α so the autoregressive objective sees ~16 effective repetitions. The dual model dramatically outperforms pure autoregressive (19.1 vs 9.4 at 128 reps).
  3. Prefix attention: Dual-objective models reliably gain ~1+ percentage points by processing the context bidirectionally at inference time — no additional training needed.

Practical Recommendations from the Paper

Remark 1 (Regular data settings): Train with α ≈ 63/64 to gain strong bidirectional performance without losing autoregressive quality.

Remark 2 (Data-constrained settings): Choose α that exposes the autoregressive objective to roughly 16 repetitions of the training data.

Remark 3 (Prefix language modeling): At inference time, process the conditional part of the prompt fully bidirectionally for improved autoregressive performance.
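
Remark 3 amounts to a prefix-LM attention pattern: positions inside the prefix attend to each other bidirectionally, while the continuation remains causal. The repository's inference code is the reference for how this is wired in; below is only a generic sketch of such a mask (True = attention allowed), assuming a known prefix length:

import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean attention mask: bidirectional over the prefix, causal afterwards."""
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    in_prefix = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    in_prefix[:, :prefix_len] = True  # every position may attend to all prefix tokens
    return causal | in_prefix

print(prefix_lm_mask(seq_len=5, prefix_len=2).int())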

Citation

@inproceedings{
  samuel2026dualobjective,
  title={Dual-objective Language Models: Training Efficiency Without Overfitting},
  author={David Samuel and Lucas Georges Gabriel Charpentier},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=BrPt0GFgOM}
}