Dual-objective Language Model (470M)
This is the model repository for the paper Dual-Objective Language Models: Training Efficiency Without Overfitting, published at ICLR 2026.
The dual-objective approach combines autoregressive and masked-diffusion training objectives within a single standard transformer — no architectural changes required. The resulting model can be used as a causal language model, a masked (MNTP) language model, or with prefix attention at inference time.
| Authors | David Samuel and Lucas Georges Gabriel Charpentier |
| Paper | arXiv:2512.14549 |
| Code | github.com/ltgoslo/dual-language-models |
| License | Apache 2.0 |
Model Overview
- Parameters: 470M total (360M non-embedding)
- Layers: 24
- Hidden size: 1024
- Attention heads: 16 (head dim = 64)
- Intermediate size: 3554 (SwiGLU)
- Vocabulary: 51,200 BPE tokens
- Context length: 2,048 tokens
- Training tokens: 32 billion
- Positional encoding: RoPE (θ = 160,000)
- Normalization: RMSNorm (pre-norm, ε = 1e-7)
Main Branch
The main branch contains the model trained with the recommended configuration for regular data settings (Remark 1 in the paper):
| Setting | Value |
|---|---|
| α (autoregressive weight) | 63/64 (0.984375) |
| Data repetitions | 1× (32B total tokens) |
| Training objective | 98.4% autoregressive + 1.6% masked-diffusion |
This configuration achieves strong autoregressive performance while gaining bidirectional capabilities essentially for free.
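The training run realizes this split by assigning whole devices to each objective (see Training Details below), but the effective objective can be read as an α-weighted mixture of the two losses. Below is a conceptual sketch of that weighting only; the loss values are placeholders, not outputs of the released code.

```python
import torch

# Conceptual sketch of the α-weighted objective on the main branch.
# The released run assigns whole devices to each objective rather than
# mixing losses in one batch (see Training Details); values are placeholders.
alpha = 63 / 64  # autoregressive weight

loss_autoregressive = torch.tensor(2.31)    # placeholder cross-entropy (causal batch)
loss_masked_diffusion = torch.tensor(2.95)  # placeholder cross-entropy (masked batch)

loss = alpha * loss_autoregressive + (1.0 - alpha) * loss_masked_diffusion
print(f"{alpha:.1%} autoregressive + {1 - alpha:.1%} masked-diffusion -> {loss.item():.4f}")
```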
All Trained Models
All 50 models from the paper are available as branches. Each branch is named alpha_{α:.3f}_n-reps={R}.
The models span:
- Repetitions (R): 1, 2, 4, 8, 16, 32, 64, 128, 256
- Alpha (α): 0 (pure masked-diffusion) to 1 (pure autoregressive), with values at 0, 1/256, 1/64, 1/16, 1/8, 1/4, 1/2, 3/4, 15/16, 63/64, 255/256, 1
For example:
- alpha_0.750_n-reps=32: dual model (α=3/4) trained with 32 data repetitions
- alpha_0.125_n-reps=128: dual model (α=1/8) trained with 128 data repetitions
- alpha_1.000_n-reps=1: pure autoregressive baseline with 1 repetition
- alpha_0.000_n-reps=128: pure masked-diffusion with 128 repetitions
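Since the naming convention is fixed, branch names can also be generated programmatically. A small sketch (the helper is illustrative, not part of the repository; the two examples are branches listed above):

```python
def branch_name(alpha: float, n_reps: int) -> str:
    """Format a branch name following the alpha_{α:.3f}_n-reps={R} convention."""
    return f"alpha_{alpha:.3f}_n-reps={n_reps}"

print(branch_name(3 / 4, 32))    # alpha_0.750_n-reps=32
print(branch_name(1 / 8, 128))   # alpha_0.125_n-reps=128
```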
Recommended Configurations
| Regime | Repetitions | Recommended α | Branch |
|---|---|---|---|
| Regular data (≤16 reps) | 1× | 63/64 (0.984375) | main |
| Data-constrained (32 reps) | 32× | 3/4 (0.75) | alpha_0.750_n-reps=32 |
| Data-constrained (128 reps) | 128× | 1/8 (0.125) | alpha_0.125_n-reps=128 |
Usage
The implementation builds on the LLaMA backbone (LlamaModel from transformers) and adds a custom GELU projection head. All model classes require trust_remote_code=True.
Loading the Model
```python
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
```
Text Generation (Autoregressive)
```python
import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

prompt = "The history of artificial intelligence begins with"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Masked Language Model (Bidirectional / MNTP)
The model can also be used bidirectionally by loading it as DLMLlamaForMaskedLM. This uses full (non-causal) attention, enabling the model to attend to all positions.
```python
import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForMaskedLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForMaskedLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

# Mask token id
mask_id = tokenizer.convert_tokens_to_ids("<mask>")

# Predict masked tokens with bidirectional context
text = "The capital of <mask> is Paris"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The model uses MNTP: prediction at position i is for token i+1
# Find the mask position and get the prediction from the position before it
mask_pos = (inputs["input_ids"] == mask_id).nonzero(as_tuple=True)[1].item()
predicted_id = outputs.logits[0, mask_pos - 1].argmax(dim=-1)
print(f"Predicted: {tokenizer.decode(predicted_id)}")
```
Perplexity / Log-Likelihood Scoring
```python
import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

text = "Language models predict the next token in a sequence."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"Loss (NLL per token): {outputs.loss.item():.4f}")
print(f"Perplexity: {torch.exp(outputs.loss).item():.2f}")
```
Loading a Specific Branch
```python
from transformers import LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

# Load the data-constrained model (α=3/4, 32 repetitions)
model_id = "ltg/dual-lm-470m"
config = LlamaConfig.from_pretrained(model_id, revision="alpha_0.750_n-reps=32")
model = DLMLlamaForCausalLM.from_pretrained(
    model_id,
    revision="alpha_0.750_n-reps=32",
    config=config,
    trust_remote_code=True,
)
```
Available Model Classes
| Task | Class |
|---|---|
| Causal LM (generation) | DLMLlamaForCausalLM |
| Masked LM (bidirectional) | DLMLlamaForMaskedLM |
| Sequence Classification | DLMLlamaForSequenceClassification |
| Token Classification | DLMLlamaForTokenClassification |
| Question Answering | DLMLlamaForQuestionAnswering |
All classes are in hf_model_llama.modeling_dlm_llama.
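The classification and QA classes load the same way as the LM classes. A minimal sketch for sequence classification, assuming a hypothetical two-label task (num_labels and the example sentence are placeholders; the classification head is randomly initialized and needs fine-tuning):

```python
import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForSequenceClassification

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# num_labels=2 is a placeholder for an arbitrary downstream task.
config = LlamaConfig.from_pretrained(model_id, num_labels=2)
model = DLMLlamaForSequenceClassification.from_pretrained(
    model_id, config=config, trust_remote_code=True
)
model.eval()

inputs = tokenizer("A placeholder sentence to classify.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, 2): untrained scores for the two placeholder labels
```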
Architecture
The model follows the standard modern transformer recipe with no modifications:
- Pre-normalization with RMSNorm (with learnable scale)
- Rotary Positional Embeddings (RoPE) with θ = 160,000
- SwiGLU feed-forward layers
- Classifier head: RMSNorm → Linear + GELU(tanh) → LM head (with bias)
- No tied embeddings (tie_weights = false)
The only difference between autoregressive and masked-diffusion modes is the input (original vs. partially masked tokens) and the attention mask (causal vs. bidirectional). Both modes predict the next token at each position.
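To make that difference concrete, here is a minimal sketch of the two attention patterns for a short sequence; the mask construction below is illustrative and not part of the released code:

```python
import torch

seq_len = 6

# Autoregressive mode: causal (lower-triangular) mask,
# position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Masked-diffusion / MNTP mode: bidirectional mask,
# every position may attend to every position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# In both modes the hidden state at position i is trained to predict
# the token at position i + 1 (next-token prediction).
print(causal_mask.int())
print(bidirectional_mask.int())
```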
Training Details
| Hyperparameter | Value |
|---|---|
| Optimizer | Muon (Liu et al., 2025) |
| Learning rate | 0.007 |
| LR schedule | Warmup-Stable-Decay (no warmup, 2,048 decay steps) |
| Total steps | 8,192 |
| Batch size | 4M tokens (global) |
| Sequence length | 2,048 |
| Weight decay | 0.1 |
| Z-loss | 1e-4 |
| Precision | bfloat16 |
| Hardware | 128× AMD MI250X GPUs (256 logical devices) |
| Training corpus | HPLT v2 (English subset) |
The α parameter controls the split of GPUs between objectives: with 256 logical devices, α = 63/64 means 252 devices run autoregressive and 4 run masked-diffusion.
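A quick arithmetic sketch of that device split (illustrative only):

```python
n_devices = 256   # logical devices: 128 MI250X GPUs, two logical devices each
alpha = 63 / 64   # autoregressive weight

n_autoregressive = round(alpha * n_devices)        # 252 devices
n_masked_diffusion = n_devices - n_autoregressive  # 4 devices
print(n_autoregressive, n_masked_diffusion)        # 252 4
```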
Evaluation Results
Autoregressive (Unidirectional) Performance
Normalized scores where 0% = random baseline, 100% = perfect.
| Model | ARC-C | ARC-E | BLiMP | CSQA | HSwag | MMLU | OBQA | PIQA | SIQA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Dual (α=63/64, 1 rep) | 5.7 | 28.6 | 63.7 | 35.1 | 31.1 | 4.9 | 17.6 | 40.9 | 14.3 | 26.9 |
| AR-only (α=1, 1 rep) | 5.9 | 30.3 | 61.3 | 33.5 | 31.7 | 3.8 | 13.6 | 39.4 | 15.2 | 26.1 |
| Dual (α=3/4, 32 reps) | 3.3 | 28.0 | 57.9 | 31.1 | 26.4 | 3.6 | 14.4 | 36.1 | 14.6 | 23.9 |
| AR-only (α=1, 32 reps) | 5.0 | 24.9 | 53.3 | 28.5 | 25.4 | 3.8 | 9.9 | 33.3 | 14.2 | 22.0 |
| Dual (α=1/8, 128 reps) | 1.7 | 23.6 | 56.1 | 24.8 | 14.2 | 1.6 | 8.5 | 28.1 | 13.3 | 19.1 |
| AR-only (α=1, 128 reps) | -1.0 | 12.3 | 33.2 | 6.8 | 8.1 | 1.1 | -0.5 | 15.8 | 8.9 | 9.4 |
Key Findings
- Regular data (≤16 reps): A small amount of masked-diffusion (α ≈ 63/64) improves bidirectional performance without losing any autoregressive quality.
- Data-constrained (≥32 reps): Choose α so the autoregressive objective sees ~16 effective repetitions. The dual model dramatically outperforms pure autoregressive training (19.1 vs 9.4 average at 128 reps).
- Prefix attention: Dual-objective models reliably gain about one additional percentage point by processing the context bidirectionally at inference time, with no additional training needed.
Practical Recommendations from the Paper
Remark 1 (Regular data settings): Train with α ≈ 63/64 to gain strong bidirectional performance without losing autoregressive quality.
Remark 2 (Data-constrained settings): Choose α that exposes the autoregressive objective to roughly 16 repetitions of the training data.
Remark 3 (Prefix language modeling): At inference time, process the conditional part of the prompt fully bidirectionally for improved autoregressive performance.
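In practice, prefix attention amounts to a block mask: positions inside the conditional prompt attend to each other bidirectionally, while later positions stay causal. A minimal sketch of such a mask (the helper is illustrative and not a function exposed by the released model classes):

```python
import torch

def prefix_lm_mask(prefix_len: int, seq_len: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed): the first
    `prefix_len` positions attend bidirectionally among themselves,
    all later positions attend causally."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # full attention inside the prefix
    return mask

print(prefix_lm_mask(prefix_len=3, seq_len=6).int())
```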
Citation
```bibtex
@inproceedings{samuel2026dualobjective,
  title={Dual-objective Language Models: Training Efficiency Without Overfitting},
  author={David Samuel and Lucas Georges Gabriel Charpentier},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=BrPt0GFgOM}
}
```