Dual-objective Language Model (470M)
This is the model repository for the paper Dual-Objective Language Models: Training Efficiency Without Overfitting, published at ICLR 2026.
The dual-objective approach combines autoregressive and masked-diffusion training objectives within a single standard transformer — no architectural changes required. The resulting model can be used as a causal language model, a masked (MNTP) language model, or with prefix attention at inference time.
| Authors | David Samuel and Lucas Georges Gabriel Charpentier |
| Paper | arXiv:2512.14549 |
| Code | github.com/ltgoslo/dual-language-models |
| License | Apache 2.0 |
Model Overview
- Parameters: 470M total (360M non-embedding)
- Layers: 24
- Hidden size: 1024
- Attention heads: 16 (head dim = 64)
- Intermediate size: 3554 (SwiGLU)
- Vocabulary: 51,200 BPE tokens
- Context length: 2,048 tokens
- Training tokens: 32 billion
- Positional encoding: RoPE (θ = 160,000)
- Normalization: RMSNorm (pre-norm, ε = 1e-7)
Main Branch
The main branch contains the model trained with the recommended configuration for regular data settings (Remark 1 in the paper):
| Setting | Value |
|---|---|
| α (autoregressive weight) | 63/64 (0.984375) |
| Data repetitions | 1× (32B total tokens) |
| Training objective | 98.4% autoregressive + 1.6% masked-diffusion |
This configuration achieves strong autoregressive performance while gaining bidirectional capabilities essentially for free.
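The training run realizes this split by assigning whole devices to each objective (see Training Details below), but the effective objective can be read as an α-weighted mixture of the two losses. Below is a conceptual sketch of that weighting only; the loss values are placeholders, not outputs of the released code.

```python
import torch

# Conceptual sketch of the α-weighted objective on the main branch.
# The released run assigns whole devices to each objective rather than
# mixing losses in one batch (see Training Details); values are placeholders.
alpha = 63 / 64  # autoregressive weight

loss_autoregressive = torch.tensor(2.31)    # placeholder cross-entropy (causal batch)
loss_masked_diffusion = torch.tensor(2.95)  # placeholder cross-entropy (masked batch)

loss = alpha * loss_autoregressive + (1.0 - alpha) * loss_masked_diffusion
print(f"{alpha:.1%} autoregressive + {1 - alpha:.1%} masked-diffusion -> {loss.item():.4f}")
```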
All Trained Models
All 50 models from the paper are available as branches. Each branch is named alpha_{α:.3f}_n-reps={R}.
The models span:
- Repetitions (R): 1, 2, 4, 8, 16, 32, 64, 128, 256
- Alpha (α): 0 (pure masked-diffusion) to 1 (pure autoregressive), with values at 0, 1/256, 1/64, 1/16, 1/8, 1/4, 1/2, 3/4, 15/16, 63/64, 255/256, 1
For example:
- alpha_0.750_n-reps=32: dual model (α=3/4) trained with 32 data repetitions
- alpha_0.125_n-reps=128: dual model (α=1/8) trained with 128 data repetitions
- alpha_1.000_n-reps=1: pure autoregressive baseline with 1 repetition
- alpha_0.000_n-reps=128: pure masked-diffusion with 128 repetitions
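Since the naming convention is fixed, branch names can also be generated programmatically. A small sketch (the helper is illustrative, not part of the repository; the two examples are branches listed above):

```python
def branch_name(alpha: float, n_reps: int) -> str:
    """Format a branch name following the alpha_{α:.3f}_n-reps={R} convention."""
    return f"alpha_{alpha:.3f}_n-reps={n_reps}"

print(branch_name(3 / 4, 32))    # alpha_0.750_n-reps=32
print(branch_name(1 / 8, 128))   # alpha_0.125_n-reps=128
```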
Recommended Configurations
| Regime | Repetitions | Recommended α | Branch |
|---|---|---|---|
| Regular data (≤16 reps) | 1× | 63/64 (0.984375) | main |
| Data-constrained (32 reps) | 32× | 3/4 (0.75) | alpha_0.750_n-reps=32 |
| Data-constrained (128 reps) | 128× | 1/8 (0.125) | alpha_0.125_n-reps=128 |
Usage
The implementation builds on the LLaMA backbone (LlamaModel from transformers) and adds a custom GELU projection head. All model classes require trust_remote_code=True.
Loading the Model
```python
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
```
Text Generation (Autoregressive)
```python
import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

prompt = "The history of artificial intelligence begins with"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Masked Language Model (Bidirectional / MNTP)
The model can also be used bidirectionally by loading it as DLMLlamaForMaskedLM. This uses full (non-causal) attention, enabling the model to attend to all positions.
```python
import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForMaskedLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForMaskedLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

# Mask token id
mask_id = tokenizer.convert_tokens_to_ids("<mask>")

# Predict masked tokens with bidirectional context
text = "The capital of <mask> is Paris"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The model uses MNTP: prediction at position i is for token i+1
# Find the mask position and get the prediction from the position before it
mask_pos = (inputs["input_ids"] == mask_id).nonzero(as_tuple=True)[1].item()
predicted_id = outputs.logits[0, mask_pos - 1].argmax(dim=-1)
print(f"Predicted: {tokenizer.decode(predicted_id)}")
```
Perplexity / Log-Likelihood Scoring
```python
import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = LlamaConfig.from_pretrained(model_id)
model = DLMLlamaForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
model.eval()

text = "Language models predict the next token in a sequence."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"Loss (NLL per token): {outputs.loss.item():.4f}")
print(f"Perplexity: {torch.exp(outputs.loss).item():.2f}")
```
Loading a Specific Branch
```python
from transformers import LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForCausalLM

# Load the data-constrained model (α=3/4, 32 repetitions)
model_id = "ltg/dual-lm-470m"
config = LlamaConfig.from_pretrained(model_id, revision="alpha_0.750_n-reps=32")
model = DLMLlamaForCausalLM.from_pretrained(
    model_id,
    revision="alpha_0.750_n-reps=32",
    config=config,
    trust_remote_code=True,
)
```
Available Model Classes
| Task | Class |
|---|---|
| Causal LM (generation) | DLMLlamaForCausalLM |
| Masked LM (bidirectional) | DLMLlamaForMaskedLM |
| Sequence Classification | DLMLlamaForSequenceClassification |
| Token Classification | DLMLlamaForTokenClassification |
| Question Answering | DLMLlamaForQuestionAnswering |
All classes are in hf_model_llama.modeling_dlm_llama.
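The classification and QA classes load the same way as the LM classes. A minimal sketch for sequence classification, assuming a hypothetical two-label task (num_labels and the example sentence are placeholders; the classification head is randomly initialized and needs fine-tuning):

```python
import torch
from transformers import AutoTokenizer, LlamaConfig
from hf_model_llama.modeling_dlm_llama import DLMLlamaForSequenceClassification

model_id = "ltg/dual-lm-470m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# num_labels=2 is a placeholder for an arbitrary downstream task.
config = LlamaConfig.from_pretrained(model_id, num_labels=2)
model = DLMLlamaForSequenceClassification.from_pretrained(
    model_id, config=config, trust_remote_code=True
)
model.eval()

inputs = tokenizer("A placeholder sentence to classify.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, 2): untrained scores for the two placeholder labels
```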
Architecture
The model follows the standard modern transformer recipe with no modifications:
- Pre-normalization with RMSNorm (with learnable scale)
- Rotary Positional Embeddings (RoPE) with θ = 160,000
- SwiGLU feed-forward layers
- Classifier head: RMSNorm → Linear + GELU(tanh) → LM head (with bias)
- No tied embeddings (tie_weights = false)
The only difference between autoregressive and masked-diffusion modes is the input (original vs. partially masked tokens) and the attention mask (causal vs. bidirectional). Both modes predict the next token at each position.
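To make that difference concrete, here is a minimal sketch of the two attention patterns for a short sequence; the mask construction below is illustrative and not part of the released code:

```python
import torch

seq_len = 6

# Autoregressive mode: causal (lower-triangular) mask,
# position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Masked-diffusion / MNTP mode: bidirectional mask,
# every position may attend to every position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# In both modes the hidden state at position i is trained to predict
# the token at position i + 1 (next-token prediction).
print(causal_mask.int())
print(bidirectional_mask.int())
```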
Training Details
| Hyperparameter | Value |
|---|---|
| Optimizer | Muon (Liu et al., 2025) |
| Learning rate | 0.007 |
| LR schedule | Warmup-Stable-Decay (no warmup, 2,048 decay steps) |
| Total steps | 8,192 |
| Batch size | 4M tokens (global) |
| Sequence length | 2,048 |
| Weight decay | 0.1 |
| Z-loss | 1e-4 |
| Precision | bfloat16 |
| Hardware | 128× AMD MI250X GPUs (256 logical devices) |
| Training corpus | HPLT v2 (English subset) |
The α parameter controls the split of GPUs between objectives: with 256 logical devices, α = 63/64 means 252 devices run autoregressive and 4 run masked-diffusion.
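A quick arithmetic sketch of that device split (illustrative only):

```python
n_devices = 256   # logical devices: 128 MI250X GPUs, two logical devices each
alpha = 63 / 64   # autoregressive weight

n_autoregressive = round(alpha * n_devices)        # 252 devices
n_masked_diffusion = n_devices - n_autoregressive  # 4 devices
print(n_autoregressive, n_masked_diffusion)        # 252 4
```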
Evaluation Results
Autoregressive (Unidirectional) Performance
Normalized scores where 0% = random baseline, 100% = perfect.
| Model | ARC-C | ARC-E | BLiMP | CSQA | HSwag | MMLU | OBQA | PIQA | SIQA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Dual (α=63/64, 1 rep) | 5.7 | 28.6 | 63.7 | 35.1 | 31.1 | 4.9 | 17.6 | 40.9 | 14.3 | 26.9 |
| AR-only (α=1, 1 rep) | 5.9 | 30.3 | 61.3 | 33.5 | 31.7 | 3.8 | 13.6 | 39.4 | 15.2 | 26.1 |
| Dual (α=3/4, 32 reps) | 3.3 | 28.0 | 57.9 | 31.1 | 26.4 | 3.6 | 14.4 | 36.1 | 14.6 | 23.9 |
| AR-only (α=1, 32 reps) | 5.0 | 24.9 | 53.3 | 28.5 | 25.4 | 3.8 | 9.9 | 33.3 | 14.2 | 22.0 |
| Dual (α=1/8, 128 reps) | 1.7 | 23.6 | 56.1 | 24.8 | 14.2 | 1.6 | 8.5 | 28.1 | 13.3 | 19.1 |
| AR-only (α=1, 128 reps) | -1.0 | 12.3 | 33.2 | 6.8 | 8.1 | 1.1 | -0.5 | 15.8 | 8.9 | 9.4 |
Key Findings
- Regular data (≤16 reps): A small amount of masked-diffusion (α ≈ 63/64) improves bidirectional performance without losing any autoregressive quality.
- Data-constrained (≥32 reps): Choose α so the autoregressive objective sees ~16 effective repetitions. The dual model dramatically outperforms pure autoregressive training (19.1 vs 9.4 average at 128 reps).
- Prefix attention: Dual-objective models reliably gain about one additional percentage point by processing the context bidirectionally at inference time, with no additional training needed.
Practical Recommendations from the Paper
Remark 1 (Regular data settings): Train with α ≈ 63/64 to gain strong bidirectional performance without losing autoregressive quality.
Remark 2 (Data-constrained settings): Choose α that exposes the autoregressive objective to roughly 16 repetitions of the training data.
Remark 3 (Prefix language modeling): At inference time, process the conditional part of the prompt fully bidirectionally for improved autoregressive performance.
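In practice, prefix attention amounts to a block mask: positions inside the conditional prompt attend to each other bidirectionally, while later positions stay causal. A minimal sketch of such a mask (the helper is illustrative and not a function exposed by the released model classes):

```python
import torch

def prefix_lm_mask(prefix_len: int, seq_len: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed): the first
    `prefix_len` positions attend bidirectionally among themselves,
    all later positions attend causally."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # full attention inside the prefix
    return mask

print(prefix_lm_mask(prefix_len=3, seq_len=6).int())
```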
Citation
```bibtex
@inproceedings{samuel2026dualobjective,
  title={Dual-objective Language Models: Training Efficiency Without Overfitting},
  author={David Samuel and Lucas Georges Gabriel Charpentier},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=BrPt0GFgOM}
}
```