NeoLLM
NeoLLM is a 116 M parameter decoder-only language model trained from scratch on FineWeb-Edu in FP8 precision, completing training in approximately 4 hours (~3h 51m) on a single NVIDIA RTX 5090. It integrates a collection of recently published attention and normalization techniques into a single architecture, with the goal of studying how they interact during pretraining. The model is under active development, and the current checkpoint represents an intermediate training state.
Author / contact: @Kyokopom on X
Repository: KitsuVp/NeoLLM
Architecture
NeoLLM is a decoder-only transformer with the following configuration:
| Parameter | Value |
|---|---|
| Hidden size | 512 |
| Layers | 12 |
| Attention heads | 8 |
| KV heads (GQA) | 4 |
| Head dim | 64 |
| Intermediate size | 1536 |
| Vocabulary | Qwen3 tokenizer (64,402 tokens) |
| Context length | 512 tokens |
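
As a quick consistency check on the table, the configuration can be written down as a small Python object. This is only a sketch: the class name `NeoLLMConfigSketch` and its helper properties are hypothetical, and the derived widths assume a standard grouped-query attention layout, which the additional head-mixing modules in NeoLLM may alter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLLMConfigSketch:
    # Values copied from the configuration table; the class itself is hypothetical.
    hidden_size: int = 512
    num_layers: int = 12
    num_attention_heads: int = 8
    num_kv_heads: int = 4
    head_dim: int = 64
    intermediate_size: int = 1536
    vocab_size: int = 64_402
    context_length: int = 512

    @property
    def q_width(self) -> int:
        # Total width of the query projection across all heads.
        return self.num_attention_heads * self.head_dim

    @property
    def kv_width(self) -> int:
        # Width of each of the K and V projections under plain GQA (assumption).
        return self.num_kv_heads * self.head_dim

    @property
    def queries_per_kv_head(self) -> int:
        # Number of query heads that share one KV head.
        return self.num_attention_heads // self.num_kv_heads


cfg = NeoLLMConfigSketch()
print(cfg.q_width, cfg.kv_width, cfg.queries_per_kv_head)
```

Under these assumptions the query width equals the hidden size, and each KV head is shared by two query heads.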
Parameter breakdown
| Parameter bucket | Count |
|---|---|
| Total parameters | 116.22M (116,216,184) |
| Embedding parameters (tied) | 32.97M (32,973,824) |
| Non-embedding parameters | 83.24M (83,242,360) |
| Effective trainable parameters | 116.22M (116,216,184) |
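Weight tying is enabled: the input embedding matrix and the language-model head share the same parameters, so the embedding matrix is counted once and all 116.22M parameters are trainable. Excluding the tied embedding leaves the non-embedding budget: total − embed = 116.22M − 32.97M = 83.24M.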
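
The figures in the parameter table can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from this card; the "untied" figure is a hypothetical comparison that assumes an untied head would be a bias-free matrix of the same shape as the embedding.

```python
vocab_size = 64_402          # Qwen3 tokenizer size from the configuration table
hidden_size = 512
total_params = 116_216_184   # total from the parameter breakdown

# With weight tying, the embedding matrix is stored and counted once.
embedding_params = vocab_size * hidden_size
non_embedding_params = total_params - embedding_params

# Hypothetical: an untied LM head of the same shape would add one more copy.
untied_total = total_params + embedding_params

print(embedding_params)
print(non_embedding_params)
print(untied_total)
```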
Integrated techniques
NeoLLM combines architecture modules, optional auxiliary objectives, and training-time optimizer/stability components from the following papers.
Embedding and token representation
- Learnable Multipliers (arXiv:2601.04890) — Adds per-row and per-column learnable scalar parameters to selected matrix layers and, when enabled, embeddings.
- Leviathan (arXiv:2601.22040) — Optional continuous token embedding generator that can replace the discrete input lookup table.
- KHRONOS (arXiv:2505.13315) — Kernel/basis reference used by the Leviathan continuous token generator implementation.
- JTok / JTok-M (arXiv:2602.00800) — Optional token-indexed self-modulation surfaces over Leviathan coordinates.
- Spelling Bee Embeddings (arXiv:2601.18030) — Augments token embeddings with character-level spelling information.
- Token Embedding Manifold analysis (arXiv:2504.01002) — Reference motivation for treating token embeddings as structured objects rather than unconstrained lookup rows.
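
To make the Learnable Multipliers entry concrete, here is a minimal pure-Python sketch of per-row and per-column scalars applied to a weight matrix. The function name, the toy values, and the choice to start all multipliers at 1.0 are assumptions for illustration; the paper and the NeoLLM code may parameterize or initialize them differently.

```python
def apply_multipliers(weight, row_scale, col_scale):
    """Return W_eff[i][j] = row_scale[i] * weight[i][j] * col_scale[j]."""
    return [
        [row_scale[i] * w * col_scale[j] for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


# Toy 2x3 weight matrix (illustrative values only).
weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# With all multipliers at 1.0 the layer behaves exactly like the plain matrix.
unchanged = apply_multipliers(weight, [1.0, 1.0], [1.0, 1.0, 1.0])

# Training could then move individual rows or columns up or down in scale.
rescaled = apply_multipliers(weight, [2.0, 1.0], [1.0, 0.5, 1.0])
print(rescaled)
```

The extra cost is one scalar per row and per column, which is small next to the matrix itself.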
Attention, positions, and output projection
- FAN (arXiv:2502.21309) — Fourier Analysis Networks. A portion of the projection channels are dedicated to periodic cosine/sine features.
- MEA (arXiv:2601.19611) — Explicit Multi-head Attention. Adds small learnable interaction matrices between attention heads for K and V.
- LUCID (arXiv:2602.10410) — Applies a learned lower-triangular preconditioner to V before attention, decorrelating value representations across positions.
- Affine-Scaled Attention (arXiv:2602.23057) — Adds two learnable per-head scalars (α and β) to the softmax weights: [α·softmax(QKᵀ) + β]·V.
- XSA (arXiv:2603.09078) — Exclusive Self Attention. After computing attention, removes the component of the output aligned with the token's own value vector.
- Directional Routing (arXiv:2603.14923) — Each head learns K=4 directions in the output space; a learned router suppresses the attention output along each direction per input.
- Gated Attention (arXiv:2505.06708) — A sigmoid gate is applied to the attention output before the output projection, introducing non-linearity and preventing attention sinks.
- Momentum Attention (arXiv:2411.03884) — Modifies Q and K by subtracting a fraction of the previous position's Q and K values (causal first-difference).
- Interleaved Head Attention / IHA (arXiv:2602.21371) — Builds pseudo-heads from learned cross-head mixtures to create multiple attention patterns per original head.
- REPO (arXiv:2512.14391) — Context re-positioning module that learns contextual position coordinates above a configurable start layer.
- GRAPE (arXiv:2512.07805) — Group representational position encoding used by the REPO-GRAPE positional path.
- GOAT priors (arXiv:2601.15380) — Optional factorized attention log-prior channels inspired by trainable attention priors.
- Hadamard output projection (arXiv:2603.08343) — Replaces dense attention output projection with a structured Hadamard transform plus lightweight scaling.
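
Three of the entries above are described precisely enough to sketch for a single head: the Momentum Attention first-difference on Q and K, the affine-scaled softmax weights, and the XSA removal of the self-value component. The code below is a toy pure-Python illustration of those one-line descriptions, not the papers' or the repository's implementation. The function names, the 1/√d score scaling, treating the position before the first token as zero, and applying α and β only to causally visible positions are all assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def first_difference(x, gamma):
    """Momentum-style step: x_t - gamma * x_{t-1}, with x_{-1} taken as zero."""
    out = [list(x[0])]
    for t in range(1, len(x)):
        out.append([cur - gamma * prev for cur, prev in zip(x[t], x[t - 1])])
    return out


def causal_softmax(q, k):
    """Row i is the softmax of q_i . k_j / sqrt(d) over visible positions j <= i."""
    d = len(q[0])
    rows = []
    for i, qi in enumerate(q):
        scores = [dot(qi, k[j]) / math.sqrt(d) for j in range(i + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def affine_scaled_attention(q, k, v, alpha=1.0, beta=0.0):
    """[alpha * softmax(QK^T) + beta] V, applied to visible positions only."""
    out = []
    for row in causal_softmax(q, k):
        mixed = [alpha * w + beta for w in row]
        out.append([sum(m * v[j][c] for j, m in enumerate(mixed))
                    for c in range(len(v[0]))])
    return out


def remove_self_value_component(y, v, eps=1e-12):
    """XSA-style step: subtract from y_i its projection onto the token's own v_i."""
    out = []
    for yi, vi in zip(y, v):
        coef = dot(yi, vi) / (dot(vi, vi) + eps)
        out.append([a - coef * b for a, b in zip(yi, vi)])
    return out


# Toy single-head inputs: 3 positions, head dimension 2 (illustrative values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
v = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]

q_m = first_difference(q, gamma=0.5)
k_m = first_difference(k, gamma=0.5)
y = affine_scaled_attention(q_m, k_m, v, alpha=0.9, beta=0.05)
y_xsa = remove_self_value_component(y, v)

# After the XSA-style step each output is orthogonal to its own value vector.
for yi, vi in zip(y_xsa, v):
    print(round(dot(yi, vi), 9))
```

With α = 1 and β = 0 the affine-scaled function reduces to ordinary causal softmax attention, which is a convenient sanity check when experimenting with the scalars.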
Normalization, residual flow, and MLP
- SeeDNorm (arXiv:2510.22777) — Applied to Q and K projections. Dynamically rescales normalization from the input's own statistics.
- LayerNorm Scaling / LNS (arXiv:2502.05795) — Each layer's output is scaled by 1/√ℓ where ℓ is the layer index.
- GPAS (arXiv:2506.22049) — Gradient-Preserving Activation Scaling for residual junctions.
- PolyNorm (arXiv:2602.04902) — Replaces the standard MLP activation with normalized linear, quadratic, and cubic branches.
- SimpleGPT (arXiv:2602.01212) — Second-order geometry-inspired normalization strategy applied inside MLP projections.
- StackMemory / STACKTRANS (NeurIPS 2025) — Optional differentiable hidden-state stack between decoder layers.
- Attention Residuals / AttnRes (arXiv:2603.15031) — Optional learned depth-wise aggregation over previous layer outputs or block summaries.
- LAUREL (arXiv:2411.07501) — Optional learned augmented residual layer with residual-weight and low-rank variants.
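
Two of the items above reduce to short formulas and can be sketched directly: the LNS factor 1/√ℓ and a PolyNorm-style activation built from normalized linear, quadratic, and cubic branches. This is a rough illustration under stated assumptions: the layer index is taken as 1-based, RMS normalization is used for each branch, and the branch weights and function names are hypothetical rather than taken from the papers or the NeoLLM code.

```python
import math


def lns_scale(layer_index):
    """LayerNorm Scaling factor 1/sqrt(l) for a 1-based layer index l."""
    if layer_index < 1:
        raise ValueError("layer_index is assumed to start at 1")
    return 1.0 / math.sqrt(layer_index)


def rms_normalize(x, eps=1e-6):
    """Divide a vector by its root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def polynorm_sketch(x, weights=(1.0, 1.0, 1.0), bias=0.0):
    """Weighted sum of normalized x, x**2 and x**3 branches (elementwise powers)."""
    branches = [rms_normalize([v ** p for v in x]) for p in (1, 2, 3)]
    return [
        sum(w * b[i] for w, b in zip(weights, branches)) + bias
        for i in range(len(x))
    ]


# Scale applied at each of the 12 layers: deeper layers are damped more.
for layer in range(1, 13):
    print(layer, round(lns_scale(layer), 4))

print(polynorm_sketch([0.5, -1.0, 2.0]))
```

Because each branch is normalized separately, the branch weights control the relative contribution of each polynomial order regardless of the input scale.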
Training objectives and training-time regularizers
- TWEO (arXiv:2511.23225) — Optional Transformers Without Extreme Outliers activation regularizer for FP8/low-bit-friendly training.
- NITP (arXiv:2605.24956) — Optional Next Implicit Token Prediction auxiliary objective using shallow-layer implicit token targets and a cosine loss.
- NextLat (arXiv:2511.05963) — Optional next-latent prediction objective using latent dynamics, Smooth L1 supervision, and frozen-head KL.
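
The entries above name two loss shapes, a cosine loss (NITP) and Smooth L1 supervision (NextLat). The sketch below shows only those generic losses in pure Python. How the implicit-token or latent targets are built is not described in this card, so the vectors here are placeholders, the `beta` threshold is an assumed default, and the frozen-head KL term is not covered.

```python
import math


def cosine_loss(pred, target, eps=1e-8):
    """1 - cosine similarity between a predicted and a target vector."""
    num = sum(p * t for p, t in zip(pred, target))
    den = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - num / (den + eps)


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


# Placeholder vectors standing in for a prediction and its target.
pred = [0.2, -0.4, 1.0]
target = [0.1, -0.5, 3.0]
print(cosine_loss(pred, target))
print(smooth_l1(pred, target))
```

The cosine loss ignores vector magnitude, while Smooth L1 penalizes it with limited sensitivity to large deviations.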
Optimizer and training stability
- Conda (arXiv:2509.24218) — Column-Normalized Adam optimizer path used by the training script.
- Cautious Weight Decay (arXiv:2510.12402) — Sign-selective weight decay variant used by the custom optimizer logic.
- Correction of Decoupled Weight Decay (arXiv:2512.08217) — Adapts decoupled weight decay during learning-rate decay.
- AdamHD (arXiv:2511.14721) — Decoupled Huber decay regularization reference used by the optimizer.
- GradientStabilizer (arXiv:2502.17055) — Optional threshold-free gradient magnitude stabilizer.
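
As an illustration of what "sign-selective" weight decay can mean, the sketch below applies decay only to coordinates where the optimizer update and the parameter have the same sign. This is one reading of Cautious Weight Decay and is stated here as an assumption; the function name is hypothetical, and the custom optimizer logic in the NeoLLM training script may differ in detail.

```python
def cautious_decay_step(params, updates, lr=6e-4, weight_decay=0.1):
    """One step of p <- p - lr * (u + mask * weight_decay * p).

    mask is 1 where the update u and the parameter p share a sign
    (assumed rule), and 0 elsewhere, so decay is skipped there.
    """
    new_params = []
    for p, u in zip(params, updates):
        decay = weight_decay * p if p * u > 0 else 0.0
        new_params.append(p - lr * (u + decay))
    return new_params


# Toy values; lr and weight_decay are enlarged so the effect is visible.
params = [1.0, -2.0, 3.0]
updates = [0.5, 0.5, -1.0]
print(cautious_decay_step(params, updates, lr=0.1, weight_decay=0.5))
```

In this toy example only the first coordinate is decayed. The defaults mirror the learning rate and weight decay listed in the Training table.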
Training
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens seen | ~1.54B (46,875 steps × batch 64 × length 512) |
| Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback |
| Optimizer | Conda (Column-Normalized Adam) |
| Learning rate | 6e-04 with linear warmup (10 % of steps) |
| Weight decay | 0.1 |
| Training time | ~3h 51m |
| Hardware | NVIDIA RTX 5090 (single GPU) |
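
The token count and warmup length follow from the settings in the table, as the short sketch below shows. The card does not specify the schedule after warmup, so `warmup_lr` simply holds the peak value afterwards; that behavior, and the function name, are placeholders rather than the actual training schedule.

```python
steps = 46_875
batch_size = 64
seq_len = 512
peak_lr = 6e-4
warmup_fraction = 0.10

tokens_per_step = batch_size * seq_len
tokens_seen = steps * tokens_per_step
warmup_steps = int(steps * warmup_fraction)


def warmup_lr(step):
    """Linear warmup to peak_lr over warmup_steps.

    The post-warmup schedule is not given in the card, so this sketch
    holds the peak value (assumption).
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


print(tokens_per_step, tokens_seen, warmup_steps)
print(warmup_lr(0), warmup_lr(warmup_steps))
```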
Training curve
| Step | Train Loss | Val Loss |
|---|---|---|
| 5,000 | 4.225 | 4.124 |
| 10,000 | 3.862 | 3.774 |
| 15,000 | 3.749 | 3.664 |
| 20,000 | 3.699 | 3.611 |
| 25,000 | 3.668 | 3.582 |
| 30,000 | 3.647 | 3.559 |
| 35,000 | 3.635 | 3.539 |
| 40,000 | 3.552 | 3.476 |
| 45,000 | 3.512 | 3.429 |
| 46,875 | — | 3.422 |
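
For readers who prefer perplexity, the validation losses can be converted with a one-line exponential. This sketch assumes the reported values are mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

# Validation losses copied from the training curve table.
val_loss = {
    5_000: 4.124, 10_000: 3.774, 15_000: 3.664, 20_000: 3.611,
    25_000: 3.582, 30_000: 3.559, 35_000: 3.539, 40_000: 3.476,
    45_000: 3.429, 46_875: 3.422,
}

# Perplexity = exp(loss) if the loss is in nats per token (assumption).
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in perplexity.items():
    print(f"{step:>6}  {ppl:.2f}")
```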
Limitations
- Token budget — ~1.54 B tokens seen, which is below the estimated optimal token budget. Performance on knowledge-intensive tasks should improve with more training.
- Gradient spike at step 40k — A gradient spike around step 40k reorganized the attention pattern in layer 9, which had previously captured long-range token correlations. A checkpoint from ~step 38k is expected to have better aggregate benchmark scores.
- PolyNorm exclusivity — The quadratic branch has become partially redundant with the linear branch. This is planned to be corrected in the next training run.
- Base model only — Not instruction-tuned or aligned; purely a next-token-prediction base model.
References
All papers whose techniques are integrated into NeoLLM's architecture, training objective, or training stack:
| Area | Technique | Paper title | Reference |
|---|---|---|---|
| Embeddings | Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | arXiv:2601.04890 |
| Embeddings | Leviathan | A Separable Architecture for Continuous Token Representation in Language Models | arXiv:2601.22040 |
| Embeddings | KHRONOS | KHRONOS: a Kernel-Based Neural Architecture for Rapid, Resource-Efficient Scientific Computation | arXiv:2505.13315 |
| Embeddings | JTok / JTok-M | JTok: On Token Embedding as Another Axis of Scaling Law via Joint Token Self-Modulation | arXiv:2602.00800 |
| Embeddings | Spelling Bee | Spelling Bee Embeddings for Language Modeling | arXiv:2601.18030 |
| Embeddings | Token embedding analysis | Token Embeddings Violate the Manifold Hypothesis | arXiv:2504.01002 |
| Attention / positions | FAN | Fourier Analysis Networks | arXiv:2502.21309 |
| Attention / positions | MEA | Explicit Multi-head Attention for Inter-head Interaction in Large Language Models | arXiv:2601.19611 |
| Attention / positions | LUCID | Attention with Preconditioned Representations | arXiv:2602.10410 |
| Attention / positions | Affine-Scaled Attention | Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention | arXiv:2602.23057 |
| Attention / positions | XSA | Exclusive Self Attention | arXiv:2603.09078 |
| Attention / positions | Directional Routing | Directional Routing in Transformers | arXiv:2603.14923 |
| Attention / positions | Gated Attention | Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free | arXiv:2505.06708 |
| Attention / positions | Momentum Attention | Momentum Attention | arXiv:2411.03884 |
| Attention / positions | IHA | Interleaved Head Attention | arXiv:2602.21371 |
| Attention / positions | REPO | Language Models with Context Re-Positioning | arXiv:2512.14391 |
| Attention / positions | GRAPE | Group Representational Position Encoding | arXiv:2512.07805 |
| Attention / positions | GOAT priors | You Need Better Attention Priors | arXiv:2601.15380 |
| Attention / positions | Hadamard o_proj | Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers | arXiv:2603.08343 |
| Residual / normalization | SeeDNorm | Self-Rescaled Dynamic Normalization | arXiv:2510.22777 |
| Residual / normalization | LNS | The Curse of Depth in LLMs | arXiv:2502.05795 |
| Residual / normalization | GPAS | Gradient-Preserving Activation Scaling | arXiv:2506.22049 |
| Residual / normalization | PolyNorm | PolyNorm / PolyCom | arXiv:2602.04902 |
| Residual / normalization | SimpleGPT | SimpleGPT | arXiv:2602.01212 |
| Residual / normalization | StackMemory / STACKTRANS | Recursive Transformer: Boosting Reasoning Ability with State Stack | NeurIPS 2025 |
| Residual / normalization | Attention Residuals | Attention Residuals | arXiv:2603.15031 |
| Residual / normalization | LAUREL | LAUREL: Learned Augmented Residual Layer | arXiv:2411.07501 |
| Objectives | TWEO | Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies | arXiv:2511.23225 |
| Objectives | NITP | Next Implicit Token Prediction for LLM Pre-training | arXiv:2605.24956 |
| Objectives | NextLat | Next-Latent Prediction Transformers Learn Compact World Models | arXiv:2511.05963 |
| Optimizer / training | Conda | Column-Normalized Adam for Training Large Language Models Faster | arXiv:2509.24218 |
| Optimizer / training | CWD | Cautious Weight Decay | arXiv:2510.12402 |
| Optimizer / training | WD correction | Correction of Decoupled Weight Decay | arXiv:2512.08217 |
| Optimizer / training | AdamHD | AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training | arXiv:2511.14721 |
| Optimizer / training | GradientStabilizer | GradientStabilizer | arXiv:2502.17055 |
Citation
@misc{neollm2026,
title = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
author = {KitsuVp},
year = {2026},
url = {https://huggingface.co/KitsuVp/NeoLLM}
}
Author
@Kyokopom on X
License
Apache 2.0