NeoLLM

NeoLLM is a roughly 116M-parameter decoder-only language model trained from scratch on FineWeb-Edu in FP8 precision, completing training in approximately 6 hours on a single NVIDIA RTX 5090. It integrates recently published attention and normalization techniques into a single architecture to study how they interact during pretraining. The model is under active development, and the current checkpoint represents an intermediate training state.

Author / contact: @Kyokopom on X
Repository: KitsuVp/NeoLLM


Architecture

NeoLLM is a decoder-only transformer with the following configuration:

| Parameter | Value |
|---|---|
| Hidden size | 512 |
| Layers | 12 |
| Attention heads | 8 |
| KV heads (GQA) | 4 |
| Head dim | 64 |
| Intermediate size | 1536 |
| Vocabulary | Qwen3 tokenizer (64,402 tokens) |
| Context length | 512 tokens |
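
The configuration above can be written down as a small dataclass to check that the numbers agree with each other. This is a minimal sketch: `NeoLLMConfig` and its field names are hypothetical and are not taken from the repository, and the K/V width assumes a standard grouped-query attention layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLLMConfig:
    """Hypothetical container mirroring the configuration table (not the repository's class)."""
    hidden_size: int = 512
    num_layers: int = 12
    num_attention_heads: int = 8
    num_kv_heads: int = 4
    head_dim: int = 64
    intermediate_size: int = 1536
    vocab_size: int = 64_402
    context_length: int = 512

    def queries_per_kv_head(self) -> int:
        # In grouped-query attention, several query heads share one K/V head.
        assert self.num_attention_heads % self.num_kv_heads == 0
        return self.num_attention_heads // self.num_kv_heads

    def kv_width(self) -> int:
        # Output width of a K (or V) projection in a standard GQA layout.
        return self.num_kv_heads * self.head_dim


cfg = NeoLLMConfig()
# The query heads together span the full hidden size: 8 * 64 = 512.
assert cfg.num_attention_heads * cfg.head_dim == cfg.hidden_size
print(cfg.queries_per_kv_head(), cfg.kv_width())  # 2 256
```

With 8 query heads and 4 KV heads, each K/V head serves two query heads, so the K and V projections would be half as wide as the query projection under this assumption.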

Parameter breakdown

| Parameter bucket | Count |
|---|---|
| Total parameters | 116.22M (116,216,184) |
| Embedding parameters (tied) | 32.97M (32,973,824) |
| Non-embedding parameters | 83.24M (83,242,360) |
| Effective trainable parameters | 116.22M (116,216,184) |

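Weight tying is enabled: the input embedding matrix and the language-model head share the same parameters, so the embedding is counted once in the total. All 116.22M parameters are trainable; excluding the tied embedding leaves total − embed = 83.24M non-embedding parameters.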
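
The breakdown can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in this card; the variable names are illustrative.

```python
vocab_size = 64_402          # Qwen3 tokenizer size from the configuration
hidden_size = 512
total_params = 116_216_184   # total parameter count from the breakdown

# A tied embedding is one vocab_size x hidden_size matrix, counted once.
embedding_params = vocab_size * hidden_size
non_embedding_params = total_params - embedding_params

print(embedding_params)      # 32973824
print(non_embedding_params)  # 83242360
```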

Integrated techniques

NeoLLM combines architecture modules, optional auxiliary objectives, and training-time optimizer/stability components from the following papers.

Embedding and token representation

  • Learnable Multipliers (arXiv:2601.04890) — Adds per-row and per-column learnable scalar parameters to selected matrix layers and, when enabled, embeddings.
  • Leviathan (arXiv:2601.22040) — Optional continuous token embedding generator that can replace the discrete input lookup table.
  • KHRONOS (arXiv:2505.13315) — Kernel/basis reference used by the Leviathan continuous token generator implementation.
  • JTok / JTok-M (arXiv:2602.00800) — Optional token-indexed self-modulation surfaces over Leviathan coordinates.
  • Spelling Bee Embeddings (arXiv:2601.18030) — Augments token embeddings with character-level spelling information.
  • Token Embedding Manifold analysis (arXiv:2504.01002) — Reference motivation for treating token embeddings as structured objects rather than unconstrained lookup rows.
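
As an illustration of the first item, per-row and per-column multipliers rescale a matrix so that entry (i, j) becomes row_scale[i] * W[i][j] * col_scale[j]. The following is a minimal pure-Python sketch; the function name and the choice to initialize multipliers at 1 are assumptions, not details taken from the NeoLLM code.

```python
def apply_multipliers(weight, row_scale, col_scale):
    """Return the matrix with entries row_scale[i] * weight[i][j] * col_scale[j]."""
    if len(row_scale) != len(weight) or any(len(row) != len(col_scale) for row in weight):
        raise ValueError("scale vectors must match the matrix shape")
    return [
        [row_scale[i] * w * col_scale[j] for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Assumption: multipliers start at 1, so the layer initially equals the plain matrix.
assert apply_multipliers(weight, [1.0, 1.0], [1.0, 1.0, 1.0]) == weight

scaled = apply_multipliers(weight, [2.0, 1.0], [1.0, 0.5, 1.0])
print(scaled)  # [[2.0, 2.0, 6.0], [4.0, 2.5, 6.0]]
```

In a real layer the two scale vectors would be trainable parameters alongside the weight matrix.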

Attention, positions, and output projection

  • FAN (arXiv:2502.21309) — Fourier Analysis Networks. A portion of the projection channels are dedicated to periodic cosine/sine features.
  • MEA (arXiv:2601.19611) — Explicit Multi-head Attention. Adds small learnable interaction matrices between attention heads for K and V.
  • LUCID (arXiv:2602.10410) — Applies a learned lower-triangular preconditioner to V before attention, decorrelating value representations across positions.
  • Affine-Scaled Attention (arXiv:2602.23057) — Adds two learnable per-head scalars (α and β) to the softmax weights: [α·softmax(QKᵀ) + β]·V.
  • XSA (arXiv:2603.09078) — Exclusive Self Attention. After computing attention, removes the component of the output aligned with the token's own value vector.
  • Directional Routing (arXiv:2603.14923) — Each head learns K=4 directions in the output space; a learned router suppresses the attention output along each direction per input.
  • Gated Attention (arXiv:2505.06708) — A sigmoid gate is applied to the attention output before the output projection, introducing non-linearity and preventing attention sinks.
  • Momentum Attention (arXiv:2411.03884) — Modifies Q and K by subtracting a fraction of the previous position's Q and K values (causal first-difference).
  • Interleaved Head Attention / IHA (arXiv:2602.21371) — Builds pseudo-heads from learned cross-head mixtures to create multiple attention patterns per original head.
  • REPO (arXiv:2512.14391) — Context re-positioning module that learns contextual position coordinates above a configurable start layer.
  • GRAPE (arXiv:2512.07805) — Group representational position encoding used by the REPO-GRAPE positional path.
  • GOAT priors (arXiv:2601.15380) — Optional factorized attention log-prior channels inspired by trainable attention priors.
  • Hadamard output projection (arXiv:2603.08343) — Replaces dense attention output projection with a structured Hadamard transform plus lightweight scaling.
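
Two of the items above have short closed-form descriptions: Affine-Scaled Attention, [α·softmax(QKᵀ) + β]·V, and the XSA step that removes the part of the output aligned with the token's own value vector. The sketch below shows both for a single head in plain Python. It is an illustration under stated assumptions, not the NeoLLM implementation: the 1/√d score scaling, the causal mask, and applying β only at visible positions are assumptions, and all function names are hypothetical.

```python
import math


def causal_softmax(q, k):
    """Causal softmax of scaled dot products for one head; row t covers positions 0..t."""
    scale = 1.0 / math.sqrt(len(q[0]))
    rows = []
    for t, q_t in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(q_t, k[s])) for s in range(t + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def affine_scaled_attention(q, k, v, alpha, beta):
    """Compute [alpha * softmax(QK^T) + beta] V for one head with scalar alpha and beta."""
    out = []
    for row in causal_softmax(q, k):
        # Assumption: beta is added only at visible (causal) positions.
        mixed = [alpha * w + beta for w in row]
        out.append([sum(w * v[s][j] for s, w in enumerate(mixed)) for j in range(len(v[0]))])
    return out


def remove_self_value_component(out, v):
    """XSA-style step: subtract each output's projection onto that token's own value vector."""
    result = []
    for o_t, v_t in zip(out, v):
        norm_sq = sum(x * x for x in v_t)
        coeff = sum(a * b for a, b in zip(o_t, v_t)) / norm_sq if norm_sq > 0.0 else 0.0
        result.append([a - coeff * b for a, b in zip(o_t, v_t)])
    return result


# Toy inputs: 3 positions, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]

attended = affine_scaled_attention(q, k, v, alpha=0.9, beta=0.05)
exclusive = remove_self_value_component(attended, v)
```

With α = 1 and β = 0 the first function reduces to ordinary causal softmax attention, and after the second step every output row is orthogonal to its own value vector.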

Normalization, residual flow, and MLP

  • SeeDNorm (arXiv:2510.22777) — Applied to Q and K projections. Dynamically rescales normalization from the input's own statistics.
  • LayerNorm Scaling / LNS (arXiv:2502.05795) — Each layer's output is scaled by 1/√ℓ where ℓ is the layer index.
  • GPAS (arXiv:2506.22049) — Gradient-Preserving Activation Scaling for residual junctions.
  • PolyNorm (arXiv:2602.04902) — Replaces the standard MLP activation with normalized linear, quadratic, and cubic branches.
  • SimpleGPT (arXiv:2602.01212) — Second-order geometry-inspired normalization strategy applied inside MLP projections.
  • StackMemory / STACKTRANS (NeurIPS 2025) — Optional differentiable hidden-state stack between decoder layers.
  • Attention Residuals / AttnRes (arXiv:2603.15031) — Optional learned depth-wise aggregation over previous layer outputs or block summaries.
  • LAUREL (arXiv:2411.07501) — Optional learned augmented residual layer with residual-weight and low-rank variants.
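
Two entries in this list reduce to simple formulas: the LNS factor 1/√ℓ and PolyNorm's sum of normalized linear, quadratic, and cubic branches. The sketch below illustrates both on plain Python lists. The 1-based layer index, the use of RMS normalization for each branch, and the default branch weights are assumptions made for illustration; they are not taken from the NeoLLM code.

```python
import math


def lns_scale(layer_index):
    """LayerNorm Scaling factor 1/sqrt(l); assumes layers are numbered from 1."""
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    return 1.0 / math.sqrt(layer_index)


def rms_normalize(x, eps=1e-6):
    """Divide a vector by its root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def polynorm(x, weights=(1 / 3, 1 / 3, 1 / 3), bias=0.0):
    """Weighted sum of RMS-normalized x, x^2, and x^3 branches (hypothetical default weights)."""
    out = [bias] * len(x)
    for power, w in zip((1, 2, 3), weights):
        branch = rms_normalize([v ** power for v in x])
        out = [o + w * b for o, b in zip(out, branch)]
    return out


# Scaling factors shrink with depth across the 12 layers.
print([round(lns_scale(layer), 3) for layer in (1, 4, 12)])
print(polynorm([0.5, -1.0, 2.0]))
```

In a trained model the three branch weights and the bias would be learned.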

Training objectives and training-time regularizers

  • TWEO (arXiv:2511.23225) — Optional Transformers Without Extreme Outliers activation regularizer for FP8/low-bit-friendly training.
  • NITP (arXiv:2605.24956) — Optional Next Implicit Token Prediction auxiliary objective using shallow-layer implicit token targets and a cosine loss.
  • NextLat (arXiv:2511.05963) — Optional next-latent prediction objective using latent dynamics, Smooth L1 supervision, and frozen-head KL.

Optimizer and training stability

  • Conda (arXiv:2509.24218) — Column-Normalized Adam optimizer path used by the training script.
  • Cautious Weight Decay (arXiv:2510.12402) — Sign-selective weight decay variant used by the custom optimizer logic.
  • Correction of Decoupled Weight Decay (arXiv:2512.08217) — Adapts decoupled weight decay during learning-rate decay.
  • AdamHD (arXiv:2511.14721) — Decoupled Huber decay regularization reference used by the optimizer.
  • GradientStabilizer (arXiv:2502.17055) — Optional threshold-free gradient magnitude stabilizer.
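
To make "sign-selective weight decay" concrete, the sketch below applies decay only to coordinates where the optimizer update and the parameter share a sign. That rule is an assumption based on the one-line description above and has not been checked against the training script; the function name and the update convention (new = old − lr · update) are also illustrative.

```python
def cautious_weight_decay_step(params, updates, lr, weight_decay):
    """One step of p <- p - lr * (u + mask * weight_decay * p) on flat lists of floats.

    Assumed rule: mask is 1 where the update u and the parameter p have the same sign
    (the update already moves p toward zero), and 0 elsewhere.
    """
    new_params = []
    for p, u in zip(params, updates):
        decay = weight_decay * p if u * p > 0.0 else 0.0
        new_params.append(p - lr * (u + decay))
    return new_params


params = [1.0, -1.0]
updates = [0.5, 0.5]
stepped = cautious_weight_decay_step(params, updates, lr=0.1, weight_decay=0.1)
print(stepped)
```

In this toy case the first coordinate receives decay (update and parameter are both positive) and the second does not.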

Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens seen | ~1.54B (46,875 steps × batch 64 × length 512) |
| Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback |
| Optimizer | Conda (Column-Normalized Adam) |
| Learning rate | 6e-04 with linear warmup (10% of steps) |
| Weight decay | 0.1 |
| Training time | ~3h 51m |
| Hardware | NVIDIA RTX 5090 (single GPU) |
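
The token count and warmup length follow directly from the settings above. This is a small sketch with illustrative names; the behaviour after warmup is not specified in the table, so the function simply holds the peak rate as a placeholder.

```python
steps = 46_875
batch_size = 64
seq_len = 512
peak_lr = 6e-4

tokens_seen = steps * batch_size * seq_len
warmup_steps = steps // 10  # 10% of steps, rounded down

print(tokens_seen)   # 1536000000
print(warmup_steps)  # 4687


def warmup_lr(step):
    """Linear warmup to peak_lr over warmup_steps (step counted from 0)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Placeholder: the decay schedule after warmup is not given in this card.
    return peak_lr
```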

Training curve

| Step | Train Loss | Val Loss |
|---|---|---|
| 5,000 | 4.225 | 4.124 |
| 10,000 | 3.862 | 3.774 |
| 15,000 | 3.749 | 3.664 |
| 20,000 | 3.699 | 3.611 |
| 25,000 | 3.668 | 3.582 |
| 30,000 | 3.647 | 3.559 |
| 35,000 | 3.635 | 3.539 |
| 40,000 | 3.552 | 3.476 |
| 45,000 | 3.512 | 3.429 |
| 46,875 | | 3.422 |
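
Losses like these are often easier to compare as perplexities. The sketch below converts the validation losses through step 45,000, assuming they are mean cross-entropy values in nats per token, which this card does not state explicitly.

```python
import math

# Validation losses from the training curve, keyed by step.
val_loss = {
    5_000: 4.124, 10_000: 3.774, 15_000: 3.664, 20_000: 3.611, 25_000: 3.582,
    30_000: 3.559, 35_000: 3.539, 40_000: 3.476, 45_000: 3.429,
}


def perplexity(loss_nats):
    """Perplexity for a mean cross-entropy loss measured in nats per token."""
    return math.exp(loss_nats)


for step, loss in val_loss.items():
    print(f"{step:>6}  loss={loss:.3f}  ppl={perplexity(loss):.1f}")
```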

Limitations

  • Token budget — ~1.5 B tokens seen; below estimated optimum. Knowledge-intensive tasks will improve with more training.
  • Gradient spike at step 40k — A gradient spike reorganized the attention pattern in layer 9, which had previously captured long-range token correlations. A checkpoint from ~step 38k is expected to score better on aggregate benchmarks.
  • PolyNorm exclusivity — The quadratic branch has become partially redundant with the linear branch. This will be corrected in the next training run.
  • Base model only — Not instruction-tuned or aligned; purely a next-token-prediction base model.

References

All papers whose techniques are integrated into NeoLLM's architecture, training objective, or training stack:

| Area | Technique | Paper title | Reference |
|---|---|---|---|
| Embeddings | Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | arXiv:2601.04890 |
| Embeddings | Leviathan | A Separable Architecture for Continuous Token Representation in Language Models | arXiv:2601.22040 |
| Embeddings | KHRONOS | KHRONOS: a Kernel-Based Neural Architecture for Rapid, Resource-Efficient Scientific Computation | arXiv:2505.13315 |
| Embeddings | JTok / JTok-M | JTok: On Token Embedding as Another Axis of Scaling Law via Joint Token Self-Modulation | arXiv:2602.00800 |
| Embeddings | Spelling Bee | Spelling Bee Embeddings for Language Modeling | arXiv:2601.18030 |
| Embeddings | Token embedding analysis | Token Embeddings Violate the Manifold Hypothesis | arXiv:2504.01002 |
| Attention / positions | FAN | Fourier Analysis Networks | arXiv:2502.21309 |
| Attention / positions | MEA | Explicit Multi-head Attention for Inter-head Interaction in Large Language Models | arXiv:2601.19611 |
| Attention / positions | LUCID | Attention with Preconditioned Representations | arXiv:2602.10410 |
| Attention / positions | Affine-Scaled Attention | Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention | arXiv:2602.23057 |
| Attention / positions | XSA | Exclusive Self Attention | arXiv:2603.09078 |
| Attention / positions | Directional Routing | Directional Routing in Transformers | arXiv:2603.14923 |
| Attention / positions | Gated Attention | Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free | arXiv:2505.06708 |
| Attention / positions | Momentum Attention | Momentum Attention | arXiv:2411.03884 |
| Attention / positions | IHA | Interleaved Head Attention | arXiv:2602.21371 |
| Attention / positions | REPO | Language Models with Context Re-Positioning | arXiv:2512.14391 |
| Attention / positions | GRAPE | Group Representational Position Encoding | arXiv:2512.07805 |
| Attention / positions | GOAT priors | You Need Better Attention Priors | arXiv:2601.15380 |
| Attention / positions | Hadamard o_proj | Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers | arXiv:2603.08343 |
| Residual / normalization | SeeDNorm | Self-Rescaled Dynamic Normalization | arXiv:2510.22777 |
| Residual / normalization | LNS | The Curse of Depth in LLMs | arXiv:2502.05795 |
| Residual / normalization | GPAS | Gradient-Preserving Activation Scaling | arXiv:2506.22049 |
| Residual / normalization | PolyNorm | PolyNorm / PolyCom | arXiv:2602.04902 |
| Residual / normalization | SimpleGPT | SimpleGPT | arXiv:2602.01212 |
| Residual / normalization | StackMemory / STACKTRANS | Recursive Transformer: Boosting Reasoning Ability with State Stack | NeurIPS 2025 |
| Residual / normalization | Attention Residuals | Attention Residuals | arXiv:2603.15031 |
| Residual / normalization | LAUREL | LAUREL: Learned Augmented Residual Layer | arXiv:2411.07501 |
| Objectives | TWEO | Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies | arXiv:2511.23225 |
| Objectives | NITP | Next Implicit Token Prediction for LLM Pre-training | arXiv:2605.24956 |
| Objectives | NextLat | Next-Latent Prediction Transformers Learn Compact World Models | arXiv:2511.05963 |
| Optimizer / training | Conda | Column-Normalized Adam for Training Large Language Models Faster | arXiv:2509.24218 |
| Optimizer / training | CWD | Cautious Weight Decay | arXiv:2510.12402 |
| Optimizer / training | WD correction | Correction of Decoupled Weight Decay | arXiv:2512.08217 |
| Optimizer / training | AdamHD | AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training | arXiv:2511.14721 |
| Optimizer / training | GradientStabilizer | GradientStabilizer | arXiv:2502.17055 |

Citation

@misc{neollm2026,
  title  = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
  author = {KitsuVp},
  year   = {2026},
  url    = {https://huggingface.co/KitsuVp/NeoLLM}
}

Author

@Kyokopom on X


License

Apache 2.0
