NeoLLM

NeoLLM is a 116M-parameter decoder-only language model (116,216,184 parameters with tied embeddings) trained from scratch on FineWeb-Edu in FP8 precision. Training took approximately 3 hours 51 minutes on a single NVIDIA RTX 5090. It integrates a collection of recently published attention and normalization techniques into a single architecture, with the goal of studying how they interact during pretraining. The model is under active development, and the current checkpoint represents an intermediate training state.

Author / contact: @Kyokopom on X
Repository: KitsuVp/NeoLLM

Architecture

NeoLLM is a decoder-only transformer with the following configuration:

Parameter	Value
Hidden size	512
Layers	12
Attention heads	8
KV heads (GQA)	4
Head dim	64
Intermediate size	1536
Vocabulary	Qwen3 tokenizer (64,402 tokens)
Context length	512 tokens
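
The table values can be cross-checked against each other. The sketch below is only illustrative: the class name `NeoLLMConfig` and its field names are hypothetical and are not taken from the repository. Only the numbers come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLLMConfig:
    """Hypothetical container for the values in the configuration table."""
    hidden_size: int = 512
    num_layers: int = 12
    num_attention_heads: int = 8
    num_kv_heads: int = 4
    head_dim: int = 64
    intermediate_size: int = 1536
    vocab_size: int = 64_402
    context_length: int = 512

    def query_width(self) -> int:
        # Total width of all query heads.
        return self.num_attention_heads * self.head_dim

    def kv_width(self) -> int:
        # Total width of the key (or value) heads under GQA.
        return self.num_kv_heads * self.head_dim

    def queries_per_kv_head(self) -> int:
        # Number of query heads that share one key/value head.
        return self.num_attention_heads // self.num_kv_heads


cfg = NeoLLMConfig()
print(cfg.query_width(), cfg.kv_width(), cfg.queries_per_kv_head())
```

With these values the query heads span the full hidden size (8 × 64 = 512), while keys and values are half as wide because two query heads share each KV head.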

Parameter breakdown

Parameter bucket	Count
Total parameters	116.22M (116,216,184)
Embedding parameters (tied)	32.97M (32,973,824)
Non-embedding parameters	83.24M (83,242,360)
Effective trainable parameters	116.22M (116,216,184)

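Weight tying is enabled: the input embedding matrix and the language-model head share the same parameters, so the embedding matrix is counted only once. All 116.22M parameters are trainable; subtracting the tied embeddings (total − embed) leaves 83.24M non-embedding parameters.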
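
The breakdown can be reproduced with a few lines of arithmetic. The constant names are illustrative. The untied figure is a hypothetical comparison that assumes a separate, bias-free language-model head of the same shape as the embedding matrix.

```python
VOCAB_SIZE = 64_402          # Qwen3 tokenizer size from the configuration table
HIDDEN_SIZE = 512
TOTAL_PARAMS = 116_216_184   # reported total, with tied embeddings counted once

# One shared matrix serves as both input embedding and LM head.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
non_embedding_params = TOTAL_PARAMS - embedding_params

# Hypothetical: what the total would be if the LM head were a separate matrix.
untied_total = TOTAL_PARAMS + embedding_params

print(f"embedding:     {embedding_params:,}")
print(f"non-embedding: {non_embedding_params:,}")
print(f"untied total:  {untied_total:,}")
```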

Integrated techniques

NeoLLM combines architecture modules, optional auxiliary objectives, and training-time optimizer/stability components from the following papers.

Embedding and token representation

Learnable Multipliers (arXiv:2601.04890) — Adds per-row and per-column learnable scalar parameters to selected matrix layers and, when enabled, embeddings.
Leviathan (arXiv:2601.22040) — Optional continuous token embedding generator that can replace the discrete input lookup table.
KHRONOS (arXiv:2505.13315) — Kernel/basis reference used by the Leviathan continuous token generator implementation.
JTok / JTok-M (arXiv:2602.00800) — Optional token-indexed self-modulation surfaces over Leviathan coordinates.
Spelling Bee Embeddings (arXiv:2601.18030) — Augments token embeddings with character-level spelling information.
Token Embedding Manifold analysis (arXiv:2504.01002) — Reference motivation for treating token embeddings as structured objects rather than unconstrained lookup rows.
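
As an illustration of the Learnable Multipliers entry, the following pure-Python sketch scales a weight matrix by one scalar per row and one per column. It is a minimal reading of the one-line description above, not the NeoLLM or paper implementation. The function name and the plain-list representation are assumptions made for the example.

```python
def apply_multipliers(weight, row_scale, col_scale):
    """Return the effective matrix with entries row_scale[i] * weight[i][j] * col_scale[j].

    In a real model, `row_scale` and `col_scale` would be trainable vectors
    learned together with `weight`; here they are plain lists.
    """
    return [
        [row_scale[i] * w * col_scale[j] for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


weight = [[1.0, 2.0], [3.0, 4.0]]
effective = apply_multipliers(weight, row_scale=[2.0, 1.0], col_scale=[1.0, 0.5])
print(effective)
```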

Attention, positions, and output projection

FAN (arXiv:2502.21309) — Fourier Analysis Networks. A portion of the projection channels are dedicated to periodic cosine/sine features.
MEA (arXiv:2601.19611) — Explicit Multi-head Attention. Adds small learnable interaction matrices between attention heads for K and V.
LUCID (arXiv:2602.10410) — Applies a learned lower-triangular preconditioner to V before attention, decorrelating value representations across positions.
Affine-Scaled Attention (arXiv:2602.23057) — Adds two learnable per-head scalars (α and β) to the softmax weights: [α·softmax(QKᵀ) + β]·V.
XSA (arXiv:2603.09078) — Exclusive Self Attention. After computing attention, removes the component of the output aligned with the token's own value vector.
Directional Routing (arXiv:2603.14923) — Each head learns K=4 directions in the output space; a learned router suppresses the attention output along each direction per input.
Gated Attention (arXiv:2505.06708) — A sigmoid gate is applied to the attention output before the output projection, introducing non-linearity and preventing attention sinks.
Momentum Attention (arXiv:2411.03884) — Modifies Q and K by subtracting a fraction of the previous position's Q and K values (causal first-difference).
Interleaved Head Attention / IHA (arXiv:2602.21371) — Builds pseudo-heads from learned cross-head mixtures to create multiple attention patterns per original head.
REPO (arXiv:2512.14391) — Context re-positioning module that learns contextual position coordinates above a configurable start layer.
GRAPE (arXiv:2512.07805) — Group representational position encoding used by the REPO-GRAPE positional path.
GOAT priors (arXiv:2601.15380) — Optional factorized attention log-prior channels inspired by trainable attention priors.
Hadamard output projection (arXiv:2603.08343) — Replaces dense attention output projection with a structured Hadamard transform plus lightweight scaling.
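
Two of the entries above are concrete enough to sketch: Affine-Scaled Attention, using the formula [α·softmax(QKᵀ) + β]·V, and the XSA projection step. The code below is a single-head, pure-Python sketch and not the NeoLLM implementation. The 1/√d score scaling, the causal mask, applying β only to visible positions, and all function names are assumptions made for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def affine_scaled_causal_attention(q, k, v, alpha=1.0, beta=0.0):
    """Single-head causal attention with weights alpha * softmax(scores) + beta."""
    scale = 1.0 / math.sqrt(len(q[0]))  # assumption: standard 1/sqrt(d) scaling
    outputs = []
    for t in range(len(q)):
        # Causal mask: position t only sees positions 0..t.
        scores = [dot(q[t], k[s]) * scale for s in range(t + 1)]
        weights = [alpha * w + beta for w in softmax(scores)]
        outputs.append(
            [sum(w * v[s][d] for s, w in enumerate(weights)) for d in range(len(v[0]))]
        )
    return outputs


def exclude_self_component(outputs, v):
    """XSA-style step: subtract each output's projection onto the token's own value."""
    result = []
    for o, v_self in zip(outputs, v):
        norm_sq = dot(v_self, v_self)
        coeff = dot(o, v_self) / norm_sq if norm_sq > 0 else 0.0
        result.append([oi - coeff * vi for oi, vi in zip(o, v_self)])
    return result


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

attn = affine_scaled_causal_attention(q, k, v, alpha=0.9, beta=0.05)
xsa = exclude_self_component(attn, v)
```

With α = 1 and β = 0 the first function reduces to ordinary causal softmax attention. After the exclusion step, each output row is orthogonal to that token's own value vector.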

Normalization, residual flow, and MLP

SeeDNorm (arXiv:2510.22777) — Applied to Q and K projections. Dynamically rescales normalization from the input's own statistics.
LayerNorm Scaling / LNS (arXiv:2502.05795) — Each layer's output is scaled by 1/√ℓ where ℓ is the layer index.
GPAS (arXiv:2506.22049) — Gradient-Preserving Activation Scaling for residual junctions.
PolyNorm (arXiv:2602.04902) — Replaces the standard MLP activation with normalized linear, quadratic, and cubic branches.
SimpleGPT (arXiv:2602.01212) — Second-order geometry-inspired normalization strategy applied inside MLP projections.
StackMemory / STACKTRANS (NeurIPS 2025) — Optional differentiable hidden-state stack between decoder layers.
Attention Residuals / AttnRes (arXiv:2603.15031) — Optional learned depth-wise aggregation over previous layer outputs or block summaries.
LAUREL (arXiv:2411.07501) — Optional learned augmented residual layer with residual-weight and low-rank variants.
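
The LayerNorm Scaling factor 1/√ℓ is easy to tabulate for the 12 layers of this model. The sketch assumes ℓ is a 1-based layer index, since a 0-based index would make the first factor undefined. The function name is illustrative.

```python
import math


def lns_scale(layer_index: int) -> float:
    """Scaling factor 1/sqrt(l) for a 1-based layer index l (assumed convention)."""
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    return 1.0 / math.sqrt(layer_index)


NUM_LAYERS = 12  # from the configuration table
scales = [lns_scale(layer) for layer in range(1, NUM_LAYERS + 1)]
for layer, scale in enumerate(scales, start=1):
    print(f"layer {layer:2d}: scale = {scale:.4f}")
```

Under this convention the first layer is left unscaled and deeper layers are damped progressively, down to 1/√12 for the last layer.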

Training objectives and training-time regularizers

TWEO (arXiv:2511.23225) — Optional Transformers Without Extreme Outliers activation regularizer for FP8/low-bit-friendly training.
NITP (arXiv:2605.24956) — Optional Next Implicit Token Prediction auxiliary objective using shallow-layer implicit token targets and a cosine loss.
NextLat (arXiv:2511.05963) — Optional next-latent prediction objective using latent dynamics, Smooth L1 supervision, and frozen-head KL.

Optimizer and training stability

Conda (arXiv:2509.24218) — Column-Normalized Adam optimizer path used by the training script.
Cautious Weight Decay (arXiv:2510.12402) — Sign-selective weight decay variant used by the custom optimizer logic.
Correction of Decoupled Weight Decay (arXiv:2512.08217) — Adapts decoupled weight decay during learning-rate decay.
AdamHD (arXiv:2511.14721) — Decoupled Huber decay regularization reference used by the optimizer.
GradientStabilizer (arXiv:2502.17055) — Optional threshold-free gradient magnitude stabilizer.
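
To make "sign-selective weight decay" concrete, here is one possible reading as a sketch: decay is applied only to coordinates where the parameter and the optimizer update have the same sign. This is an assumption-laden illustration and is not taken from the NeoLLM training script. The function name, the plain-list parameters, and the handling of zero products (no decay) are all choices made for the example.

```python
def cautious_decay_step(params, updates, lr, weight_decay):
    """One step of p <- p - lr * (u + mask * weight_decay * p).

    `updates` is the optimizer's update direction u (for example an Adam-style
    step), and mask is 1 only where p and u share a sign (assumed rule).
    """
    new_params = []
    for p, u in zip(params, updates):
        decay = weight_decay * p if p * u > 0 else 0.0
        new_params.append(p - lr * (u + decay))
    return new_params


params = [1.0, -1.0]
updates = [0.5, 0.5]
stepped = cautious_decay_step(params, updates, lr=0.1, weight_decay=0.1)
print(stepped)
```

In this example the first coordinate is decayed because the update already shrinks it toward zero, while the second is left undecayed because the update pushes it away from zero.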

Training

Setting	Value
Dataset	FineWeb-Edu (sample-10BT)
Tokens seen	~1.54B (46,875 steps × batch 64 × length 512)
Precision	FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback
Optimizer	Conda (Column-Normalized Adam)
Learning rate	6e-04 with linear warmup (10 % of steps)
Weight decay	0.1
Training time	~3h 51m
Hardware	NVIDIA RTX 5090 (single GPU)
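
The token count and the warmup length follow directly from the table. In the sketch below, truncating the warmup step count to an integer is an assumption. The learning-rate function covers only the linear warmup phase, because the table does not specify what schedule follows it; holding the peak rate afterwards is a placeholder.

```python
STEPS = 46_875
BATCH_SIZE = 64
SEQ_LEN = 512
PEAK_LR = 6e-4
WARMUP_FRACTION = 0.10

tokens_seen = STEPS * BATCH_SIZE * SEQ_LEN
warmup_steps = int(STEPS * WARMUP_FRACTION)  # assumption: truncate to an integer


def warmup_lr(step: int) -> float:
    """Learning rate at a 0-based step: linear ramp to PEAK_LR, then held.

    Holding at PEAK_LR after warmup is a placeholder; the post-warmup
    schedule is not stated in the table.
    """
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    return PEAK_LR


print(f"tokens seen:  {tokens_seen:,}")
print(f"warmup steps: {warmup_steps:,}")
```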

Training curve

Step	Train Loss	Val Loss
5,000	4.225	4.124
10,000	3.862	3.774
15,000	3.749	3.664
20,000	3.699	3.611
25,000	3.668	3.582
30,000	3.647	3.559
35,000	3.635	3.539
40,000	3.552	3.476
45,000	3.512	3.429
46,875	—	3.422
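
Validation loss can be converted to perplexity for easier comparison with other models. The sketch assumes the reported losses are mean per-token cross-entropy in nats, which is the usual convention but is not stated in the table.

```python
import math

# Validation losses copied from the training-curve table (step -> loss).
VAL_LOSS = {
    5_000: 4.124,
    10_000: 3.774,
    15_000: 3.664,
    20_000: 3.611,
    25_000: 3.582,
    30_000: 3.559,
    35_000: 3.539,
    40_000: 3.476,
    45_000: 3.429,
    46_875: 3.422,
}

# Perplexity = exp(cross-entropy), assuming the loss is measured in nats.
perplexity = {step: math.exp(loss) for step, loss in VAL_LOSS.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>6,}: val loss {VAL_LOSS[step]:.3f} -> perplexity {ppl:.1f}")
```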

Limitations

Token budget — Only ~1.5B tokens were seen, which is below the estimated optimum for a model of this size. Performance on knowledge-intensive tasks is expected to improve with more training.
Gradient spike at step 40k — A gradient spike around step 40k reorganized the attention pattern in layer 9, which had previously captured long-range token correlations. A checkpoint from ~step 38k is expected to have better aggregate benchmark scores.
PolyNorm exclusivity — The quadratic branch has become partially redundant with the linear branch. This is planned to be corrected in the next training run.
Base model only — Not instruction-tuned or aligned; purely a next-token-prediction base model.

References

All papers whose techniques are integrated into NeoLLM's architecture, training objective, or training stack:

Area	Technique	Paper title	Reference
Embeddings	Learnable Multipliers	Freeing the Scale of Language Model Matrix Layers	arXiv:2601.04890
Embeddings	Leviathan	A Separable Architecture for Continuous Token Representation in Language Models	arXiv:2601.22040
Embeddings	KHRONOS	KHRONOS: a Kernel-Based Neural Architecture for Rapid, Resource-Efficient Scientific Computation	arXiv:2505.13315
Embeddings	JTok / JTok-M	JTok: On Token Embedding as Another Axis of Scaling Law via Joint Token Self-Modulation	arXiv:2602.00800
Embeddings	Spelling Bee	Spelling Bee Embeddings for Language Modeling	arXiv:2601.18030
Embeddings	Token embedding analysis	Token Embeddings Violate the Manifold Hypothesis	arXiv:2504.01002
Attention / positions	FAN	Fourier Analysis Networks	arXiv:2502.21309
Attention / positions	MEA	Explicit Multi-head Attention for Inter-head Interaction in Large Language Models	arXiv:2601.19611
Attention / positions	LUCID	Attention with Preconditioned Representations	arXiv:2602.10410
Attention / positions	Affine-Scaled Attention	Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention	arXiv:2602.23057
Attention / positions	XSA	Exclusive Self Attention	arXiv:2603.09078
Attention / positions	Directional Routing	Directional Routing in Transformers	arXiv:2603.14923
Attention / positions	Gated Attention	Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free	arXiv:2505.06708
Attention / positions	Momentum Attention	Momentum Attention	arXiv:2411.03884
Attention / positions	IHA	Interleaved Head Attention	arXiv:2602.21371
Attention / positions	REPO	Language Models with Context Re-Positioning	arXiv:2512.14391
Attention / positions	GRAPE	Group Representational Position Encoding	arXiv:2512.07805
Attention / positions	GOAT priors	You Need Better Attention Priors	arXiv:2601.15380
Attention / positions	Hadamard o_proj	Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers	arXiv:2603.08343
Residual / normalization	SeeDNorm	Self-Rescaled Dynamic Normalization	arXiv:2510.22777
Residual / normalization	LNS	The Curse of Depth in LLMs	arXiv:2502.05795
Residual / normalization	GPAS	Gradient-Preserving Activation Scaling	arXiv:2506.22049
Residual / normalization	PolyNorm	PolyNorm / PolyCom	arXiv:2602.04902
Residual / normalization	SimpleGPT	SimpleGPT	arXiv:2602.01212
Residual / normalization	StackMemory / STACKTRANS	Recursive Transformer: Boosting Reasoning Ability with State Stack	NeurIPS 2025
Residual / normalization	Attention Residuals	Attention Residuals	arXiv:2603.15031
Residual / normalization	LAUREL	LAUREL: Learned Augmented Residual Layer	arXiv:2411.07501
Objectives	TWEO	Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies	arXiv:2511.23225
Objectives	NITP	Next Implicit Token Prediction for LLM Pre-training	arXiv:2605.24956
Objectives	NextLat	Next-Latent Prediction Transformers Learn Compact World Models	arXiv:2511.05963
Optimizer / training	Conda	Column-Normalized Adam for Training Large Language Models Faster	arXiv:2509.24218
Optimizer / training	CWD	Cautious Weight Decay	arXiv:2510.12402
Optimizer / training	WD correction	Correction of Decoupled Weight Decay	arXiv:2512.08217
Optimizer / training	AdamHD	AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training	arXiv:2511.14721
Optimizer / training	GradientStabilizer	GradientStabilizer	arXiv:2502.17055

Citation

@misc{neollm2026,
  title  = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
  author = {KitsuVp},
  year   = {2026},
  url    = {https://huggingface.co/KitsuVp/NeoLLM}
}

Author

@Kyokopom on X

License

Apache 2.0
