Growing Transformers – Model unfrozen 1_9 (≈247M)
This repository contains growing-transformers-model-unfrozen-1-9-247m, a constructively grown (layer-wise expanded) model from the papers cited below.
It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study
Code:
https://github.com/AVBochkov/PGT
What this model is (in one paragraph)
This is a 9-layer decoder-only Transformer trained with a constructive, layer-wise growth procedure (layers are added and trained progressively while previously trained parts are frozen). Unlike the "frozen-substrate" variants, this model starts with standard trainable token embeddings and is trained monolithically for the first stage (layers 1–3 + embeddings all trainable). After that first stage converges, the embedding layer and the first 3 layers are frozen, then layers 4–6 are trained and frozen, and finally layers 7–9 are trained. This model exists to isolate the effect of the growth procedure without a frozen embedding substrate at initialization.
Main comparison (why this repo exists)
This model is intended to be compared primarily against:
- Bochkov/growing-transformers-model-16-bit-1-9-181m (constructive growth, frozen 16-bit embedding)
- Bochkov/growing-transformers-model-unicode-1-9-247m (constructive growth, frozen "visual Unicode" embedding)
All these models share the same Transformer stack design in the controlled study (same depth and same d_model / n_head), but differ in the embedding substrate and what is frozen when.
Important: parameter count difference (why this model is larger than 16-bit)
This checkpoint is larger (≈247M parameters) than the 16-bit constructive model (≈181M) primarily because it uses a full-size embedding matrix (standard learned embeddings at d_model, with vocab_size = 65,536), whereas the 16-bit model uses a tiny n_embed = 16 embedding that is deterministically expanded to d_model.
Even with the same Transformer blocks, the embedding matrix dominates a large chunk of parameters when vocab_size is large and n_embed = d_model.
Note: exact parameter totals can differ slightly across implementations (e.g., weight tying vs. a separate LM head). The key reason for the size gap is the embedding dimensionality (full d_model vs. 16).
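A quick back-of-the-envelope check makes the gap concrete (a sketch; exact totals depend on the implementation details noted above):

```python
vocab_size = 65_536
d_model = 1024      # this model's full embedding width
n_embed = 16        # the 16-bit model's stored embedding width

full_embed = vocab_size * d_model   # 67,108,864 ≈ 67.1M parameters
tiny_embed = vocab_size * n_embed   #  1,048,576 ≈ 1.0M parameters
gap = full_embed - tiny_embed       # 66,060,288 ≈ 66M

print(f"gap ≈ {gap / 1e6:.0f}M")    # ≈ 66M, matching the 247M - 181M difference
```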
Training method (constructive growth schedule)
This model was trained in three stages (controlled study setup):
- Stage A (classic / unfrozen start): train token embeddings + layers 1–3 (all trainable)
- Stage B: freeze embeddings + layers 1–3, add layers 4–6, train only layers 4–6
- Stage C: freeze embeddings + layers 1–6, add layers 7–9, train only layers 7–9
So: the training is constructive / iterative, but embeddings are not frozen in Stage A.
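The freeze schedule above can be sketched with PyTorch's requires_grad flags. This is an illustrative toy, not the repo's training code: the Linear layers stand in for the real Transformer blocks, and set_stage is a hypothetical helper.

```python
import torch.nn as nn

d_model, n_layers = 1024, 9
emb = nn.Embedding(1000, d_model)  # toy vocab size, for illustration only
blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))

# Which block indices (0-based) are trainable in each stage.
STAGES = {"A": range(0, 3), "B": range(3, 6), "C": range(6, 9)}

def set_stage(stage: str) -> None:
    """Stage A trains embeddings + layers 1-3; B and C freeze everything earlier."""
    emb.weight.requires_grad_(stage == "A")
    for i, blk in enumerate(blocks):
        for p in blk.parameters():
            p.requires_grad_(i in STAGES[stage])

set_stage("B")  # now only layers 4-6 (indices 3-5) receive gradients
```

The per-stage optimizer would then be built over only the parameters with requires_grad=True.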
Model architecture (controlled study)
- Type: decoder-only Transformer (GPT-like)
- Layers: 9
- Hidden size: d_model = 1024
- Heads: n_head = 32
- Context length used in training: 1024
- Embedding: standard learned embedding matrix (trainable in Stage A, frozen afterwards)
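These hyperparameters roughly reproduce the ≈247M total, assuming GPT-2-style blocks (4·d² for the attention projections, 8·d² for a 4x MLP, ignoring biases and LayerNorms) and an untied LM head. Both are assumptions for this sketch, not confirmed details of the repo:

```python
vocab_size, d_model, n_layers = 65_536, 1024, 9

per_block = 12 * d_model ** 2              # 4*d^2 (attn) + 8*d^2 (MLP) ≈ 12.6M
stack = n_layers * per_block               # ≈ 113.2M
embed = vocab_size * d_model               # ≈ 67.1M
lm_head = vocab_size * d_model             # assumed untied: another ≈ 67.1M

total = stack + embed + lm_head
print(f"≈ {total / 1e6:.0f}M parameters")  # ≈ 247M
```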
Tokenizer
This model uses the BVV tokenizer. Canonical tokenizer repo:
- https://huggingface.co/Bochkov/bvv241-2-3
(collection: https://huggingface.co/collections/Bochkov/tokenizers)
Reproducibility note: even with the same tokenizer, this repo also includes embedding artifacts specific to this model, so for exact reproduction it is recommended to load the tokenizer from this model repo.
How to use (Transformers)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-unfrozen-1-9-247m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/growing-transformers-model-unfrozen-1-9-247m", trust_remote_code=True).to('cuda')
inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')
outputs = model.generate(
inputs,
max_new_tokens=50,
do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Write a short poem about the ocean. The poem was written by the author of the poem The Lord of the Rings, and was published in
inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')
outputs = model.generate(
inputs,
max_new_tokens=10,
do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Question: What is the capital of India?
#Answer:San Francisco de Pa
🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
@article{bochkov2025emergent,
title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
author={Andrey Bochkov},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=Odh8IynO1o},
note={}
}
@misc{bochkov2025growingtransformersmodularcomposition,
title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
author={A. Bochkov},
year={2025},
eprint={2507.07129},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.07129},
}