Growing Transformers β€” Model unfrozen 1_9 (β‰ˆ247M)

This repository contains growing-transformers-model-unfrozen-1-9-247m, a constructively grown (layer-wise expanded) model from the papers:

πŸ“š Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) -

πŸ“š Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) -

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer trained with a constructive, layer-wise growth procedure (layers are added and trained progressively while previously trained parts are frozen). Unlike the β€œfrozen-substrate” variants, this model starts with standard trainable token embeddings and is trained monolithically for the first stage (layers 1–3 + embeddings all trainable). After that first stage converges, the embedding layer and the first 3 layers are frozen, then layers 4–6 are trained, frozen, and finally layers 7–9 are trained. This model exists to isolate the effect of the growth procedure without a frozen embedding substrate at initialization.


Main comparison (why this repo exists)

This model is intended to be compared primarily against:

  • Bochkov/growing-transformers-model-16-bit-1-9-181m (constructive growth, frozen 16-bit embedding)
  • Bochkov/growing-transformers-model-unicode-1-9-247m (constructive growth, frozen β€œvisual UNICODE” embedding)

All these models share the same Transformer stack design in the controlled study (same depth and same d_model / n_head), but differ in the embedding substrate and what is frozen when.


Important: parameter count difference (why this model is larger than 16-bit)

This checkpoint is larger (β‰ˆ247M parameters) than the 16-bit constructive model (β‰ˆ181M) primarily because it uses a full-size embedding matrix (standard learned embeddings at d_model, with vocab_size = 65,536), whereas the 16-bit model uses a tiny n_embed = 16 embedding that is deterministically expanded to d_model.

Even with the same Transformer blocks, the embedding matrix dominates a large chunk of parameters when vocab_size is large and n_embed = d_model.

Note: exact parameter totals can differ slightly across implementations (e.g., weight tying vs separate LM head). The key reason for the size gap is the embedding dimensionality (full d_model vs 16).
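A quick back-of-the-envelope check (a sketch of the arithmetic, not the exact accounting from the training code) shows where the gap comes from:

```python
d_model = 1024
vocab_size = 65_536

# Full-size learned embedding (this model): one d_model-dim row per token.
full_embedding = vocab_size * d_model    # 67,108,864 params (~67.1M)

# Tiny n_embed = 16 table (the 16-bit variant); the expansion to d_model
# is deterministic, so only the 16-dim table counts as parameters.
tiny_embedding = vocab_size * 16         # 1,048,576 params (~1.0M)

gap = full_embedding - tiny_embedding
print(f"embedding gap β‰ˆ {gap / 1e6:.1f}M parameters")  # β‰ˆ 66.1M
```

That ~66M embedding gap matches the β‰ˆ247M vs β‰ˆ181M difference between the two checkpoints almost exactly.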


Training method (constructive growth schedule)

This model was trained in three stages (controlled study setup):

  1. Stage A (classic / unfrozen start): train token embeddings + layers 1–3 (all trainable)
  2. Stage B: freeze embeddings + layers 1–3, add layers 4–6, train only layers 4–6
  3. Stage C: freeze embeddings + layers 1–6, add layers 7–9, train only layers 7–9

In short: the training is constructive and iterative, but Stage A starts from trainable embeddings rather than a frozen substrate.
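The stage schedule above can be sketched as a PyTorch freezing helper. This is an illustrative toy, not the repository's actual training code; the class and function names are hypothetical:

```python
import torch.nn as nn

class TinyStack(nn.Module):
    """Hypothetical 9-block stack standing in for the real model."""
    def __init__(self, n_layers=9, d_model=1024, vocab_size=65_536):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=32, batch_first=True)
            for _ in range(n_layers)
        )

def set_stage(model, stage):
    """Stage A trains embed + blocks 0-2; B trains only 3-5; C trains only 6-8."""
    trainable = {"A": range(0, 3), "B": range(3, 6), "C": range(6, 9)}[stage]
    model.embed.weight.requires_grad = (stage == "A")
    for i, block in enumerate(model.blocks):
        for p in block.parameters():
            p.requires_grad = i in trainable
```

Calling `set_stage(model, "B")` freezes the embedding and blocks 0-2 while leaving blocks 3-5 trainable, mirroring the Stage B description above.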


Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • Context length used in training: 1024
  • Embedding: standard learned embedding matrix (trainable in Stage A, frozen afterwards)
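For orientation, the listed hyperparameters imply the per-head geometry below. This assumes a standard multi-head attention layout (Q, K, V, and output projections, biases omitted); the exact projection shapes in the released code may differ:

```python
d_model, n_head, n_layers, ctx_len = 1024, 32, 9, 1024

assert d_model % n_head == 0
head_dim = d_model // n_head             # 32: each head attends in a 32-dim subspace
attn_params = 4 * d_model * d_model      # Q, K, V, output projections per layer
print(head_dim, attn_params)             # 32 4194304
```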

Tokenizer

This model uses the BVV tokenizer. Canonical tokenizer repo:

Reproducibility note: even when using the canonical tokenizer, this repo includes embedding artifacts specific to this model, so for exact reproduction it is recommended to load the tokenizer from this model repo.


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-unfrozen-1-9-247m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/growing-transformers-model-unfrozen-1-9-247m",
    trust_remote_code=True,
).to('cuda')

# Greedy free-form generation
inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')
outputs = model.generate(
    inputs,
    max_new_tokens=50,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
# Write a short poem about the ocean. The poem was written by the author of the poem The Lord of the Rings, and was published in

# Greedy question answering
inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')
outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
# Question: What is the capital of India?
# Answer:San Francisco de Pa

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}