Layer sharing?

#4
by Datdanboi25 - opened

If this model uses layer sharing, does that mean the inference compute cost is closer to that of a 300M model?

FrontiersMind org
edited 2 days ago

No. Our model and SmolLM2 135M have roughly the same effective layer count: SmolLM2 135M has around 30 layers, while ours is effectively 16×2 (16 shared layers, each applied twice, for 32 layer passes).

The main difference is that our model is slightly wider, so inference cost will be a bit higher. The difference is very small, though, and nowhere near the compute cost of a typical 300M dense model.
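To make the parameters-versus-compute distinction concrete, here is a rough back-of-the-envelope sketch. It is not our actual profiling code. It assumes the common rule of thumb of about 12·d² weights per transformer block (it ignores embeddings, attention-score FLOPs, and grouped-query attention), and both widths below are illustrative placeholders rather than published configs. Only the layer counts (around 30 versus 16×2) come from the reply above.

```python
def block_params(width: int) -> int:
    """Rule-of-thumb weights in one transformer block: ~4*d^2 attention + ~8*d^2 MLP."""
    return 12 * width * width


def layer_stack_cost(width: int, unique_layers: int, passes_per_layer: int):
    """Return (stored parameters, approximate forward FLOPs per token) for the layer stack."""
    # Shared layers are stored once, no matter how many times they are applied.
    stored_params = unique_layers * block_params(width)
    # Compute scales with the number of layer passes, not with unique layers.
    effective_layers = unique_layers * passes_per_layer
    # Each weight used in a matmul costs ~2 FLOPs (multiply + add) per token.
    flops_per_token = 2 * effective_layers * block_params(width)
    return stored_params, flops_per_token


# Placeholder widths: assumptions for illustration only.
BASELINE_WIDTH = 576  # hypothetical width for a ~30-layer unshared model
SHARED_WIDTH = 600    # hypothetical "slightly wider" model with 16 layers applied twice

baseline_params, baseline_flops = layer_stack_cost(BASELINE_WIDTH, unique_layers=30, passes_per_layer=1)
shared_params, shared_flops = layer_stack_cost(SHARED_WIDTH, unique_layers=16, passes_per_layer=2)

print(f"baseline: {baseline_params:,} stored params, {baseline_flops:,} FLOPs/token")
print(f"shared:   {shared_params:,} stored params, {shared_flops:,} FLOPs/token")
print(f"compute ratio (shared / baseline): {shared_flops / baseline_flops:.2f}")
```

Under these assumptions, sharing reduces the stored weights in the layer stack, while compute follows the 32 layer passes and the width. That is why the cost lands near a ~30-layer model of similar width rather than a 300M dense model.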

Hmm, interesting. I like the factorized tokenizer too. Were there any ablations done on it?

FrontiersMind org

Yes, we did run ablations on the factorized tokenizer. We’re publishing a technical blog post next week that goes into detail on those results, along with several other innovations we introduced.
