🔄 In a Training Loop

John Smith PRO

John6666

John6666cat

Recent Activity

reacted to RDTvlokip's post with 👍 about 7 hours ago

I finally changed the architecture of my 15M French LLM. It worked. Then I almost fooled myself about how much, and catching that was the real win.

After proving last time that architecture is a threshold, not a lever, I got stubborn: could I change how the model learns? Four honest attempts: Lion, a sharper AdamW β2, multi-token prediction, LayerScale. Four failures. The bottleneck wasn't the learning rule either.

So I changed the shape of the computation instead: loop the same transformer blocks 4×, deeper reasoning, zero added parameters. It beat the baseline on perplexity, the first thing in the whole project to move that number. Then I added my own twist: let each token decide how deep to think, halting on its own entropy.

My first evaluation was spectacular. Coherence up 65%. Hallucinated names down 62%. It was noise. Eight prompts, one seed. I re-ran on 50 prompts × 200 tokens and watched the gains shrink to "modest", and on out-of-domain prompts recurrence actually made things worse. No universal winner. And none of it is new: it's Adaptive Computation Time (2016), the Universal Transformer (2018), and LoopViT (2026), recombined and measured honestly.

The real lesson: A number from 8 prompts is a rumor. The eval harness that kills your own best result is worth more than the result it kills. Cite your lineage. Stay preliminary until multiple seeds say otherwise.

The three models are live. The write-up is honest about every caveat 👇
🔗 https://huggingface.co/blog/RDTvlokip/teaching-a-15m-french-llm-to-think-deeper
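The "loop the same blocks, halt on entropy" idea in the post above can be illustrated with a toy sketch. This is not the author's implementation: the function names, the entropy threshold of 0.5 nats, the check-before-each-pass ordering, and the `sharpen` stand-in for a shared transformer block are all assumptions made for illustration. Only the cap of 4 loops comes from the post.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def looped_forward(logits, block, max_loops=4, threshold=0.5):
    """Reapply the same `block` (shared weights, so no added parameters)
    until the token's predictive entropy falls below `threshold` or
    `max_loops` passes have run. Returns (logits, loops_used).
    The threshold value is a hypothetical choice, not from the post."""
    loops_used = 0
    for _ in range(max_loops):
        if entropy(softmax(logits)) < threshold:
            break  # this token is already confident: stop "thinking"
        logits = block(logits)
        loops_used += 1
    return logits, loops_used

def sharpen(logits):
    # Hypothetical stand-in for one pass through the shared blocks:
    # it simply makes the distribution peakier on every pass.
    return [2.0 * x for x in logits]

if __name__ == "__main__":
    for name, tok in [("confident", [8.0, 0.0, 0.0]),
                      ("borderline", [2.0, 0.0, 0.0]),
                      ("uncertain", [0.2, 0.1, 0.0])]:
        _, used = looped_forward(tok, sharpen)
        print(name, "loops used:", used)
```

In this toy setup a token that starts out confident uses no extra passes, while an uncertain one runs to the 4-loop cap, which is the per-token depth behaviour the post describes.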

reacted to Quazim0t0's post with 🔥 about 7 hours ago

Created a research language model whose channel-mixing block is not an MLP. It is a differentiable Neighbour-Sensing fungal-colony-growth model: each token is expanded into a colony of hyphal tips that grow in a bounded latent region, sense a shared density field, and steer their own growth. The "MLP" is replaced by a few differentiable steps of colony growth, read back out into the hidden state.
https://huggingface.co/Quazim0t0/Mycel-LM-79M

Also the original SpikeWhale project, the one that sparked all the other SpikeWhale-related projects. Every spiking primitive here is hand-written in plain PyTorch: the leaky integrate-and-fire (LIF) neuron dynamics, the fast-sigmoid surrogate gradient, and the backprop-through-time training loop. No snntorch, no spikingjelly, no norse, no bindsnet: the network is a genuine from-scratch SNN.
https://huggingface.co/Quazim0t0/SpikeWhale-SNN-216M
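For readers unfamiliar with the two spiking primitives named above, here is a minimal dependency-free sketch of LIF dynamics and the fast-sigmoid surrogate gradient. It is not taken from the SpikeWhale code (which is written in PyTorch): the decay `beta`, the threshold, the reset-by-subtraction rule, and the surrogate `slope` are all assumed values chosen for illustration.

```python
def lif_step(v, x, beta=0.9, threshold=1.0):
    """One leaky integrate-and-fire update.
    v: membrane potential, x: input current.
    beta (leak factor) and reset-by-subtraction are assumptions."""
    v = beta * v + x                      # leak, then integrate the input
    spike = 1.0 if v >= threshold else 0.0
    v = v - spike * threshold             # soft reset after a spike
    return v, spike

def fast_sigmoid_grad(v, threshold=1.0, slope=25.0):
    """Surrogate for d(spike)/dv. The true derivative of the step function
    is zero almost everywhere, so the backward pass uses the derivative of
    the fast sigmoid u / (1 + slope*|u|), which is 1 / (1 + slope*|u|)**2,
    with u = v - threshold."""
    u = v - threshold
    return 1.0 / (1.0 + slope * abs(u)) ** 2

def run_lif(inputs, beta=0.9, threshold=1.0):
    # Unroll the neuron over time, as backprop-through-time would.
    v, spikes = 0.0, []
    for x in inputs:
        v, s = lif_step(v, x, beta, threshold)
        spikes.append(s)
    return spikes

if __name__ == "__main__":
    print(run_lif([0.3] * 5))
    print(fast_sigmoid_grad(1.0), fast_sigmoid_grad(1.2))
```

With a constant input of 0.3 the membrane potential climbs for three steps, crosses the threshold on the fourth, and resets. The surrogate gradient peaks at the threshold and decays on either side, which is what lets gradients flow through the otherwise non-differentiable spike.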

reacted to SeaWolf-AI's post with 👀 about 7 hours ago

🚀 Adding a GPU without building one

AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere: how efficiently you use the GPUs you already have. Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU, the effect of plugging in one more "virtual GPU."

VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):
Qwen3.5-35B-A3B (MoE): 25.7 → 601 tok/s (23.4×)
Darwin-36B-Opus (in-house MoE): 25.0 → 280.8 tok/s (11.2×)
10,000+ tok/s peak aggregate under concurrency

The key: it's reproducible: model + serving shipped as one container.
docker pull vidraft/qwen35-vkae:601

Don't take our word for it; run it yourself. The mechanism will be released as a paper.

🏆 Leaderboard & demo 👉 https://huggingface.co/spaces/VIDraft/vkae
Articles 👉 https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard
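The speedup figures quoted in the post can be checked with a few lines of arithmetic. The throughput numbers below are the ones stated in the post; the helper names and the cost-per-token relation (which assumes the GPU's hourly price stays fixed, so cost per token scales as the inverse of throughput) are assumptions added here for illustration.

```python
def speedup(baseline_tok_s, accelerated_tok_s):
    # Ratio of accelerated to baseline throughput.
    return accelerated_tok_s / baseline_tok_s

def relative_cost_per_token(baseline_tok_s, accelerated_tok_s):
    # Assumption: same GPU at the same hourly price, so the cost of one
    # token is proportional to 1 / throughput.
    return baseline_tok_s / accelerated_tok_s

# Throughput figures (tok/s) as quoted in the post.
reported = {
    "Qwen3.5-35B-A3B": (25.7, 601.0),
    "Darwin-36B-Opus": (25.0, 280.8),
}

if __name__ == "__main__":
    for model, (base, fast) in reported.items():
        print(model,
              f"speedup {speedup(base, fast):.1f}x,",
              f"cost per token {relative_cost_per_token(base, fast):.1%} of baseline")
```

The ratios round to the 23.4× and 11.2× stated in the post. Under the fixed-price assumption, that corresponds to per-token cost falling to roughly 4% and 9% of the baseline.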
