Is there an algorithm available?
#1
by
podarok - opened
Hi
Thanks for your work; it has given me a lot of excitement about the fundamentals.
I'm working on implementing a custom Ukrainian stemmer (Rust <> Python) and a context-aware tokenizer in order to pretrain highly optimized models from the ground up.
Your work gave me the right direction to investigate.
I'd appreciate it if you could share how specifically the data was pruned and how tokens were chosen as replacement candidates. I can see that tons of tokens could be improved even more, but I don't want to spend time reinventing the wheel.
Thanks for your work