---
language:
- vi
- en
---

# Description

This is a [Unigram](https://arxiv.org/pdf/1804.10959.pdf) tokenizer supporting both English and Vietnamese.

In addition to tokenization, this tokenizer also performs diacritic normalization for Vietnamese. For example: `hóa → hoá`, `hủy → huỷ`.
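
As a quick check of this behaviour (a minimal sketch, assuming the published tokenizer from the Usage section below), both spellings should yield the same token sequence:

```python
from transformers import AutoTokenizer

# use_fast=False is required for the normalization rules to take effect
# (see the note in the Usage section below).
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)

# "hóa" should be normalized to "hoá" before tokenization, so both
# spellings are expected to produce identical tokens.
print(tokenizer.tokenize("hóa chất"))
print(tokenizer.tokenize("hoá chất"))
```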

# Details

## Library used to train
https://github.com/google/sentencepiece

## Training Data
https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer

## Training script
```bash
./spm_train \
    --input=vien-corpus.txt \
    --model_prefix=vien \
    --vocab_size=64000 \
    --user_defined_symbols_file=user_defined_symbols.txt \
    --required_chars_file=required_chars.txt \
    --unk_surface="<unk>" \
    --byte_fallback=false \
    --split_by_unicode_script=true \
    --split_by_number=true \
    --split_digits=true \
    --normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```
`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt` and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo.

The training script should be run on a machine with 64GB of RAM. After training, we get two files: `vien.model` and `vien.vocab`.
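
To sanity-check the trained model, here is a minimal sketch using the `sentencepiece` Python package (assuming the package is installed and `vien.model` is in the current directory):

```python
import sentencepiece as spm

# Load the freshly trained Unigram model and encode a mixed
# English/Vietnamese sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="vien.model")
pieces = sp.encode("How are you? Thời tiết hôm nay đẹp quá.", out_type=str)
print(pieces)
```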

## Convert SPM model to HuggingFace tokenizer

Run the following Python script to convert the SPM model to a HuggingFace tokenizer.
```python
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>"
)
tokenizer.save_pretrained("assets/hf-tokenizer")
```
Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.
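
A quick way to verify the conversion (a sketch assuming the paths used in the script above) is to load the saved tokenizer back and tokenize a short sentence:

```python
from transformers import DebertaV2Tokenizer

# Reload the converted tokenizer from the output directory and run it
# on a short Vietnamese sentence.
tokenizer = DebertaV2Tokenizer.from_pretrained("assets/hf-tokenizer")
print(tokenizer.tokenize("Thời tiết hôm nay đẹp quá."))
```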

## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```

Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform normalization. (Note how, in the usage example above, `wóa` was normalized to `woá`.)
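
For comparison, a sketch of the fast-tokenizer path (per the note above, the diacritic normalization rules are not applied here, so `wóa` is not rewritten to `woá`):

```python
from transformers import AutoTokenizer

# Default use_fast=True: the custom normalization rules are skipped,
# so diacritics such as "wóa" are tokenized without being normalized.
fast_tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer")
print(fast_tokenizer.tokenize("Thời tiết hôm nay đẹp wóa"))
```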

# Contact information
For personal communication related to this project, please contact Loi Le Vu (levuloihust@gmail.com).