---
language:
- vi
- en
---

# Description

This is a [Unigram](https://arxiv.org/pdf/1804.10959.pdf) tokenizer supporting both English and Vietnamese.

In addition to tokenization, this tokenizer also performs diacritic normalization for Vietnamese. For example: `hóa → hoá`, `hủy → huỷ`.
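
As a quick check of this behaviour (a minimal sketch, assuming the published tokenizer from the Usage section below), both spellings should yield the same token sequence:

```python
from transformers import AutoTokenizer

# use_fast=False is required for the normalization rules to take effect
# (see the note in the Usage section below).
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)

# "hóa" should be normalized to "hoá" before tokenization, so both
# spellings are expected to produce identical tokens.
print(tokenizer.tokenize("hóa chất"))
print(tokenizer.tokenize("hoá chất"))
```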

# Details

## Library used to train
https://github.com/google/sentencepiece

## Training Data
https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer

## Training script
```bash
./spm_train \
    --input=vien-corpus.txt \
    --model_prefix=vien \
    --vocab_size=64000 \
    --user_defined_symbols_file=user_defined_symbols.txt \
    --required_chars_file=required_chars.txt \
    --unk_surface="<unk>" \
    --byte_fallback=false \
    --split_by_unicode_script=true \
    --split_by_number=true \
    --split_digits=true \
    --normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```
`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt` and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo.

The training script should be run on a machine with 64GB of RAM. After training, we get two files: `vien.model` and `vien.vocab`.
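
To sanity-check the trained model, here is a minimal sketch using the `sentencepiece` Python package (assuming the package is installed and `vien.model` is in the current directory):

```python
import sentencepiece as spm

# Load the freshly trained Unigram model and encode a mixed
# English/Vietnamese sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="vien.model")
pieces = sp.encode("How are you? Thời tiết hôm nay đẹp quá.", out_type=str)
print(pieces)
```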

## Convert SPM model to HuggingFace tokenizer

Run the following Python script to convert the SPM model to a HuggingFace tokenizer.
```python
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>"
)
tokenizer.save_pretrained("assets/hf-tokenizer")
```
Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.
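
A quick way to verify the conversion (a sketch assuming the paths used in the script above) is to load the saved tokenizer back and tokenize a short sentence:

```python
from transformers import DebertaV2Tokenizer

# Reload the converted tokenizer from the output directory and run it
# on a short Vietnamese sentence.
tokenizer = DebertaV2Tokenizer.from_pretrained("assets/hf-tokenizer")
print(tokenizer.tokenize("Thời tiết hôm nay đẹp quá."))
```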

## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```

Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform normalization. (Note how, in the usage example above, `wóa` was normalized to `woá`.)
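
For comparison, a sketch of the fast-tokenizer path (per the note above, the diacritic normalization rules are not applied here, so `wóa` is not rewritten to `woá`):

```python
from transformers import AutoTokenizer

# Default use_fast=True: the custom normalization rules are skipped,
# so diacritics such as "wóa" are tokenized without being normalized.
fast_tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer")
print(fast_tokenizer.tokenize("Thời tiết hôm nay đẹp wóa"))
```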

# Contact information
For personal communication related to this project, please contact Loi Le Vu (levuloihust@gmail.com).