# T5 Zh→En (OPUS-100)
This model is a T5-style encoder–decoder Transformer trained for Chinese → English translation on OPUS-100 (500,000 sentence pairs).
## Dataset
- Helsinki-NLP/opus-100 (en-zh)
- Training size: 500,000 sentence pairs
## Model Architecture
- d_model = 512
- d_ff = 2048
- 4 encoder layers
- 4 decoder layers
- num_heads = 8
- max sequence length = 128
- Cosine learning rate scheduler
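For reference, these hyperparameters map onto a Hugging Face `T5Config` roughly as follows. This is a sketch, not the repository's actual config: fields not listed on this card (e.g. `d_kv`, dropout, relative-attention settings) are left at library defaults, which may differ from the trained model.

```python
from transformers import T5Config

# Hyperparameters from the card mapped onto a transformers T5Config.
# Unlisted fields (d_kv, dropout_rate, ...) are library defaults, not
# confirmed by this card.
config = T5Config(
    vocab_size=16000,       # matches the SentencePiece vocabulary below
    d_model=512,
    d_ff=2048,
    num_layers=4,           # encoder layers
    num_decoder_layers=4,
    num_heads=8,
)
print(config.d_model, config.num_layers, config.num_heads)
```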
## Evaluation
- Metric: chrF
- Reported chrF: 57.27
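chrF is a character n-gram F-score: precision and recall over character n-grams (orders 1–6 by default), combined with an F2 weighting that favors recall. A minimal self-contained sketch of the idea is below; reported scores should always come from sacrebleu's implementation, which this sketch only approximates (e.g. it omits epsilon smoothing and chrF++'s word n-grams).

```python
from collections import Counter

def char_ngrams(text, n):
    """Counts of character n-grams, ignoring whitespace (chrF default)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2):
    """Simplified chrF: mean F-beta over character n-gram orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())      # clipped n-gram matches
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue                              # no n-grams of this order
        prec, rec = overlap / hyp_total, overlap / ref_total
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("Who are you?", "Who are you?"))  # identical strings score 100.0
```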
## Inference Example
Example translations generated by the model:
ZH: 你是谁?
EN: Who are you?
ZH: 我喜欢喝咖啡。
EN: I like coffee.
## Tokenizer
This repository includes `spm.model`, a SentencePiece tokenizer trained jointly on Chinese and English text from OPUS-100.
Tokenizer settings:
- Vocabulary size: 16,000
- Unigram model
- Character coverage: 0.9995
## Usage
Load the model and the SentencePiece tokenizer, prefix the Chinese sentence with `translate Chinese to English: `, then generate:

```python
import torch
import sentencepiece as spm
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("BrandenTung/t5-zh-en-opus")
sp = spm.SentencePieceProcessor()
sp.load("spm.model")

# The model expects this task prefix before the Chinese source sentence.
text = "translate Chinese to English: 你是谁?"
input_ids = torch.tensor([sp.encode(text) + [sp.eos_id()]])
output_ids = model.generate(input_ids, max_length=128)

# Drop id 0 (T5's pad / decoder-start token) before detokenizing.
print(sp.decode([i for i in output_ids[0].tolist() if i != 0]))
```

Generation arguments such as `max_length` are illustrative defaults, not settings prescribed by this card.