# T5 Zh→En (OPUS-100)
This model is a T5-style encoder–decoder Transformer trained for Chinese → English translation on OPUS-100 (500,000 sentence pairs).
## Dataset
- Helsinki-NLP/opus-100 (en-zh)
- Training size: 500,000 sentence pairs
## Model Architecture
- d_model = 512
- d_ff = 2048
- 4 encoder layers
- 4 decoder layers
- num_heads = 8
- max sequence length = 128
- Cosine learning rate scheduler
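For reference, these hyperparameters map onto a Hugging Face `T5Config` roughly as follows. This is a sketch, not the repository's actual config: fields not listed on this card (e.g. `d_kv`, dropout, relative-attention settings) are left at library defaults, which may differ from the trained model.

```python
from transformers import T5Config

# Hyperparameters from the card mapped onto a transformers T5Config.
# Unlisted fields (d_kv, dropout_rate, ...) are library defaults, not
# confirmed by this card.
config = T5Config(
    vocab_size=16000,       # matches the SentencePiece vocabulary below
    d_model=512,
    d_ff=2048,
    num_layers=4,           # encoder layers
    num_decoder_layers=4,
    num_heads=8,
)
print(config.d_model, config.num_layers, config.num_heads)
```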
## Evaluation
- Metric: chrF
- Reported chrF: 57.27
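chrF is a character n-gram F-score: precision and recall over character n-grams (orders 1–6 by default), combined with an F2 weighting that favors recall. A minimal self-contained sketch of the idea is below; reported scores should always come from sacrebleu's implementation, which this sketch only approximates (e.g. it omits epsilon smoothing and chrF++'s word n-grams).

```python
from collections import Counter

def char_ngrams(text, n):
    """Counts of character n-grams, ignoring whitespace (chrF default)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2):
    """Simplified chrF: mean F-beta over character n-gram orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())      # clipped n-gram matches
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue                              # no n-grams of this order
        prec, rec = overlap / hyp_total, overlap / ref_total
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("Who are you?", "Who are you?"))  # identical strings score 100.0
```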
## Inference Example
Example translations generated by the model:
ZH: 你是谁?
EN: Who are you?
ZH: 我喜欢喝咖啡。
EN: I like coffee.
## Tokenizer
This repository includes `spm.model`, a SentencePiece tokenizer trained jointly on Chinese and English text from OPUS-100.
Tokenizer settings:
- Vocabulary size: 16,000
- Unigram model
- Character coverage: 0.9995
## Usage
Load the model and the SentencePiece tokenizer, prefix the Chinese sentence with `translate Chinese to English: `, then generate:

```python
import torch
import sentencepiece as spm
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("BrandenTung/t5-zh-en-opus")
sp = spm.SentencePieceProcessor()
sp.load("spm.model")

# The model expects this task prefix before the Chinese source sentence.
text = "translate Chinese to English: 你是谁?"
input_ids = torch.tensor([sp.encode(text) + [sp.eos_id()]])
output_ids = model.generate(input_ids, max_length=128)

# Drop id 0 (T5's pad / decoder-start token) before detokenizing.
print(sp.decode([i for i in output_ids[0].tolist() if i != 0]))
```

Generation arguments such as `max_length` are illustrative defaults, not settings prescribed by this card.