KrorngAI


TrorYongOCR

This repository contains model weights and configuration files for the pre-trained model.

TrorYongOCR is a tiny encoder-decoder model for the Scene Text Recognition task. Its image encoder projects the image into the character embedding space, allowing the text decoder to process the image encoding as a prefill prompt and generate character tokens in an autoregressive manner. The current pre-trained weights support two languages: Khmer and English.
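The prefill-then-generate loop can be sketched as follows. This is purely illustrative: `next_token_logits` is a stub standing in for the real image-conditioned decoder, and the token ids are made up.

```python
# Minimal sketch of prefill + autoregressive decoding (illustrative only:
# `next_token_logits` is a stub, not the real decoder).
BOS, EOS = 0, 1  # hypothetical special-token ids

def next_token_logits(prefix):
    # Stub scorer: favor token 2 until the prefix grows, then favor EOS.
    return [0.0, 1.0, 0.5] if len(prefix) > 5 else [0.0, 0.1, 1.0]

def greedy_decode(image_tokens, max_new_tokens=8):
    """Greedily generate character tokens conditioned on the image prefill."""
    generated = [BOS]
    for _ in range(max_new_tokens):
        # The image encoding acts as a prefill prompt ahead of the text tokens.
        logits = next_token_logits(list(image_tokens) + generated)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == EOS:  # stop once the end-of-sequence token wins
            break
        generated.append(next_id)
    return generated[1:]  # drop BOS
```

The real model scores tokens with the transformer decoder instead of a stub, but the control flow is the same.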

Model Details

  • Developed by: KHUN Kimang (Ph.D.)
  • Shared by: KrorngAI
  • Model type: OCR (Optical Character Recognition)
  • Language(s) (NLP): Khmer and English
  • Model size: 7.09M parameters (F32)

Model Architecture

Model Sources

This model has been pushed to the Hub using the PyTorchModelHubMixin integration:

Model Configuration

The model configuration is given below. The image input dimension is $(H, W) = (32, 128)$, where $H$ and $W$ are the height and width of the image respectively; the patch size is $(4, 8)$; and the block size, i.e. the maximum number of input tokens, is $192$. The transformer configuration is as follows: there are $4$ blocks, each with embedding dimension $d_{model}=384$ and $h=6$ heads. In particular, the encoding blocks (blocks $1$ to $3$) have MLP dimension $d_{MLP}=2d_{model}=768$, while the decoding block has $d_{MLP}=4d_{model}=1536$.

| Layer | $d_{model}$ | $h$ | $d_{MLP}$ | Role    |
|-------|-------------|-----|-----------|---------|
| 1     | 384         | 6   | 768       | Encoder |
| 2     | 384         | 6   | 768       | Encoder |
| 3     | 384         | 6   | 768       | Encoder |
| 4     | 384         | 6   | 1536      | Decoder |
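A quick sanity check of the token budget implied by this configuration, assuming standard non-overlapping ViT-style patches and that the block size is shared between image and text tokens (my assumption):

```python
# Sanity check of the token budget implied by the configuration above.
H, W = 32, 128       # input image height and width
ph, pw = 4, 8        # patch height and width
block_size = 192     # maximum number of input tokens

num_patches = (H // ph) * (W // pw)      # image tokens fed as the prefill
char_budget = block_size - num_patches   # tokens left for generated characters
print(num_patches, char_budget)  # 128 64
```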

Training Details

The pre-trained weights of TrorYongOCR can be found here. They were obtained by training on the seanghay/khmer-hanuman-100k and SoyVitou/KhmerSynthetic1M datasets. Since there is no benchmark dataset for the Khmer language, the metric I can share here is that TrorYongOCR achieves a 6.8% character error rate (CER) on my test set of 33,613 samples. This is a strong result for an OCR model; note, however, that my test set is a split of the two datasets mentioned above.

Datasets

Both datasets have aspect ratios, $\frac{W}{H}$, varying from $1$ to $15$. Since my input image dimension has ratio $\frac{128}{32}=4$, images with extreme ratios are resized drastically, and the characters inside them are distorted excessively. This severely degrades the character features and makes it hard for the model to capture meaningful features in the transformed images. To resolve this, all images with an aspect ratio larger than $5$ are filtered out, leaving a combined dataset of 336,135 samples.
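The filtering step can be sketched as follows; the $(W, H)$ pairs are hypothetical sample sizes, not taken from the datasets.

```python
# Sketch of the aspect-ratio filter described above.
MAX_ASPECT_RATIO = 5.0

def keep_sample(width: int, height: int) -> bool:
    """Keep an image only if W/H does not exceed the cutoff."""
    return width / height <= MAX_ASPECT_RATIO

sizes = [(128, 32), (480, 32), (160, 32)]  # (W, H); 480/32 = 15 is extreme
kept = [s for s in sizes if keep_sample(*s)]
print(kept)  # [(128, 32), (160, 32)]
```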

KhmerSynthetic1M

KhmerSynthetic1M is a dataset by Mr. Soy Vitou. It contains images in a gray monochromatic color palette (black, white, gray, etc.). The distribution of the number of tokens, i.e. the frequency of each token count, is fairly uniform, with a maximum of around $120$ tokens. This implies that some images have aspect ratios well above $4$.

khmer-hanuman-100k

This dataset by Mr. Yat Seanghay contains images with a variety of background colors and character colors.

Combined dataset

The final dataset has 336,135 samples: 33,613 for the test set, 3,025 for the validation set, and the rest for the train set. TrorYongOCR is trained for 20 epochs in float16 mixed precision using the LightningAI package.
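For reference, the train-set size implied by these numbers:

```python
# Split sizes implied by the numbers above.
total = 336_135
test_size = 33_613   # ~10% held out for testing
val_size = 3_025
train_size = total - test_size - val_size
print(train_size)  # 299497
```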

Weight Initialization

I initialize the weights following the schemes regularly used in SOTA models. The code to initialize the weights is given below.

As an exception, the position embedding used in the decoding block is initialized with $std=1.0$.

from typing import Sequence

import torch.nn as nn


def init_weights(self, module: nn.Module, name: str = '', exclude: Sequence[str] = ()):
    """Initialize the weights using the typical initialization schemes used in SOTA models."""
    # Skip modules whose name matches one of the excluded prefixes.
    if any(map(name.startswith, exclude)):
        return
    if isinstance(module, nn.Linear):
        # Truncated normal for projection weights, zeros for biases.
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.padding_idx is not None:
            # The padding token embedding must stay zero.
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, nn.Conv2d):
        # Kaiming initialization for convolutional layers (e.g. the patch embedding).
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, (nn.LayerNorm, nn.BatchNorm2d, nn.GroupNorm)):
        # Normalization layers start as the identity transform.
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)
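Since `init_weights` takes a module name, it is typically driven from `named_modules()`. When no exclusion is needed, PyTorch's `Module.apply` is a simpler route; a minimal, self-contained sketch reproducing only the Linear branch:

```python
# Hedged usage sketch: `Module.apply` visits every submodule recursively,
# so a standalone initializer can be applied in one call.
import torch
import torch.nn as nn

def init_linear(module: nn.Module) -> None:
    # Same scheme as the Linear branch above.
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
model.apply(init_linear)
```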

Citation

BibTeX:

@online{khun2026,
  author = {KHUN, Kimang},
  title = {TrorYongOCR: {Encoder-Decoder} {Model} for {Scene} {Text}
    {Recognition}},
  date = {2026-02-19},
  url = {https://kimang18.github.io/krorngai-blog/TrorYongOCR/},
  langid = {en}
}

Model Card Author

  • Name (in Khmer): αž”αžŽαŸ’αžŒαž·αž αžƒαž»αž“ αž‚αžΈαž˜αž’αžΆαž„
  • Name: KHUN Kimang (Ph.D.)

Acknowledgement

LightningAI and Google Colab did not specifically sponsor this project, but the model was trained thanks to their free credits. So, huge thanks to LightningAI and Google Colab.

Thanks to Mr. Yat Seanghay and Mr. Soy Vitou for their publicly available datasets.

Model Card Contact

If you have any questions, please reach out at Facebook Page.
