---
language:
- ta
- en
license: apache-2.0
base_model: intfloat/multilingual-e5-base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- tamil
- embedding
- sentence-transformers
- matryoshka
- dravidian
- cross-lingual
model-index:
- name: Tamil-Embed-Base
  results:
  - task:
      type: STS
    dataset:
      name: IndicCrosslingualSTS (en-ta)
      type: mteb/IndicCrosslingualSTS
    metrics:
    - type: spearman
      value: 0.489
      name: Spearman (en-ta)
---
# Tamil-Embed-Base
A Tamil-specialized sentence embedding model fine-tuned from multilingual-e5-base (278M parameters) using Matryoshka representation learning.
**Paper:** *"A Thousand Language Problem: Morphological Understanding in Linguistic AI"*
## Model Details
| Property | Value |
|----------|-------|
| Base model | [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) |
| Parameters | 278M |
| Embedding dimensions | 768 (supports Matryoshka: 768, 512, 256, 128, 64) |
| Training data | NLI entailment pairs (ta) + Samanantar parallel corpus (~50K pairs) |
| Loss function | MatryoshkaLoss + MultipleNegativesRankingLoss |
## Training
Two-stage training pipeline:
1. **Stage 1 (NLI Warm-up):** Fine-tune on Tamil NLI entailment pairs (ANLI, FEVER, LING, MNLI, WANLI) with MatryoshkaLoss wrapping MultipleNegativesRankingLoss
2. **Stage 2 (Retrieval):** Fine-tune on Samanantar English-Tamil parallel corpus with hard negatives
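MatryoshkaLoss computes the wrapped ranking loss at each configured truncation level (768, 512, 256, 128, 64) and combines them, so any prefix of the full vector remains a usable embedding on its own. A minimal numpy sketch of how a truncated prefix is scored (random vectors stand in for model output; real embeddings would come from `model.encode`):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # Scale vectors to unit length along the last axis.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Two toy 768-dim "embeddings" (random stand-ins for model output).
a, b = normalize(rng.standard_normal((2, 768)))

def cos_at_dim(x, y, dim):
    # Keep only the first `dim` components, re-normalize, then take the
    # dot product -- cosine similarity at that Matryoshka level.
    xt, yt = normalize(x[:dim]), normalize(y[:dim])
    return float(xt @ yt)

for dim in (768, 512, 256, 128, 64):
    print(dim, round(cos_at_dim(a, b, dim), 3))
```

Because the loss is optimized at every truncation level during training, a 256-dim slice of this model behaves close to the full 768-dim vector at inference time.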
## MTEB Results
IndicCrosslingualSTS benchmark (Spearman correlation):
| Language Pair | Score |
|---------------|-------|
| en-hi (Hindi) | 0.640 |
| en-kn (Kannada) | 0.584 |
| en-ml (Malayalam) | 0.582 |
| en-bn (Bengali) | 0.537 |
| en-pa (Punjabi) | 0.536 |
| en-gu (Gujarati) | 0.533 |
| en-as (Assamese) | 0.512 |
| **en-ta (Tamil)** | **0.489** |
| en-mr (Marathi) | 0.485 |
| en-te (Telugu) | 0.468 |
## Usage
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Tamil-ai/tamil-embed-base")
# E5-style models require "query: " / "passage: " prefixes on the input text.
sentences = [
    "query: தமிழ் மொழியின் வரலாறு என்ன?",  # "What is the history of the Tamil language?"
    "passage: தமிழ் மொழி 2000 ஆண்டுகளுக்கும் மேலான வரலாற்றைக் கொண்ட செம்மொழியாகும்.",  # "Tamil is a classical language with over 2000 years of history."
    "passage: Python is a popular programming language.",
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, 768)
# Compute similarity
from sentence_transformers.util import cos_sim
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities) # Tamil passage should score higher
```
### Matryoshka (variable dimensions)
```python
import numpy as np

# Truncate embeddings to a smaller Matryoshka dimension for faster search
# with minimal quality loss; re-normalize so cosine similarity stays valid.
embeddings = model.encode(sentences)
embeddings_256 = embeddings[:, :256]
embeddings_256 /= np.linalg.norm(embeddings_256, axis=1, keepdims=True)
embeddings_128 = embeddings[:, :128]
embeddings_128 /= np.linalg.norm(embeddings_128, axis=1, keepdims=True)
```
## Intended Use
- Tamil semantic search and retrieval
- Cross-lingual English-Tamil similarity
- Tamil document clustering
- RAG (Retrieval Augmented Generation) for Tamil
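For the search and RAG use cases above, passages are encoded once with the `passage: ` prefix and each incoming query with the `query: ` prefix; retrieval is then a cosine top-k over the stored embedding matrix. A minimal sketch with toy unit vectors standing in for `model.encode` output (the query vector is deliberately constructed near document 2 so the demo retrieves it):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for model.encode(["passage: ..."]) output, unit-normalized.
doc_embs = rng.standard_normal((5, 768))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# A query embedding constructed close to document 2 for the demo.
query_emb = doc_embs[2] + 0.01 * rng.standard_normal(768)
query_emb /= np.linalg.norm(query_emb)

scores = doc_embs @ query_emb    # cosine similarity (all unit vectors)
top_k = np.argsort(-scores)[:3]  # indices of the 3 best passages
print(top_k)  # document 2 ranks first
```

In a real pipeline `doc_embs` would be precomputed and stored (optionally truncated to a Matryoshka dimension), and the top-k passages fed to the generator.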
## Citation
```bibtex
@misc{tamilai2026embed,
  title={A Thousand Language Problem: Morphological Understanding in Linguistic AI},
  author={Tamil-AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Tamil-ai/tamil-embed-base}
}
```