---
language:
- ta
- en
license: apache-2.0
base_model: intfloat/multilingual-e5-base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- tamil
- embedding
- sentence-transformers
- matryoshka
- dravidian
- cross-lingual
model-index:
- name: Tamil-Embed-Base
  results:
  - task:
      type: STS
    dataset:
      name: IndicCrosslingualSTS (en-ta)
      type: mteb/IndicCrosslingualSTS
    metrics:
    - type: spearman
      value: 0.489
      name: Spearman (en-ta)
---
# Tamil-Embed-Base
A Tamil-specialized sentence embedding model fine-tuned from multilingual-e5-base (278M parameters) using Matryoshka representation learning.
**Paper:** *"A Thousand Language Problem: Morphological Understanding in Linguistic AI"*
## Model Details
| Property | Value |
|----------|-------|
| Base model | [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) |
| Parameters | 278M |
| Embedding dimensions | 768 (supports Matryoshka: 768, 512, 256, 128, 64) |
| Training data | NLI entailment pairs (ta) + Samanantar parallel corpus (~50K pairs) |
| Loss function | MatryoshkaLoss + MultipleNegativesRankingLoss |
## Training
Two-stage training pipeline:
1. **Stage 1 (NLI Warm-up):** Fine-tune on Tamil NLI entailment pairs (ANLI, FEVER, LING, MNLI, WANLI) with MatryoshkaLoss wrapping MultipleNegativesRankingLoss
2. **Stage 2 (Retrieval):** Fine-tune on Samanantar English-Tamil parallel corpus with hard negatives
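MatryoshkaLoss computes the wrapped ranking loss at each configured truncation level (768, 512, 256, 128, 64) and combines them, so any prefix of the full vector remains a usable embedding on its own. A minimal numpy sketch of how a truncated prefix is scored (random vectors stand in for model output; real embeddings would come from `model.encode`):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # Scale vectors to unit length along the last axis.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Two toy 768-dim "embeddings" (random stand-ins for model output).
a, b = normalize(rng.standard_normal((2, 768)))

def cos_at_dim(x, y, dim):
    # Keep only the first `dim` components, re-normalize, then take the
    # dot product -- cosine similarity at that Matryoshka level.
    xt, yt = normalize(x[:dim]), normalize(y[:dim])
    return float(xt @ yt)

for dim in (768, 512, 256, 128, 64):
    print(dim, round(cos_at_dim(a, b, dim), 3))
```

Because the loss is optimized at every truncation level during training, a 256-dim slice of this model behaves close to the full 768-dim vector at inference time.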
## MTEB Results
IndicCrosslingualSTS benchmark (Spearman correlation):
| Language Pair | Score |
|---------------|-------|
| en-hi (Hindi) | 0.640 |
| en-kn (Kannada) | 0.584 |
| en-ml (Malayalam) | 0.582 |
| en-bn (Bengali) | 0.537 |
| en-pa (Punjabi) | 0.536 |
| en-gu (Gujarati) | 0.533 |
| en-as (Assamese) | 0.512 |
| **en-ta (Tamil)** | **0.489** |
| en-mr (Marathi) | 0.485 |
| en-te (Telugu) | 0.468 |
## Usage
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Tamil-ai/tamil-embed-base")
# E5-style models require "query: " / "passage: " prefixes on the input text.
sentences = [
    "query: தமிழ் மொழியின் வரலாறு என்ன?",  # "What is the history of the Tamil language?"
    "passage: தமிழ் மொழி 2000 ஆண்டுகளுக்கும் மேலான வரலாற்றைக் கொண்ட செம்மொழியாகும்.",  # "Tamil is a classical language with over 2000 years of history."
    "passage: Python is a popular programming language.",
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, 768)
# Compute similarity
from sentence_transformers.util import cos_sim
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities) # Tamil passage should score higher
```
### Matryoshka (variable dimensions)
```python
import numpy as np

# Truncate embeddings to a smaller Matryoshka dimension for faster search
# with minimal quality loss; re-normalize so cosine similarity stays valid.
embeddings = model.encode(sentences)
embeddings_256 = embeddings[:, :256]
embeddings_256 /= np.linalg.norm(embeddings_256, axis=1, keepdims=True)
embeddings_128 = embeddings[:, :128]
embeddings_128 /= np.linalg.norm(embeddings_128, axis=1, keepdims=True)
```
## Intended Use
- Tamil semantic search and retrieval
- Cross-lingual English-Tamil similarity
- Tamil document clustering
- RAG (Retrieval Augmented Generation) for Tamil
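For the search and RAG use cases above, passages are encoded once with the `passage: ` prefix and each incoming query with the `query: ` prefix; retrieval is then a cosine top-k over the stored embedding matrix. A minimal sketch with toy unit vectors standing in for `model.encode` output (the query vector is deliberately constructed near document 2 so the demo retrieves it):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for model.encode(["passage: ..."]) output, unit-normalized.
doc_embs = rng.standard_normal((5, 768))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# A query embedding constructed close to document 2 for the demo.
query_emb = doc_embs[2] + 0.01 * rng.standard_normal(768)
query_emb /= np.linalg.norm(query_emb)

scores = doc_embs @ query_emb    # cosine similarity (all unit vectors)
top_k = np.argsort(-scores)[:3]  # indices of the 3 best passages
print(top_k)  # document 2 ranks first
```

In a real pipeline `doc_embs` would be precomputed and stored (optionally truncated to a Matryoshka dimension), and the top-k passages fed to the generator.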
## Citation
```bibtex
@misc{tamilai2026embed,
  title={A Thousand Language Problem: Morphological Understanding in Linguistic AI},
  author={Tamil-AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Tamil-ai/tamil-embed-base}
}
```