MemTok-Qwen2.5-VL-3B (ViDoRe)
This model uses Memory Tokens (MemTok) to compress multi-vector visual document representations for efficient ColBERT-style late-interaction retrieval. Model weights are initialized from Qwen2.5-VL-3B-Instruct and fine-tuned with bidirectional attention on the ColPali train set for text-to-visual-document retrieval.
MemTok compresses ~1300 visual document token vectors into a fixed budget of 64 vectors (95.1% compression) via learnable memory tokens that aggregate document information through attention.
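The compression figure can be checked with a few lines of arithmetic. The sketch below assumes the "~1300" vectors correspond to the 1297 tokens listed for the uncompressed baseline in the results table; the helper name `compression_ratio` is illustrative and not part of the released code.

```python
def compression_ratio(original_vectors: int, budget: int) -> float:
    """Fraction of vectors removed when only `budget` of them are kept."""
    return 1.0 - budget / original_vectors


# 1297 is the token count of the uncompressed baseline in the results table
# (assumed here to be the "~1300" vectors mentioned in the text).
ratio = compression_ratio(original_vectors=1297, budget=64)
print(f"{ratio:.1%}")  # 95.1%
```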
Method Overview
MemTok appends a set of m learnable memory tokens to the document token sequence. The concatenated sequence is encoded with a bidirectional transformer; after self-attention, each memory token has attended over the full document. The final hidden states of the m memory tokens form the compressed multi-vector representation used for ColBERT-style MaxSim retrieval.
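The mechanism can be illustrated with a toy example in plain Python. This is only a sketch under strong simplifications: a single unmasked attention layer with identity projections stands in for the full bidirectional transformer, the memory tokens are random rather than learned, and the names `bidirectional_self_attention` and `memtok_compress` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_self_attention(seq):
    """One unmasked attention layer with identity Q/K/V projections (toy stand-in)."""
    dim = len(seq[0])
    out = []
    for q in seq:
        # No causal mask: every position attends to every other position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim)])
    return out


def memtok_compress(doc_tokens, memory_tokens):
    """Append memory tokens, encode, and keep only the memory-token outputs."""
    hidden = bidirectional_self_attention(doc_tokens + memory_tokens)
    return hidden[-len(memory_tokens):]


random.seed(0)
dim, n_doc, m = 8, 20, 4
doc = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_doc)]
mem = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(m)]  # learned in the real model
compressed = memtok_compress(doc, mem)
print(len(compressed), len(compressed[0]))  # 4 8
```

Whatever the document length, the output always has m vectors, which is what gives the index a fixed per-document budget.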
Results on ViDoRe v2
| Method | Tokens | nDCG@5 (Avg) | Bio | Econ | ESG-R | ESG-H |
|---|---|---|---|---|---|---|
| ColPali | – | 53.3 | 56.5 | 49.9 | 55.7 | 51.1 |
| ColQwenOmni | – | 56.5 | 56.5 | 53.2 | 54.2 | 62.2 |
| MetaEmbed | 64 | 58.8 | 58.7 | 55.5 | 57.4 | 63.7 |
| Baseline (Ours, uncompressed) | 1297 | 60.0 | 61.4 | 53.9 | 57.0 | 67.6 |
| SeqResize | 64 | 51.7 | 54.7 | 53.5 | 45.2 | 53.5 |
| MemTok (This model) | 64 | 54.3 | 56.8 | 53.0 | 46.4 | 61.4 |
| H-Pool | 64 | 56.4 | 59.6 | 52.1 | 53.4 | 60.6 |
| AGC | 64 | 56.7 | 59.0 | 54.5 | 55.8 | 57.3 |
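
For readers unfamiliar with the metric, the sketch below shows one common way to compute nDCG@5 for a single query. It assumes linear gain and a log2 rank discount, and that the input list holds the graded relevance of every judged document in ranked order; the benchmark's official evaluation code may differ in these details.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevant documents retrieved at ranks 2 and 5 (illustrative values only).
score = ndcg_at_k([0, 1, 0, 0, 1, 0], k=5)
print(f"nDCG@5 = {score:.3f}")
```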
Model Details
| Property | Value |
|---|---|
| Initial weights | Qwen2.5-VL-3B-Instruct |
| Architecture | Qwen2.5-VL with bidirectional attention |
| Hidden dimension | 2048 |
| Compression method | MemTok (memory tokens) |
| Memory tokens | 64 learned tokens (<|mem0|> – <|mem63|>) appended to document |
| Budget | 64 vectors per document |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | "Query: " |
| Passage prefix | "Passage: " |
| Precision | bfloat16 |
| Max image tokens | 1280 |
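
The details above allow a rough estimate of per-document index size. The sketch assumes each stored vector keeps the full hidden dimension of 2048 and is stored in bfloat16 (2 bytes per value), with no extra quantization or index overhead; `index_bytes` is an illustrative helper, not part of the released code.

```python
def index_bytes(num_vectors: int, dim: int = 2048, bytes_per_value: int = 2) -> int:
    """Raw storage for one document's multi-vector representation."""
    return num_vectors * dim * bytes_per_value


compressed = index_bytes(64)      # MemTok budget
uncompressed = index_bytes(1297)  # uncompressed baseline token count

print(f"compressed:   {compressed / 1024:.0f} KiB")
print(f"uncompressed: {uncompressed / 1024 ** 2:.2f} MiB")
```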
Usage
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
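# The src.* modules below are not pip packages; they are assumed to come from the project's own code repository.
from src.arguments import ModelArguments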
from src.encoder.multivec_encoder import MultiVecEncoder
from src.models.qwen2_5_vl_embed.qwen2_5_vl_embed import Qwen2_5ForEmbedding
from src.utils import get_appending_token_strings
MODEL_ID = "hltcoe/MemTok_qwen2.5-vl_colpali"
IMAGE_PATH = "PLACEHOLDER"
NUM_MEMORY_TOKENS = 64
APPENDING_SUFFIX = "".join(get_appending_token_strings(NUM_MEMORY_TOKENS))
# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="memory",
    normalize=True,
    num_appending_token=NUM_MEMORY_TOKENS,
    use_parametric_appending_tokens=True,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MultiVecEncoder.load(
    Qwen2_5ForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()
# --- Encode an image document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "image", "image": IMAGE_PATH, "max_pixels": 1003520, "min_pixels": 614656},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
text += APPENDING_SUFFIX
image_inputs, video_inputs = process_vision_info(passage_messages)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt",
).to("cuda")
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
print(doc_embeddings.shape)
# doc_embeddings: (1, 64, 2048), i.e. 64 MemTok vectors
# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: What types of tissues are unable to regenerate spontaneously?"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
print(query_embeddings.shape)
# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
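For intuition about the final scoring step, here is a minimal pure-Python sketch of ColBERT-style MaxSim. It is not the implementation behind `model.compute_similarity`, which is not shown here; it ignores padding masks and assumes the vectors are already L2-normalized, so dot products are cosine similarities.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query vectors of the best dot product with any document vector."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return total


# Two unit-length query vectors and two unit-length document vectors (toy values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.6, 0.8]]
toy_score = maxsim_score(query, doc)
print(f"{toy_score:.1f}")  # 1.8
```

With a 64-vector budget, each query vector is compared against only 64 document vectors instead of roughly 1300, which is where the retrieval-time savings come from.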
Command line usage
For running inference and evaluation from the command line, see the Quick Start section.
Citation
@misc{qin2026multivectorindexcompressionmodality,
      title={Multi-Vector Index Compression in Any Modality},
      author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
      year={2026},
      eprint={2602.21202},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.21202},
}