# MemTok-Qwen2.5-VL-3B (ViDoRe)

This model uses Memory Tokens (MemTok) to compress multi-vector visual document representations for efficient ColBERT-style late-interaction retrieval. Weights are initialized from Qwen2.5-VL-3B-Instruct and fine-tuned with bidirectional attention on the ColPali training set for text-to-visual-document retrieval.

MemTok compresses ~1300 visual document token vectors into a fixed budget of 64 vectors (95.1% compression) via learnable memory tokens that aggregate document information through attention.
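The quoted compression rate follows directly from the token counts above (1297 uncompressed vectors, a 64-vector budget); a quick sanity check:

```python
# Compression ratio implied by the figures above: 1297 document token
# vectors reduced to a fixed budget of 64 memory-token vectors.
original_vectors = 1297
budget = 64
compression = 1 - budget / original_vectors
print(f"{compression:.1%}")  # → 95.1%
```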


## Method Overview

MemTok appends a set of m learnable memory tokens to the document token sequence. The concatenated sequence is encoded with a bidirectional transformer; after self-attention, each memory token has attended over the full document. The final hidden states of the m memory tokens form the compressed multi-vector representation used for ColBERT-style MaxSim retrieval.
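The mechanism can be sketched in a few lines of PyTorch. This is an illustrative toy, not the actual Qwen2.5-VL code: `MemTokSketch`, its sizes, and the generic `nn.TransformerEncoder` backbone are all stand-in assumptions, but the flow matches the description above — append m learnable memory tokens, run bidirectional self-attention, and keep only the memory tokens' final hidden states.

```python
import torch
import torch.nn as nn

class MemTokSketch(nn.Module):
    """Toy MemTok: compress a variable-length token sequence to m vectors."""

    def __init__(self, d_model=64, num_memory_tokens=8):
        super().__init__()
        # m learnable memory tokens, shared across all documents
        self.mem = nn.Parameter(torch.randn(num_memory_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, doc_tokens):  # doc_tokens: (batch, seq_len, d_model)
        b = doc_tokens.size(0)
        mem = self.mem.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([doc_tokens, mem], dim=1)  # append memory tokens
        h = self.encoder(x)                      # bidirectional self-attention
        out = h[:, -self.mem.size(0):]           # keep only memory-token states
        return nn.functional.normalize(out, dim=-1)  # L2-normalize

doc = torch.randn(2, 100, 64)     # batch of 2 docs, 100 token vectors each
compressed = MemTokSketch()(doc)
print(compressed.shape)           # torch.Size([2, 8, 64])
```

Regardless of the input length, the output is a fixed budget of m vectors per document, which is what makes the index size predictable.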


## Results on ViDoRe v2

| Method | Tokens | nDCG@5 (Avg) | Bio | Econ | ESG-R | ESG-H |
|---|---|---|---|---|---|---|
| ColPali | – | 53.3 | 56.5 | 49.9 | 55.7 | 51.1 |
| ColQwenOmni | – | 56.5 | 56.5 | 53.2 | 54.2 | 62.2 |
| MetaEmbed | 64 | 58.8 | 58.7 | 55.5 | 57.4 | 63.7 |
| Baseline (Ours, uncompressed) | 1297 | 60.0 | 61.4 | 53.9 | 57.0 | 67.6 |
| SeqResize | 64 | 51.7 | 54.7 | 53.5 | 45.2 | 53.5 |
| **MemTok (this model)** | 64 | 54.3 | 56.8 | 53.0 | 46.4 | 61.4 |
| H-Pool | 64 | 56.4 | 59.6 | 52.1 | 53.4 | 60.6 |
| AGC | 64 | 56.7 | 59.0 | 54.5 | 55.8 | 57.3 |

## Model Details

| Property | Value |
|---|---|
| Initial weights | Qwen2.5-VL-3B-Instruct |
| Architecture | Qwen2.5-VL with bidirectional attention |
| Hidden dimension | 2048 |
| Compression method | MemTok (memory tokens) |
| Memory tokens | 64 learned tokens (`<\|mem0\|>` – `<\|mem63\|>`) appended to the document |
| Budget | 64 vectors per document |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | `"Query: "` |
| Passage prefix | `"Passage: "` |
| Precision | bfloat16 |
| Max image tokens | 1280 |
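The MaxSim scoring listed above matches every query token vector against its most similar document vector and sums the maxima. A minimal sketch (the `maxsim` helper is illustrative, not the model's `compute_similarity` API; shapes assume L2-normalized embeddings as in this model):

```python
import torch

def maxsim(q, d, q_mask=None):
    # q: (num_query_tokens, dim), d: (num_doc_vectors, dim), L2-normalized
    sim = q @ d.T                  # cosine similarities, (nq, nd)
    best = sim.max(dim=1).values   # best document match per query token
    if q_mask is not None:
        best = best * q_mask       # zero out padded query positions
    return best.sum()              # late-interaction score

q = torch.nn.functional.normalize(torch.randn(5, 2048), dim=-1)   # query tokens
d = torch.nn.functional.normalize(torch.randn(64, 2048), dim=-1)  # 64 MemTok vectors
print(maxsim(q, d))
```

Because every document contributes exactly 64 vectors, each query-document score costs one small (nq × 64) similarity matrix instead of (nq × ~1300).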

## Usage

```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

from src.arguments import ModelArguments
from src.encoder.multivec_encoder import MultiVecEncoder
from src.models.qwen2_5_vl_embed.qwen2_5_vl_embed import Qwen2_5ForEmbedding
from src.utils import get_appending_token_strings

MODEL_ID = "hltcoe/MemTok_qwen2.5-vl_colpali"
IMAGE_PATH = "PLACEHOLDER"
NUM_MEMORY_TOKENS = 64
APPENDING_SUFFIX = "".join(get_appending_token_strings(NUM_MEMORY_TOKENS))

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="memory",
    normalize=True,
    num_appending_token=NUM_MEMORY_TOKENS,
    use_parametric_appending_tokens=True,
    attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MultiVecEncoder.load(
    Qwen2_5ForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode an image document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "image", "image": IMAGE_PATH, "max_pixels": 1003520, "min_pixels": 614656},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
text += APPENDING_SUFFIX
image_inputs, video_inputs = process_vision_info(passage_messages)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt",
).to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
        print(doc_embeddings.shape)
        # doc_embeddings: (1, 64, 2048) — 64 MemTok vectors

# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: What types of tissues are unable to regenerate spontaneously?"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
        print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```

### Command line usage

For running inference and evaluation from the command line, see the Quick Start section.

## Citation

```bibtex
@misc{qin2026multivectorindexcompressionmodality,
      title={Multi-Vector Index Compression in Any Modality},
      author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
      year={2026},
      eprint={2602.21202},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.21202},
}
```