	# granite-docling ONNX Conversion Guide

	## Technical Reproduction Instructions

	This document provides complete instructions for reproducing the granite-docling ONNX conversion.

	### Prerequisites

	- Python 3.10+
	- ~4GB available RAM
	- ~2GB disk space for conversion environment
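
The prerequisites above can be checked before starting. The following is a minimal sketch using only the standard library. The function name `check_prerequisites` and its default thresholds are illustrative assumptions taken from the list above, and RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys


def check_prerequisites(path=".", min_python=(3, 10), min_free_gb=2.0):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: the defaults mirror the prerequisites listed above.
    """
    problems = []
    # Compare only (major, minor) against the required version.
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # Free space on the filesystem that contains `path`, in decimal GB.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(
            f"Only {free_gb:.1f} GB free at {path!r}, need ~{min_free_gb} GB"
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the script in the directory intended for the conversion prints a warning for each unmet prerequisite and nothing otherwise.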

	### Step 1: Environment Setup

	```bash
	# Create isolated environment
	python3 -m venv onnx_converter
	source onnx_converter/bin/activate # Linux/Mac
	# or onnx_converter\Scripts\activate # Windows

	# Install dependencies
	pip install torch torchvision transformers "optimum[onnxruntime]" safetensors
	```

	### Step 2: Download Original Model

	```bash
	# Download granite-docling SafeTensors model
	mkdir granite-docling-258m
	cd granite-docling-258m

	curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/model.safetensors" -o model.safetensors
	curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/config.json" -o config.json
	curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/tokenizer.json" -o tokenizer.json
	curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/preprocessor_config.json" -o preprocessor_config.json

	# Return to the parent directory so later steps can refer to ./granite-docling-258m
	cd ..
	```
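
Because `curl` without `--fail` saves an error page under the requested file name when a download fails, it can help to check the files before converting. The sketch below is one possible check, not part of the original procedure. The helper name `verify_download` is hypothetical, and the file list is copied from the commands above.

```python
import json
from pathlib import Path

# File names taken from the curl commands above.
EXPECTED_FILES = (
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "preprocessor_config.json",
)


def verify_download(model_dir):
    """Return the names of expected files that are missing or empty.

    Hypothetical helper. Raises json.JSONDecodeError if a .json file is
    present but is not valid JSON (for example, a saved HTML error page).
    """
    model_dir = Path(model_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = model_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
        elif path.suffix == ".json":
            # Parsing is enough to catch truncated or non-JSON content.
            json.loads(path.read_text(encoding="utf-8"))
    return missing
```

Calling `verify_download("granite-docling-258m")` from the parent directory should return an empty list when all four files are in place.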

	### Step 3: Install IBM Experimental Fork

	```bash
	# Clone IBM experimental optimum-onnx fork
	git clone https://github.com/gabe-l-hart/optimum-onnx.git
	cd optimum-onnx
	git checkout Idefics3Support

	# Install experimental fork
	pip install -e . --force-reinstall

	# Leave the clone directory before running the conversion script
	cd ..
	```
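
Before running the conversion, it may be worth confirming that the installed `optimum` package actually exposes the Idefics3 export configuration. This is a small sketch under the assumption that the class lives at the import path used in Step 4; the function name `has_idefics3_export_support` is hypothetical.

```python
import importlib


def has_idefics3_export_support():
    """Return True if optimum exposes Idefics3OnnxConfig, else False.

    The module path and class name are the ones imported in Step 4.
    """
    try:
        module = importlib.import_module("optimum.exporters.onnx.model_configs")
    except ImportError:
        # optimum (or one of its dependencies) is not importable at all.
        return False
    return hasattr(module, "Idefics3OnnxConfig")


if __name__ == "__main__":
    print("Idefics3 export support:", has_idefics3_export_support())
```

A result of `False` suggests that the active environment is not using the fork, for example because a later `pip install` replaced it.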

	### Step 4: Convert to ONNX

	```python
	import os

	# Hide GPUs before torch is imported so the export runs on CPU
	os.environ['CUDA_VISIBLE_DEVICES'] = ''

	from pathlib import Path

	import torch
	from transformers import Idefics3ForConditionalGeneration
	from optimum.exporters.onnx import export
	# Idefics3OnnxConfig is expected to come from the fork installed in Step 3
	from optimum.exporters.onnx.model_configs import Idefics3OnnxConfig

	# Load the model in float32 on CPU
	model = Idefics3ForConditionalGeneration.from_pretrained(
	    './granite-docling-258m',
	    trust_remote_code=True,
	    torch_dtype=torch.float32,
	).to('cpu')

	# Create ONNX config
	onnx_config = Idefics3OnnxConfig(model.config, task='image-to-text')

	# Export to ONNX with opset 17
	output_path = Path('./granite_docling.onnx')
	export(model, onnx_config, output_path, opset=17)

	print(f"ONNX conversion complete: {output_path}")
	```

	### Expected Output

	```
	Initializing Idefics3ModelPatcher
	Entering Idefics3ModelPatcher context
	Patching Idefics3 model
	Using patched position embedding forward
	Exiting Idefics3ModelPatcher context
	ONNX conversion complete: granite_docling.onnx
	```

	### Validation

	```python
	import onnxruntime as ort

	# Test ONNX model loading
	session = ort.InferenceSession('granite_docling.onnx')
	print("✅ ONNX model loads successfully")

	# Check input/output specifications
	for inp in session.get_inputs():
	    print(f"Input: {inp.name} - {inp.shape}")
	for out in session.get_outputs():
	    print(f"Output: {out.name} - {out.shape}")
	```
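
Loading the model only confirms that the file is well formed. A further step, sketched here as an assumption rather than as part of the original validation, is to build placeholder inputs from the reported input specifications and attempt a forward pass. The helper `make_dummy_inputs` and the `ORT_TO_NUMPY` table are hypothetical; the table covers only a few common ONNX Runtime type strings.

```python
import numpy as np

# Subset of ONNX Runtime type strings mapped to NumPy dtypes.
ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
    "tensor(int32)": np.int32,
    "tensor(bool)": np.bool_,
}


def make_dummy_inputs(input_specs, dynamic_dim=1):
    """Build a zero-filled feed dict from session.get_inputs()-style specs.

    Hypothetical helper: every symbolic (non-integer) dimension is replaced
    with `dynamic_dim`, which may not be a shape the model accepts.
    """
    feed = {}
    for spec in input_specs:
        # ONNX Runtime reports dynamic dimensions as strings or None.
        shape = [d if isinstance(d, int) else dynamic_dim for d in spec.shape]
        feed[spec.name] = np.zeros(shape, dtype=ORT_TO_NUMPY[spec.type])
    return feed
```

With the session from the validation snippet, `session.run(None, make_dummy_inputs(session.get_inputs()))` attempts a forward pass. Zero-filled inputs with size-1 dynamic dimensions may be rejected by the model, so a failure here calls for realistic inputs rather than indicating a broken export.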

	## Troubleshooting

	### Common Issues

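	1. "Custom architecture" error: Make sure the IBM experimental fork from Step 3 is the `optimum` package in use
	2. Memory errors: Use CPU-only conversion (`CUDA_VISIBLE_DEVICES=''`)
	3. Import errors: Verify that the experimental fork was installed with `pip install -e .` inside the active virtual environment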

	### Technical Notes

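	- Conversion time: 5-10 minutes on a typical CPU
	- Memory usage: ~4GB RAM during conversion
	- Warnings: `TracerWarning` messages are expected when tracing a complex vision-language model (VLM)
	- File size: ONNX (~1.2GB) vs SafeTensors (~492MB). The ONNX file embeds the computation graph alongside the weights, and Step 4 exports the weights as float32, which likely explains most of the difference if the original checkpoint stores 16-bit weights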
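
The size figures above can be sanity-checked with simple arithmetic. This sketch assumes roughly 258 million parameters (read off the model name, not measured) and ignores graph metadata and any weights that are stored more than once.

```python
def weight_bytes(num_params, bytes_per_param):
    """Raw weight storage in bytes, ignoring metadata and graph structure."""
    return num_params * bytes_per_param


NUM_PARAMS = 258_000_000  # assumption: taken from the "258M" in the model name

fp16_mb = weight_bytes(NUM_PARAMS, 2) / 1e6  # 16-bit weights, decimal MB
fp32_gb = weight_bytes(NUM_PARAMS, 4) / 1e9  # float32 as loaded in Step 4, decimal GB

print(f"16-bit weights:  ~{fp16_mb:.0f} MB")   # ~516 MB
print(f"float32 weights: ~{fp32_gb:.2f} GB")   # ~1.03 GB
```

Under these assumptions, 516 MB is about 492 MiB, which is consistent with the SafeTensors figure if that figure is in binary units. The float32 estimate of about 1.03 GB falls somewhat short of ~1.2GB; the remainder would have to come from the graph itself or from duplicated tensors, which has not been verified here.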

	## Attribution

	- Original model: IBM Research granite-docling-258M
	- Conversion method: IBM experimental Idefics3Support optimum-onnx fork
	- Documentation: lamco-development