# IntroSVG-Qwen2.5-VL-7B

**Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework**

Accepted by CVPR 2026 🎉
## Model Summary
IntroSVG-Qwen2.5-VL-7B is an end-to-end vision-language model that generates high-quality SVG (Scalable Vector Graphics) code directly from natural language descriptions. The model is fine-tuned from Qwen2.5-VL-7B-Instruct through a multi-stage training pipeline that combines supervised fine-tuning (SFT), curriculum learning, chain-of-thought (CoT) reasoning, and direct preference optimization (DPO).
The defining feature of IntroSVG is its introspective generator–critic framework: a single unified model alternates between two roles — generator (producing SVG code) and critic (rendering and evaluating its own output) — enabling an iterative generate → evaluate → refine loop at inference time.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-7B-Instruct |
| Parameters | ~7B |
| Architecture | Vision-Language Model (VLM) |
| Modalities (input) | Text prompts and rendered SVG images (during the critique stage) |
| Modality (output) | SVG source code |
| Training data | SVG-1M (custom corpus, ~1M samples) |
| Training paradigm | SFT → DPO with curriculum learning and CoT |
| License | Apache 2.0 |
## Method Overview
The model is built through three core stages:
### 1. Data Construction
A mixed corpus is synthesized using an early-checkpoint model and a teacher VLM, comprising three subsets:
- Direct generation ($\mathcal{D}_G^{\text{direct}}$) — text-to-SVG pairs
- Correction ($\mathcal{D}_G^{\text{correction}}$) — flawed SVGs paired with refinements
- Critique ($\mathcal{D}_C$) — rendered SVGs paired with critique feedback
### 2. Supervised Fine-Tuning (SFT)
A unified VLM is trained on the mixed dataset, simultaneously acquiring:
- SVG generation capability
- SVG critique capability
### 3. Direct Preference Optimization (DPO)
A teacher VLM scores generated preference pairs, which are used to further optimize the generator policy $M_{\text{Policy}}$ via the DPO loss.
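The DPO loss referenced here is, in its standard form (Rafailov et al., 2023; the notation below is the generic formulation and may differ in detail from the paper's):

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $y_w$ and $y_l$ are the teacher-preferred and rejected SVG generations for prompt $x$, $\pi_\theta$ is the policy being optimized ($M_{\text{Policy}}$), $\pi_{\text{ref}}$ is the frozen SFT reference model, and $\beta$ controls how far the policy may drift from the reference.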
## Introspective Inference Loop
At inference time, the same model performs a closed-loop introspective process:
1. Generate an initial SVG from the prompt.
2. Switch to the critic role: render the SVG and evaluate it.
3. Assign a quality score based on the critique.
4. If unsatisfactory, use the critique to guide the next round of correction.
This loop allows the model to refine its outputs iteratively without any external evaluator.
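The loop can be sketched as follows; `generate`, `critique`, and `score` are hypothetical wrappers around the single IntroSVG model in its two roles (the actual implementation lives in `inference_loop.py` in the repository):

```python
def introspective_loop(prompt, generate, critique, score,
                       max_rounds=3, threshold=0.8):
    """Minimal generate -> evaluate -> refine sketch of the inference loop."""
    svg = generate(prompt)                         # generator role: initial SVG
    for _ in range(max_rounds):
        feedback = critique(prompt, svg)           # critic role: render + evaluate
        if score(feedback) >= threshold:           # quality gate from the critique
            break
        svg = generate(prompt, feedback=feedback)  # correction guided by critique
    return svg
```

The three callables, the round cap, and the score threshold are illustrative; the repository script handles rendering and role switching internally.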
## Intended Use
### Primary use cases
- Text-to-SVG generation for icons, simple illustrations, logos, diagrams, and UI elements
- Programmatic vector graphics design as a creative co-pilot
- Research on vision-language reasoning, code generation, and self-refinement methods
### Out-of-scope use
- The model is not intended for generating photorealistic raster images.
- It is not optimized for generating extremely complex artwork or production-ready brand assets without human review.
- It should not be used to produce misleading, infringing, or otherwise harmful imagery.
## How to Use

### Installation
```shell
# 1. Clone the repository
git clone https://github.com/gitcat-404/IntroSVG.git
cd IntroSVG

# 2. Create environment
conda create -n introsvg python=3.10 -y
conda activate introsvg

# 3. System dependency for cairosvg (Linux)
sudo apt update
sudo apt install libcairo2 libcairo2-dev

# 4. Python dependencies
pip install torch==2.5.1+cu124 torchvision==0.20.0+cu124 \
    --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```
### Download model weights

```shell
pip install huggingface_hub
hf download gitcat404/IntroSVG-Qwen2.5-VL-7B \
    --local-dir Models/IntroSVG-Qwen2.5-VL-7B
```
### Inference (recommended: lmdeploy server)

We recommend serving the model with lmdeploy for accelerated inference. Example with 4 GPUs:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server \
    "Models/IntroSVG-Qwen2.5-VL-7B" \
    --tp 4 \
    --server-port 23333
```
Then run the introspective inference loop on a CSV of prompts:

```shell
python inference_loop.py \
    --MODEL_NAME Models/IntroSVG-Qwen2.5-VL-7B \
    --CSV_FILE example/test.csv \
    --OUTPUT_DIR your_output_folder
```
An example prompt file is provided at `example/test.csv` in the GitHub repository — each row contains one text prompt for SVG generation.
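For a quick single-pass query against the served model, the lmdeploy `api_server` exposes an OpenAI-compatible chat-completions endpoint. A minimal request sketch (the port and model path match the serving command above; the payload shape follows the OpenAI convention and is our assumption, not repository code):

```python
import json
import urllib.request

# Build a single-pass chat-completions request for the lmdeploy server.
payload = {
    "model": "Models/IntroSVG-Qwen2.5-VL-7B",
    "messages": [
        {"role": "user", "content": "A minimalist red apple with a green leaf."}
    ],
    "max_tokens": 2048,
}
req = urllib.request.Request(
    "http://localhost:23333/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # requires the server to be running
# svg_code = json.loads(response.read())["choices"][0]["message"]["content"]
```

Note that a single request skips the introspective loop; use `inference_loop.py` for the full generate → critique → correct cycle.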
### Quick start with transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "gitcat404/IntroSVG-Qwen2.5-VL-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("gitcat404/IntroSVG-Qwen2.5-VL-7B")

prompt = "A minimalist red apple with a green leaf."
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
svg_code = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(svg_code)
```
> 💡 To unlock the full introspective refinement loop (generate → render → critique → correct), please use `inference_loop.py` from the official repository — it handles SVG rendering and feeds the rendered image back to the model in its critic role.
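When using the raw `transformers` path, the model's decoded output still has to be parsed for the SVG and rasterized before it can be critiqued. A small helper sketch (function names are ours, not the repository's):

```python
import re

def extract_svg(model_output: str):
    """Pull the first <svg>...</svg> block out of the model's decoded output."""
    match = re.search(r"<svg.*?</svg>", model_output, flags=re.DOTALL)
    return match.group(0) if match else None

def render_svg(svg_code: str, out_path: str = "render.png"):
    """Rasterize the SVG (needed to feed an image back to the critic role)."""
    import cairosvg  # requires the system Cairo library (see Installation)
    cairosvg.svg2png(bytestring=svg_code.encode("utf-8"), write_to=out_path)
```

The non-greedy regex grabs only the first SVG element, which suffices for single-object generations; outputs with multiple `<svg>` blocks would need more careful parsing.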
## Training

All experiments were conducted on 8 × NVIDIA A800 GPUs, using the LLaMA-Factory training pipeline.

Required artifacts:

- Base model: Qwen/Qwen2.5-VL-7B-Instruct
- Training data: SVG-1M-Json

Place the data under `LLaMA-Factory/data/` and launch training with:

```shell
sh train_sft.sh
```
For DPO and the full multi-stage recipe, please refer to the scripts and configs in the official repository.
## Limitations

- **Visual complexity ceiling.** Highly intricate scenes, dense compositions, or fine-grained textures remain difficult to express in SVG and may produce simplified outputs.
- **Text rendering.** Text inside SVGs can be imperfect (font substitution, kerning artifacts).
- **Latency.** The introspective loop trades inference time for quality; single-pass generation is faster but less polished.
- **Language coverage.** Training prompts are predominantly English; performance on other languages may degrade.
- **Rendering dependency.** The critic stage requires a working `cairosvg` / Cairo installation to rasterize intermediate SVGs.
## Citation

If you find IntroSVG useful in your research, please cite our paper:

```bibtex
@article{wang2026introsvg,
  title   = {IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation
             via an Introspective Generator-Critic Framework},
  author  = {Wang, Feiyu and Yang, Jiayuan and Zhao, Zhiyuan and Zhang, Da and
             Li, Bingyu and Liu, Peng and Gao, Junyu},
  journal = {arXiv preprint arXiv:2603.09312},
  year    = {2026}
}
```
## Acknowledgements
This work builds on the excellent open-source ecosystem around:
- Qwen2.5-VL — base vision-language model
- LLaMA-Factory — training framework
- lmdeploy — inference acceleration
- cairosvg — SVG rasterization
## License
This model is released under the Apache 2.0 license. Please ensure your use of the model also complies with the license terms of the underlying Qwen2.5-VL-7B-Instruct base model.