IntroSVG-Qwen2.5-VL-7B

Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework

Accepted by CVPR 2026 🎉

arXiv GitHub Dataset


Model Summary

IntroSVG-Qwen2.5-VL-7B is an end-to-end vision-language model that generates high-quality SVG (Scalable Vector Graphics) code directly from natural language descriptions. The model is fine-tuned from Qwen2.5-VL-7B-Instruct through a multi-stage training pipeline that combines supervised fine-tuning (SFT), curriculum learning, chain-of-thought (CoT) reasoning, and direct preference optimization (DPO).

The defining feature of IntroSVG is its introspective generator–critic framework: a single unified model alternates between two roles — generator (producing SVG code) and critic (rendering and evaluating its own output) — enabling an iterative generate → evaluate → refine loop at inference time.

Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-7B-Instruct |
| Parameters | ~7B |
| Architecture | Vision-Language Model (VLM) |
| Input modalities | Text prompts and rendered SVG images (during the critique stage) |
| Output modality | SVG source code |
| Training data | SVG-1M (custom corpus, ~1M samples) |
| Training paradigm | SFT → DPO with curriculum learning and CoT |
| License | Apache 2.0 |

Method Overview

The model is built through three core stages:

1. Data Construction

A mixed corpus is synthesized using an early-checkpoint model and a teacher VLM, comprising three subsets:

  • Direct generation ($\mathcal{D}_G^{\text{direct}}$) — text-to-SVG pairs
  • Correction ($\mathcal{D}_G^{\text{correction}}$) — flawed SVGs paired with refinements
  • Critique ($\mathcal{D}_C$) — rendered SVGs paired with critique feedback

2. Supervised Fine-Tuning (SFT)

A unified VLM is trained on the mixed dataset, simultaneously acquiring:

  • SVG generation capability
  • SVG critique capability

3. Direct Preference Optimization (DPO)

A teacher VLM scores generated preference pairs, which are used to further optimize the generator policy $M_{\text{Policy}}$ via the DPO loss.
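The paper's exact formulation may differ, but for reference, the standard DPO objective (Rafailov et al., 2023) minimized in such a stage, for a prompt $x$ with preferred/dispreferred SVG outputs $(y_w, y_l)$ and reference policy $\pi_{\text{ref}}$, is:

```latex
\mathcal{L}_{\text{DPO}}(\theta) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
  \right)
\right]
```

Here $\beta$ controls how far the policy $M_{\text{Policy}}$ may drift from the SFT reference model.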

Introspective Inference Loop

At inference time, the same model performs a closed-loop introspective process:

  1. Generate an initial SVG from the prompt.
  2. Switch to the critic role: render the SVG and evaluate it.
  3. Assign a quality score based on the critique.
  4. If unsatisfactory, use the critique to guide the next round of correction.

This loop allows the model to refine its outputs iteratively without any external evaluator.
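The four steps above can be sketched as a small control loop. Note that `generate_svg`, `critique`, and the score threshold below are hypothetical stubs standing in for the model's two roles; the real implementation lives in `inference_loop.py`.

```python
# Sketch of the introspective generate -> critique -> refine loop.
# Both functions are stubs: the real calls prompt the same model in its
# generator and critic roles respectively.

def generate_svg(prompt, feedback=None):
    # Stub: generator role, optionally conditioned on the previous critique.
    return "<svg>...</svg>" if feedback is None else "<svg>refined</svg>"

def critique(svg_code, prompt):
    # Stub: critic role -- render svg_code, then score it and give feedback.
    score = 0.9 if "refined" in svg_code else 0.4
    return score, "add the missing leaf"

def introspective_loop(prompt, max_rounds=3, threshold=0.8):
    feedback = None
    for _ in range(max_rounds):
        svg = generate_svg(prompt, feedback)       # 1. / 4. generate or correct
        score, feedback = critique(svg, prompt)    # 2. / 3. critique and score
        if score >= threshold:                     # stop once quality suffices
            break
    return svg, score

svg, score = introspective_loop("A minimalist red apple with a green leaf.")
```

With these stubs, the first round scores 0.4, triggering one correction round that passes the threshold.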

Intended Use

Primary use cases

  • Text-to-SVG generation for icons, simple illustrations, logos, diagrams, and UI elements
  • Programmatic vector graphics design as a creative co-pilot
  • Research on vision-language reasoning, code generation, and self-refinement methods

Out-of-scope use

  • The model is not intended for generating photorealistic raster images.
  • It is not optimized for generating extremely complex artwork or production-ready brand assets without human review.
  • It should not be used to produce misleading, infringing, or otherwise harmful imagery.

How to Use

Installation

# 1. Clone the repository
git clone https://github.com/gitcat-404/IntroSVG.git
cd IntroSVG

# 2. Create environment
conda create -n introsvg python=3.10 -y
conda activate introsvg

# 3. System dependency for cairosvg (Linux)
sudo apt update
sudo apt install libcairo2 libcairo2-dev

# 4. Python dependencies
pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 \
    --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

Download model weights

pip install huggingface_hub
hf download gitcat404/IntroSVG-Qwen2.5-VL-7B \
    --local-dir Models/IntroSVG-Qwen2.5-VL-7B

Inference (recommended: lmdeploy server)

We recommend serving the model with lmdeploy for accelerated inference. Example with 4 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server \
    "Models/IntroSVG-Qwen2.5-VL-7B" \
    --tp 4 \
    --server-port 23333
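The served model exposes an OpenAI-compatible API on the chosen port. As a minimal client sketch (the request construction is shown, with the actual POST left commented out so the snippet stays self-contained):

```python
import json

# Hypothetical client payload for the lmdeploy OpenAI-compatible endpoint
# started above; model name and port match the serve command.
payload = {
    "model": "Models/IntroSVG-Qwen2.5-VL-7B",
    "messages": [
        {"role": "user",
         "content": "Generate an SVG of a minimalist red apple with a green leaf."}
    ],
    "max_tokens": 2048,
}
body = json.dumps(payload)

# import requests
# resp = requests.post("http://localhost:23333/v1/chat/completions",
#                      data=body, headers={"Content-Type": "application/json"})
# svg_code = resp.json()["choices"][0]["message"]["content"]
```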

Then run the introspective inference loop on a CSV of prompts:

python inference_loop.py \
    --MODEL_NAME Models/IntroSVG-Qwen2.5-VL-7B \
    --CSV_FILE example/test.csv \
    --OUTPUT_DIR your_output_folder

An example prompt file is provided at example/test.csv in the GitHub repository — each row contains one text prompt for SVG generation.
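To illustrate the one-prompt-per-row format, here is a sketch of reading such a file; the column name `prompt` is an assumption, so check `example/test.csv` for the exact header.

```python
import csv
import io

# Hypothetical sample mirroring the one-prompt-per-row CSV layout;
# the real header may differ -- see example/test.csv in the repository.
sample = (
    "prompt\n"
    "A minimalist red apple with a green leaf.\n"
    "A blue sailboat icon.\n"
)
prompts = [row["prompt"] for row in csv.DictReader(io.StringIO(sample))]
```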

Quick start with transformers

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "gitcat404/IntroSVG-Qwen2.5-VL-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("gitcat404/IntroSVG-Qwen2.5-VL-7B")

prompt = "A minimalist red apple with a green leaf."
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
svg_code = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(svg_code)

💡 To unlock the full introspective refinement loop (generate → render → critique → correct), please use inference_loop.py from the official repository — it handles SVG rendering and feeds the rendered image back to the model in its critic role.

Training

All experiments were conducted on 8 × NVIDIA A800 GPUs, using the LLaMA-Factory training pipeline.

Place the training data (the SVG-1M subsets described above) under LLaMA-Factory/data/ and launch SFT with:

sh train_sft.sh

For DPO and the full multi-stage recipe, please refer to the scripts and configs in the official repository.

Limitations

  • Visual complexity ceiling. Highly intricate scenes, dense compositions, or fine-grained textures remain difficult to express in SVG and may produce simplified outputs.
  • Text rendering. Text inside SVGs can be imperfect (font substitution, kerning artifacts).
  • Latency. The introspective loop trades inference time for quality; single-pass generation is faster but less polished.
  • Language coverage. Training prompts are predominantly English; performance on other languages may degrade.
  • Rendering dependency. The critic stage requires a working cairosvg / Cairo installation to rasterize intermediate SVGs.

Citation

If you find IntroSVG useful in your research, please cite our paper:

@article{wang2026introsvg,
  title   = {IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation
             via an Introspective Generator-Critic Framework},
  author  = {Wang, Feiyu and Yang, Jiayuan and Zhao, Zhiyuan and Zhang, Da and
             Li, Bingyu and Liu, Peng and Gao, Junyu},
  journal = {arXiv preprint arXiv:2603.09312},
  year    = {2026}
}

Acknowledgements

This work builds on the excellent open-source ecosystem around Qwen2.5-VL, LLaMA-Factory, lmdeploy, and cairosvg.

License

This model is released under the Apache 2.0 license. Please ensure your use of the model also complies with the license terms of the underlying Qwen2.5-VL-7B-Instruct base model.
