distill-structure

A fine-tuned Qwen3.5-2B model for HTML structure analysis — given a compact DOM representation of a web page, it identifies the logical sections and outputs structured JSON.

What it does

Takes a cleaned, heading-stripped HTML page and returns a JSON array describing its sections:

[
  {
    "title": "Main News Feed Content",
    "start_text": "1. Canada's bill C-22 mandates...",
    "content_type": "article",
    "assets": [{"type": "link", "value": "Canada's bill C-22..."}]
  },
  {
    "title": "Site Footer Navigation",
    "start_text": "Guidelines | FAQ | Lists",
    "content_type": "footer",
    "assets": []
  }
]

Use case

This model powers the StructureAgent inside the distill pipeline — it handles pages with no heading tags where rule-based sectioning fails. The model is trained to recover section structure that headings would normally provide.
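The fallback decision described above can be gated very simply; a minimal sketch, assuming the pipeline routes a page to the model only when no heading tags are present (the actual StructureAgent logic is not published here):

```python
import re

# Hypothetical gating check: use the model only when the rule-based
# sectioner has no <h1>-<h6> tags to anchor on.
HEADING_RE = re.compile(r"<h[1-6][\s>]", re.IGNORECASE)

def needs_model_sectioning(html: str) -> bool:
    """Return True when the page contains no heading tags."""
    return HEADING_RE.search(html) is None
```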

Training

  • Base model: Qwen/Qwen3.5-2B
  • Method: LoRA fine-tuning (r=32, α=64) via TRL SFTTrainer
  • Dataset: ~3,455 training / 384 eval examples generated from heading-rich web pages (headings stripped and used as labels)
  • Epochs: 3 — Train loss: 1.009 — Token accuracy: 80.5%
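The training setup can be sketched roughly as below. Only r=32, α=64, the base model, the epoch count, and the use of TRL's SFTTrainer come from this card; target_modules and all other hyperparameters are illustrative assumptions:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# r and lora_alpha are from the card; target_modules is an assumption.
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)

trainer = SFTTrainer(
    model="Qwen/Qwen3.5-2B",
    args=SFTConfig(num_train_epochs=3, output_dir="distill-structure"),
    train_dataset=train_ds,   # ~3,455 chat-formatted examples (not shown)
    eval_dataset=eval_ds,     # 384 examples (not shown)
    peft_config=peft_config,
)
trainer.train()
```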

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

model_id = "nahidstaq/distill-structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

SYSTEM = (
    "You are an HTML structure analyzer. Given a compact DOM representation "
    "of a web page (with headings removed), identify the logical sections. "
    "Output a JSON array of sections, each with title, start_text, content_type, and assets fields."
)

def analyze(page_title: str, compact_dom: str) -> list[dict]:
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Page: {page_title}\n\n{compact_dom}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=512, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    raw = tokenizer.decode(ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return json.loads(raw)
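Greedy decoding usually returns a bare JSON array, but if the model ever wraps it in a code fence or stray text, a defensive parser helps. This helper is not part of the model card, just a hedged convenience:

```python
import json
import re

def parse_sections(raw: str) -> list[dict]:
    """Parse model output into a list of section dicts.

    Falls back to extracting the first [...] span when the output
    is not bare JSON (e.g. wrapped in a ``` fence).
    """
    raw = raw.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\[.*\]", raw, re.DOTALL)
        if match is None:
            raise
        return json.loads(match.group(0))
```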

Output fields

Field          Description
title          Short descriptive section title
start_text     First ~50 characters of the section's text (for anchoring)
content_type   One of: article, list, hero, navigation, footer, table, faq, other
assets         Extracted links, images, or list items relevant to the section
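Downstream code can filter on these fields, for example to drop navigation and footer chrome and pull link values out of a section. A small sketch using the field names above:

```python
CHROME = {"navigation", "footer"}

def main_sections(sections: list[dict]) -> list[dict]:
    """Keep only sections that are not page chrome."""
    return [s for s in sections if s.get("content_type") not in CHROME]

def section_links(section: dict) -> list[str]:
    """Collect the values of link-type assets in a section."""
    return [a["value"] for a in section.get("assets", [])
            if a.get("type") == "link"]
```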

Limitations

  • Works best on English pages
  • Table-heavy layouts (e.g. nested <td>) may collapse into fewer sections
  • content_type classification skews toward other for ambiguous sections