pico-type

A tiny byte-level multi-head content classifier: ~1.5M params, ~9MB ONNX, <6ms inference.

Classifies any content across 7 classification heads from raw bytes in a single forward pass.

Built by eulogik: AI infrastructure for developers.


✨ Features

  • No tokenizer: operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
  • 7 heads, one forward pass: coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
  • 4 Matryoshka tiers: tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~9MB ONNX: self-contained single file; deploys on edge devices, serverless functions, and the browser (WebAssembly)
  • <6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
  • CLI, Gradio Space, MCP server: ready for any integration
  • 62 programming languages: Python, JS, TypeScript, Java, C, C++, Go, Rust, SQL, Bash, and 52 more
  • 95.2% real-world accuracy: 20 of 21 hand-curated inputs across all content types classified correctly
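
The "no tokenizer" point means the model's input IDs are just byte values. A minimal sketch of that preprocessing follows. The helper name `encode_bytes`, the 2048-byte limit (taken from the 2KB context mentioned in the evaluation notes), and the zero-padding-plus-mask scheme are illustrative assumptions, not the package's actual API.

```python
def encode_bytes(data: bytes, max_len: int = 2048) -> tuple[list[int], list[int]]:
    """Map raw bytes to integer IDs in [0, 255] plus a padding mask.

    Hypothetical helper: max_len and zero-padding are assumptions.
    """
    clipped = data[:max_len]  # truncate inputs longer than the context
    ids = list(clipped)  # iterating over bytes yields ints 0..255
    mask = [1] * len(ids)  # 1 = real byte, 0 = padding
    pad = max_len - len(ids)
    return ids + [0] * pad, mask + [0] * pad


ids, mask = encode_bytes("héllo".encode("utf-8"), max_len=8)
print(ids)   # [104, 195, 169, 108, 108, 111, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The mask is what distinguishes padding from a genuine 0x00 byte, which matters for binary inputs.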

📊 Performance

| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 100% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | n/a |
| code_lang | 62 | 39% | 100% (9/9 code samples) |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | n/a |
| risk (mAP) | 6 | 100% | n/a |

Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.

Real-world accuracy: 95.2% (20/21). The model correctly classifies code, text, markup, config, images, binary archives, and error tracebacks. The only failure is a YAML config predicted as error (a fundamental byte-level ambiguity at 2KB context).

🚀 Quick Start

CLI

pip install pico-type

# printf expands \n portably; plain echo may print it literally
printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

from picotype import PicoType, PicoTypeConfig, decode_output

model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

Browser Demo (No Install)

Try the in-browser demo at eulogik.github.io/pico-type/demo.html, which runs the full model via ONNX Runtime Web. No server needed.

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads

| Component | Description |
|---|---|
| ByteEmbed | nn.Embedding(256, 96): tokenizer-free byte embedding, one learned 96-d vector per byte value |
| Conv1D | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
| BiAttention | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
| Pool | Mean + Max + Std concatenation over masked positions |
| Matryoshka Heads | 4 tier slices of the pooled vector → 7 linear classifiers |

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
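
The Pool step listed above can be sketched in plain Python as follows. This is an illustration, not the model's code: the function name `masked_pool` is hypothetical, and the use of population standard deviation is an assumption, since the source does not say which estimator the model uses.

```python
import math


def masked_pool(features: list[list[float]], mask: list[int]) -> list[float]:
    """Concatenate per-channel mean, max, and std over unmasked positions.

    features: one vector per byte position; mask: 1 = real byte, 0 = padding.
    Returns a vector three times as wide as each input vector.
    """
    kept = [vec for vec, m in zip(features, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    n = len(kept)
    channels = list(zip(*kept))  # transpose: one tuple of values per channel
    mean = [sum(ch) / n for ch in channels]
    peak = [max(ch) for ch in channels]
    # Population standard deviation (assumption; the model may differ).
    std = [
        math.sqrt(sum((x - mu) ** 2 for x in ch) / n)
        for ch, mu in zip(channels, mean)
    ]
    return mean + peak + std


feats = [[1.0, 0.0], [3.0, 4.0], [99.0, 99.0]]  # last position is padding
print(masked_pool(feats, [1, 1, 0]))  # [2.0, 2.0, 3.0, 4.0, 1.0, 2.0]
```

Because padded positions are dropped before any statistic is computed, the padding value never leaks into the pooled vector.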

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size | Latency |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 8.7 MB | ~3ms |
| small | 64 | 1.45M | 8.7 MB | ~4ms |
| base | 192 | 1.48M | 8.8 MB | ~5ms |
| pro | 576 | 1.56M | 9.1 MB | ~12ms |

All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
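
One way to picture the tier scheme is as prefix slicing: each tier scores classes from only the first N features of a shared representation. The sketch below assumes nested prefixes (the usual Matryoshka arrangement) of a 576-d vector, and uses random stand-in weights. `tier_logits` and `TIER_DIMS` are hypothetical names, not part of the package.

```python
import random

TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}  # from the tier table


def tier_logits(vector, weights, bias, tier):
    """Score classes using only the first TIER_DIMS[tier] features.

    weights: one row per class, each row as wide as the tier.
    """
    d = TIER_DIMS[tier]
    if any(len(row) != d for row in weights):
        raise ValueError(f"each weight row must have {d} entries for tier {tier!r}")
    prefix = vector[:d]  # assumption: tiers are nested prefixes of one vector
    return [sum(w * x for w, x in zip(row, prefix)) + b for row, b in zip(weights, bias)]


random.seed(0)
vector = [random.gauss(0, 1) for _ in range(576)]  # stand-in for the trunk output
for tier, d in TIER_DIMS.items():
    # Hypothetical per-tier classifier with 12 classes (the size of the coarse head).
    weights = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(12)]
    logits = tier_logits(vector, weights, [0.0] * 12, tier)
    print(tier, d, max(range(12), key=logits.__getitem__))
```

Under this reading, the trunk runs once and only the width of the final matrix multiply changes per tier, which would match the small parameter differences in the table.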

🧪 Classification Heads

| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | none | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | none | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | none | api_key, jwt, password, email, phone, ssh_key (probabilities) |
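
The "Gated By" column suggests that some heads are only meaningful for certain coarse labels. A minimal sketch of that gating logic is shown here. The `GATES` mapping is copied from the table, but the function `apply_gates` and the choice to blank out a gated head with `None` are assumptions; the behaviour of the real `decode_output` may differ.

```python
GATES = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}


def apply_gates(raw: dict) -> dict:
    """Keep a gated head's prediction only when the coarse label enables it."""
    coarse = raw["coarse"]
    out = {}
    for head, value in raw.items():
        allowed = GATES.get(head)  # None means the head is always reported
        out[head] = value if allowed is None or coarse in allowed else None
    return out


raw = {
    "coarse": "code",
    "modality": "textual",
    "subtype": "json",
    "code_lang": "python",
    "text_lang": "en",
    "file_mime": "text/html",
    "risk": {"api_key": 0.02},
}
print(apply_gates(raw))
```

For this input, `code_lang` is the only gated head that survives; `coarse`, `modality`, and `risk` are ungated and always pass through.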

🌐 Deployment

PyPI · GitHub · HuggingFace Model · Browser Demo · Zenodo

📚 Documentation

📄 License

Apache 2.0: free for commercial and personal use.


Built with ❤️ by eulogik