pico-type
A tiny byte-level multi-head content classifier: ~1.5M params, ~9MB ONNX, <6ms inference.
Classifies any content along 7 dimensions (one per head) from raw bytes in a single forward pass.
Built by eulogik, AI infrastructure for developers.
Features
- No tokenizer: operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
- 7 heads, one forward pass: coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
- 4 Matryoshka tiers: tiny (16d) → small (64d) → base (192d) → pro (576d)
- ~9MB ONNX: self-contained single file; deploy on edge devices, serverless functions, or in the browser (WebAssembly)
- <6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
- CLI, Gradio Space, MCP server: ready for any integration
- 62 programming languages: Python, JS, TypeScript, Java, C, C++, Go, Rust, SQL, Bash, and 52 more
- 95.2% real-world accuracy: tested against 21 hand-curated inputs across all content types
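
To make the "no tokenizer" point concrete, here is a minimal sketch of how raw input could be turned into model-ready integer IDs. The helper name `to_byte_ids`, the zero padding, and the mask layout are assumptions for illustration only, not pico-type's actual preprocessing; the 1024-byte default is borrowed from the benchmark setting in the list above.

```python
def to_byte_ids(data, max_len=1024):
    """Map str or bytes to a fixed-length list of byte IDs plus a validity mask.

    Hypothetical helper: max_len and zero padding are illustrative assumptions.
    """
    if isinstance(data, str):
        data = data.encode("utf-8")  # text in any language becomes plain bytes
    raw = bytes(data[:max_len])      # truncate to the context window
    pad = max_len - len(raw)
    ids = list(raw) + [0] * pad          # every ID is an integer in 0..255
    mask = [1] * len(raw) + [0] * pad    # 1 = real byte, 0 = padding
    return ids, mask


ids, mask = to_byte_ids("héllo", max_len=8)
print(ids)   # [104, 195, 169, 108, 108, 111, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Because the vocabulary is just the 256 possible byte values, the same path handles source code, prose in any script, and binary file headers.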
Performance
| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 100% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | n/a |
| code_lang | 62 | 39% | 100% (9/9 code samples) |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | n/a |
| risk (mAP) | 6 | 100% | n/a |
Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.
Real-world accuracy: 95.2% (20/21). The model correctly classifies code, text, markup, config, images, binary archives, and error tracebacks. The only failure is a YAML config predicted as error (a fundamental byte-level ambiguity at 2KB context).
Quick Start
CLI
pip install pico-type
printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip
Python
from picotype import PicoType, PicoTypeConfig, decode_output
model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")
MCP Server (Claude/Cursor)
PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server
Browser Demo (No Install)
Try the in-browser demo at eulogik.github.io/pico-type/demo.html, which runs the full model via ONNX Runtime Web. No server needed.
Architecture
Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean⊕max⊕std) → 7×Matryoshka Heads
| Component | Description |
|---|---|
| ByteEmbed | nn.Embedding(256, 96): one learned 96-d vector per byte value, no tokenizer vocabulary |
| Conv1D | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
| BiAttention | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
| Pool | Mean + Max + Std concatenation over masked positions |
| Matryoshka Heads | 4 tier slices of the pooled vector → 7 linear classifiers |
Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
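
The Pool row in the table above can be illustrated with a small sketch: padding positions are dropped, then per-channel mean, max, and standard deviation are concatenated into one vector three times as wide as the input. This is plain Python for readability, the function name `masked_pool` is hypothetical, and the choice of population standard deviation is an assumption; the real model presumably does this on tensors.

```python
import math


def masked_pool(features, mask):
    """Concatenate per-channel mean, max, and std over unmasked positions.

    features: list of positions, each a list of channel values.
    mask: list of 0/1 flags, where 1 marks a real (non-padding) position.
    """
    valid = [row for row, keep in zip(features, mask) if keep]
    if not valid:
        raise ValueError("mask excludes every position")
    n = len(valid)
    channels = list(zip(*valid))  # transpose to per-channel tuples
    mean = [sum(ch) / n for ch in channels]
    peak = [max(ch) for ch in channels]
    # Population standard deviation (assumed; the model may use another estimator).
    std = [math.sqrt(sum((v - m) ** 2 for v in ch) / n)
           for ch, m in zip(channels, mean)]
    return mean + peak + std


features = [[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]]  # the last row is padding
print(masked_pool(features, mask=[1, 1, 0]))
# [2.0, 2.0, 3.0, 4.0, 1.0, 2.0]
```

Masking matters here: without it, the padding row would shift all three statistics for short inputs.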
Model Tiers
| Tier | Dim | Params | ONNX Size | Speed |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 8.7 MB | ~3ms |
| small | 64 | 1.45M | 8.7 MB | ~4ms |
| base | 192 | 1.48M | 8.8 MB | ~5ms |
| pro | 576 | 1.56M | 9.1 MB | ~12ms |
All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
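
One way to read "same trunk, different final layers" is that each tier's heads see only the first `dim` entries of the pooled vector. The sketch below assumes a 576-wide pooled vector (inferred from the pro tier) and that every head is a single linear layer with bias on that prefix; neither detail is confirmed here, and all function names are hypothetical. Class counts come from the tables in this README.

```python
HEAD_CLASSES = {  # class counts per head, as listed in this README
    "coarse": 12, "modality": 8, "subtype": 24, "code_lang": 62,
    "text_lang": 30, "file_mime": 90, "risk": 6,
}
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def tier_slice(pooled, tier):
    """Return the prefix of the pooled vector that a tier's heads consume."""
    return pooled[:TIER_DIMS[tier]]


def linear_head(x, weights, bias):
    """Plain linear classifier: one logit per class."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def head_params(tier):
    """Weights plus biases summed over all 7 heads (assumed layout)."""
    dim = TIER_DIMS[tier]
    return sum(dim * n + n for n in HEAD_CLASSES.values())


pooled = [0.01 * i for i in range(576)]  # stand-in for a real pooled vector
x = tier_slice(pooled, "tiny")
logits = linear_head(x, weights=[[1.0] * 16, [0.0] * 16], bias=[0.0, 0.5])
print(len(x), len(logits))  # 16 2

for tier in TIER_DIMS:
    print(tier, head_params(tier))
# tiny 3944, small 15080, base 44776, pro 133864
```

Under these assumptions the heads add roughly 4K to 134K parameters, which is the same order of magnitude as the gap between the tiny and pro totals in the table (1.43M vs 1.56M).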
Classification Heads
| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | none | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | none | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | none | api_key, jwt, password, email, phone, ssh_key (probabilities) |
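
The Gated By column can be read as a decoding rule: a gated head's prediction is only reported when the coarse head lands on one of its gate classes. A minimal sketch of that rule follows. The function name `decode_gated`, the dictionary layout, and the 0.5 risk threshold are illustrative assumptions, not the behaviour of the library's `decode_output`.

```python
GATES = {  # head -> coarse classes that enable it (from the table above)
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}


def decode_gated(predictions, risk_probs, risk_threshold=0.5):
    """Keep ungated heads, drop gated heads whose gate is closed, threshold risk.

    predictions: dict of head name -> predicted label (must include "coarse").
    risk_probs: dict of risk flag -> probability in [0, 1].
    """
    coarse = predictions["coarse"]
    result = {"coarse": coarse, "modality": predictions["modality"]}
    for head, allowed in GATES.items():
        if coarse in allowed and head in predictions:
            result[head] = predictions[head]
    # Risk flags are treated as independent probabilities, so several can fire.
    result["risk"] = sorted(k for k, p in risk_probs.items() if p >= risk_threshold)
    return result


raw = {"coarse": "code", "modality": "textual", "subtype": "json",
       "code_lang": "python", "text_lang": "en", "file_mime": "text/html"}
print(decode_gated(raw, {"api_key": 0.91, "email": 0.12, "jwt": 0.66}))
# {'coarse': 'code', 'modality': 'textual', 'code_lang': 'python', 'risk': ['api_key', 'jwt']}
```

Gating this way keeps irrelevant outputs, such as a text language guess for a Python file, out of the final result.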
Deployment
- PyPI: pip install pico-type
- GitHub: eulogik/pico-type
- HuggingFace Model: eulogik/pico-type
- Browser Demo: eulogik.github.io/pico-type/demo.html
- Zenodo: 10.5281/zenodo.20758542
Documentation
- Model Card: detailed architecture, training, evaluation
- Architecture Plan: full design document
- Walkthrough: development log with all decisions
License
Apache 2.0: free for commercial and personal use.