# nl-lokaal-middel

Medium-size Dutch PII NER model — 117M parameters, fine-tuned from RobBERT-2023 for high-quality redaction of personally identifiable information.
nl-lokaal-middel ("middel" = medium in Dutch) is a token-classification model that identifies personally identifiable information in Dutch text across 14 GDPR-relevant categories. It is the accuracy-first / teacher member of the LokaalHub Dutch PII family — paired with the smaller distilled nl-lokaal-klein for low-latency workloads.
## At a glance

| Property | Value |
|---|---|
| Base model | DTAI-KULeuven/robbert-2023-dutch-base |
| Parameters | 117M |
| Disk size | 473 MB (fp32) |
| Architecture | RoBERTa, 12 layers, hidden 768, 12 attn heads |
| Max sequence | 512 tokens (trained at 384) |
| Language | Dutch (nl) |
| Task | Token classification (BIO, 47 labels) |
| License | Apache-2.0 |
| Training data | ai4privacy/pii-masking-300k (Dutch subset) + Dutch open-source NER corpora |
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "LokaalHub/nl-lokaal-middel"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Mijn naam is Jan van der Berg, BSN 123456782. Bel me op 06-12345678."
for span in ner(text):
    print(f"{span['entity_group']:14} {span['word']!r:30} score={span['score']:.2f}")
```
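The pipeline output includes character offsets (`start`/`end`), so a redaction pass can be built directly on top of it. A minimal sketch: the `redact` helper is our own illustration, and the spans below are hand-written stand-ins for real pipeline output, not actual model predictions.

```python
def redact(text, spans, fmt="[{label}]"):
    """Replace detected PII spans with their category tag.

    `spans` is the shape produced by a transformers token-classification
    pipeline with aggregation_strategy="simple": dicts carrying
    'entity_group', 'start', 'end', and 'score'. Spans are applied
    right-to-left so earlier character offsets stay valid.
    """
    out = text
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[: s["start"]] + fmt.format(label=s["entity_group"]) + out[s["end"] :]
    return out

# Hand-written example spans for the quick-start sentence (illustrative):
text = "Mijn naam is Jan van der Berg, BSN 123456782."
spans = [
    {"entity_group": "PERSON", "start": 13, "end": 29, "score": 0.99},
    {"entity_group": "BSN", "start": 35, "end": 44, "score": 0.98},
]
print(redact(text, spans))
# → Mijn naam is [PERSON], BSN [BSN].
```

Replacing right-to-left avoids recomputing offsets after each substitution; any other order would shift the remaining spans.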
## When to choose nl-lokaal-middel vs nl-lokaal-klein

| | nl-lokaal-klein | nl-lokaal-middel (this model) |
|---|---|---|
| Parameters | 46M | 117M |
| Disk size | 177 MB | 473 MB |
| Typical throughput on CPU | ~200 tokens/s | ~70 tokens/s |
| Accuracy (Dutch 300k val, raw F1) | 0.7790 | 0.8435 |
| Use when | Edge / on-device / low latency required | Batch / server processing / accuracy-first |

Both share the same 14 output categories, so you can swap between them without changing downstream code.
## Design choices

### 1. 14 merged PII categories
nl-lokaal-middel predicts 23 entity types internally; 14 of them appear in the standard Dutch PII evaluation gold:
PERSON, BSN, IBAN, EMAIL, PHONE, USERNAME, PASSWORD, ADDRESS, CITY, POSTAL_CODE, PASSPORT, DRIVER_LICENSE, DATE_OF_BIRTH, CREDIT_CARD.
Additional categories trained but rarely annotated in public benchmarks: AGE, BTW (Dutch VAT), KVK (Chamber of Commerce), LICENSE_PLATE, ORGANIZATION, TECHNOLOGY, URL, DATE, IP_ADDRESS.
This merged scheme (e.g., combining FIRSTNAME + MIDDLENAME + LASTNAME + PREFIX → PERSON, STREET + BUILDINGNUMBER → ADDRESS) is chosen to match real redaction targets. See nl-lokaal-klein's card for full mapping tables — the schemes are identical.
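The merge step can be sketched as a tag-level mapping. This is a partial illustration covering only the two examples named above; the full mapping tables live in the nl-lokaal-klein card.

```python
# Partial label merge (only the two examples named above; the full
# mapping is documented in the nl-lokaal-klein model card).
MERGE = {
    "FIRSTNAME": "PERSON",
    "MIDDLENAME": "PERSON",
    "LASTNAME": "PERSON",
    "PREFIX": "PERSON",
    "STREET": "ADDRESS",
    "BUILDINGNUMBER": "ADDRESS",
}

def merge_bio(tag: str) -> str:
    """Map a fine-grained BIO tag to its merged category,
    e.g. B-FIRSTNAME -> B-PERSON. Unmapped labels pass through."""
    if tag == "O":
        return tag
    prefix, _, label = tag.partition("-")
    return f"{prefix}-{MERGE.get(label, label)}"

print(merge_bio("B-FIRSTNAME"))  # → B-PERSON
print(merge_bio("I-STREET"))     # → I-ADDRESS
```

Note that a real merge must also repair BIO continuity afterwards: a `B-` tag that now follows a same-category merged span (e.g. `B-BUILDINGNUMBER` right after `STREET`) should become `I-` so the merged span stays contiguous.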
### 2. RobBERT-2023 base, not multilingual
Built on RobBERT-2023, a Dutch-native RoBERTa trained on OSCAR-2023 (Dutch subset). This consistently outperforms multilingual bases on Dutch PII in our experiments. The BPE vocabulary is tuned for Dutch orthography, including compound words and diacritics.
### 3. Used as the distillation teacher for nl-lokaal-klein
Beyond its direct use, nl-lokaal-middel generated soft-label supervision for the smaller nl-lokaal-klein student.
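The exact distillation recipe is not published here; the following is a generic sketch of soft-label distillation for token classification, with the temperature `T` and mixing weight `alpha` as assumed values, not the released training configuration.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    temperature-softened distribution. T and alpha are illustrative
    assumptions. Logit shapes: (batch, seq_len, num_labels);
    labels: (batch, seq_len), with -100 marking ignored positions."""
    ce = F.cross_entropy(
        student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to match the hard-label term
    return alpha * ce + (1 - alpha) * kl

# Random tensors standing in for real model outputs:
student = torch.randn(2, 8, 47)   # 47 BIO labels (23 entity types + O)
teacher = torch.randn(2, 8, 47)
labels = torch.randint(0, 47, (2, 8))
loss = distill_loss(student, teacher, labels)
print(float(loss))
```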
## Evaluation

All numbers use seqeval strict (IOB2 scheme). Raw model predictions — no post-processing.

### Primary result — in-distribution (ai4privacy 300k validation, Dutch)
| Model | Params | F1 | Precision | Recall |
|---|---|---|---|---|
| nl-lokaal-middel (this model) | 117M | 0.8435 | 0.8070 | 0.8834 |
| nl-lokaal-middel + filenthropist rule layer | 117M | 0.8386 | 0.8058 | 0.8742 |
| nl-lokaal-klein (student) | 46M | 0.7790 | 0.7689 | 0.7895 |
In-distribution = trained on pii-masking-300k train, evaluated on validation (7,457 Dutch rows, 47,638 gold entities after 14-category merge — never seen during training).
### Related work
The closest comparable open Dutch PII model is OpenMed/OpenMed-PII-Dutch-BioClinicalBERT-Base-110M-v1 (110M params, Apache-2.0), trained on ai4privacy/pii-masking-400k with a 54-label fine-grained scheme. It reports F1 0.8401 on its own 400k held-out benchmark. A direct head-to-head isn't scientifically meaningful — different test sets, different label taxonomies (54 fine-grained vs our 14 merged), different boundary conventions — but nl-lokaal-middel reaches 0.8435 on a comparable Dutch PII held-out set, suggesting parity at this model size.
### Per-category breakdown (300k validation, raw model, nl-lokaal-middel)
| Category | Support | Precision | Recall | F1 |
|---|---|---|---|---|
| | 2,540 | 0.9356 | 0.9717 | 0.9533 |
| IP_ADDRESS | 2,199 | 0.8715 | 0.9345 | 0.9019 |
| DRIVER_LICENSE | 2,429 | 0.8670 | 0.9337 | 0.8991 |
| BSN | 2,439 | 0.8545 | 0.9463 | 0.8981 |
| CITY | 5,141 | 0.8496 | 0.9167 | 0.8819 |
| USERNAME | 2,571 | 0.8732 | 0.8891 | 0.8811 |
| POSTAL_CODE | 1,807 | 0.8337 | 0.9242 | 0.8766 |
| PHONE | 1,932 | 0.8245 | 0.9022 | 0.8616 |
| PASSPORT | 4,540 | 0.8107 | 0.9044 | 0.8550 |
| PASSWORD | 1,443 | 0.7970 | 0.8898 | 0.8409 |
| ADDRESS | 4,517 | 0.7761 | 0.8656 | 0.8184 |
| PERSON | 8,673 | 0.7811 | 0.8193 | 0.7997 |
| DATE | 5,242 | 0.7331 | 0.8615 | 0.7921 |
| DATE_OF_BIRTH | 2,165 | 0.6393 | 0.7630 | 0.6957 |
| micro avg | 47,638 | 0.8070 | 0.8834 | 0.8435 |
| macro avg | 47,638 | 0.8176 | 0.8944 | 0.8540 |
All 14 categories score F1 ≥ 0.69. The high-recall profile (0.89 macro recall) makes this a good choice as a first-pass PII detector where missing an entity is more costly than flagging an extra one.
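If precision matters more for a given deployment, one simple lever is a per-span confidence threshold on the pipeline output. A sketch: the threshold value and the always-keep set below are illustrative choices on our part, not part of the released model.

```python
def filter_spans(spans, min_score=0.5, keep_always=("BSN", "CREDIT_CARD")):
    """Drop low-confidence spans to trade recall for precision, but never
    drop high-risk categories. min_score and keep_always are illustrative
    deployment choices, not model defaults."""
    return [
        s for s in spans
        if s["score"] >= min_score or s["entity_group"] in keep_always
    ]

# Stand-in pipeline output (illustrative, not real predictions):
spans = [
    {"entity_group": "PERSON", "score": 0.91, "word": "Jan"},
    {"entity_group": "DATE", "score": 0.34, "word": "gisteren"},
    {"entity_group": "BSN", "score": 0.41, "word": "123456782"},
]
print([s["entity_group"] for s in filter_spans(spans)])
# → ['PERSON', 'BSN']
```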
### How to reproduce

```shell
pip install datasets transformers seqeval
python compare_models.py --dataset 300k
```

Label-mapping tables used by the script match the "Design choices" section above.
## Training procedure

### Data

Identical data mix to nl-lokaal-klein — see its card for the full table. Briefly:
- 20% ai4privacy Dutch real spans
- 28% teacher pseudo-labels (self-distillation on unlabeled Dutch)
- 32% synthetic + LLM-generated (structured forms, clean prose)
- ~20% Dutch open-source NER corpora (WikiNEuRal, MultiNERD, Gretel, Careons)
Total: ~40K samples, 25% entity-replacement augmentation.
### Hyperparameter search — our own autoresearch loop
Hyperparameters were not hand-tuned. We used an in-house autoresearch agent that iterates on the config, trains, evaluates on a held-out benchmark, and either keeps or reverts each change — all autonomously. Over 100+ experiments explored learning rate, epochs, sequence length, label smoothing, B-tag weight, data mix ratios, augmentation ratio, and loss variants.
The pattern is inspired by Andrej Karpathy's minimal-loop approach to ML research — small, readable code, fast iteration, measured decisions.
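In outline, the loop is a greedy keep-or-revert search. All names below are hypothetical (the in-house agent itself is not released), and the "training run" is a toy stand-in function.

```python
import copy
import random

def autoresearch(base_config, proposals, train_and_eval, n_steps=10):
    """Greedy keep-or-revert loop: apply one proposed config change,
    retrain, and keep the change only if held-out F1 improves;
    otherwise revert. A sketch of the pattern, not the actual agent."""
    best_cfg = copy.deepcopy(base_config)
    best_f1 = train_and_eval(best_cfg)
    for _ in range(n_steps):
        candidate = copy.deepcopy(best_cfg)
        random.choice(proposals)(candidate)      # mutate one hyperparameter
        f1 = train_and_eval(candidate)
        if f1 > best_f1:                         # keep the change...
            best_cfg, best_f1 = candidate, f1    # ...else implicitly revert
    return best_cfg, best_f1

# Toy stand-in for a real training run: score peaks at lr = 2e-5.
def fake_train_and_eval(cfg):
    return 1.0 - abs(cfg["lr"] - 2e-5) * 1e4

def lower_lr(cfg):
    cfg["lr"] *= 0.8

def raise_lr(cfg):
    cfg["lr"] *= 1.25

random.seed(0)
best_cfg, best_f1 = autoresearch({"lr": 5e-5}, [lower_lr, raise_lr], fake_train_and_eval)
print(round(best_f1, 3))
```

Because a change is only kept when the held-out score improves, the score is monotonically non-decreasing across experiments, which is what makes the fully autonomous loop safe to run unattended.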
### Hyperparameters

| Setting | Value |
|---|---|
| Optimizer | AdamW, weight decay 0.02 |
| Learning rate | 2.0e-5, cosine schedule, 10% warmup |
| Epochs | 3 |
| Batch size | 16 × 2 gradient accumulation = 32 effective |
| Max sequence length | 384 |
| Label smoothing | 0.0 |
| B-tag boundary weight | 2.0× |
| FP16 | Enabled |
| Seed | 42 |
| Hardware | Apple Silicon (MPS) |
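The table maps onto a training config roughly as follows. Field names follow Hugging Face `TrainingArguments` conventions where one exists; this is a sketch, not the released training script.

```python
# Hyperparameters from the table above as a plain config dict.
# Field names follow Hugging Face TrainingArguments conventions
# where applicable; this is a sketch, not the released script.
config = {
    "learning_rate": 2.0e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,   # 16 × 2 = 32 effective
    "weight_decay": 0.02,
    "max_seq_length": 384,
    "label_smoothing_factor": 0.0,
    "b_tag_boundary_weight": 2.0,       # custom weighted-CE term, not a
                                        # standard TrainingArguments field
    "fp16": True,
    "seed": 42,
}

effective_batch = (
    config["per_device_train_batch_size"] * config["gradient_accumulation_steps"]
)
print(effective_batch)  # → 32
```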
## Intended use

### In scope
- High-assurance GDPR redaction / pseudonymization of Dutch text where accuracy matters more than latency.
- Teacher model for further distillation or fine-tuning into smaller variants.
- Dutch PII research as a strong open baseline.
### Out of scope
Same as nl-lokaal-klein:
- Languages other than Dutch
- Legal anonymization (detection ≠ removal + k-anonymity)
- Fine-grained sub-type distinctions (first name vs last name) — intentionally merged
- PII categories not in the trained 23 (custom corporate IDs, biometric descriptors, etc.)
## Limitations & biases

- Boundary conventions follow ai4privacy 300k. Datasets with different entity-splitting conventions (e.g., separate `STREET` + `BUILDINGNUMBER` entities vs our merged `ADDRESS`) will score lower under strict evaluation even when the model is qualitatively correct.
- Dutch gazetteer coverage reflects CBS and open-source name lists; recall on immigrant-origin names may be below average.
- Synthetic-data bias in the training mix toward form-like text.
- Single-model caution: at 117M params this is still a moderately sized model. For mission-critical redaction, ensemble it with rule-based backstops (see filenthropist's production pipeline).
## Ethical and legal considerations
Same as nl-lokaal-klein:
- Detection ≠ removal or anonymization. Operator retains legal responsibility under GDPR (Reg. (EU) 2016/679), UAVG, and EU AI Act (Reg. (EU) 2024/1689).
- Keep human review in the loop for legally consequential redactions.
- No external data transmission when run locally.
## Attribution & citation

Base model: RobBERT-2023-dutch-base by DTAI-KULeuven (MIT license).
Training data: ai4privacy/pii-masking-300k (CC-BY-4.0) plus Dutch open-source NER corpora.

```bibtex
@misc{nl_lokaal_middel_2026,
  title     = {nl-lokaal-middel: A Dutch PII NER Teacher Model},
  author    = {LokaalHub},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/LokaalHub/nl-lokaal-middel}
}

@inproceedings{delobelle2020robbert,
  title     = {{RobBERT}: a {Dutch} {RoBERTa}-based Language Model},
  author    = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020}
}
```
## Changelog

- v1.0 — 2026-04-19 — Initial release. Teacher checkpoint used for nl-lokaal-klein distillation.

Built in the Netherlands — optimized for Dutch privacy law, trained on Dutch data, shipped under Apache-2.0.