# nl-lokaal-middel

Medium-size Dutch PII NER model — 117M parameters, fine-tuned from RobBERT-2023 for high-quality redaction of personally identifiable information.
nl-lokaal-middel ("middel" = medium in Dutch) is a token-classification model that identifies personally identifiable information in Dutch text across 14 GDPR-relevant categories. It is the accuracy-first / teacher member of the LokaalHub Dutch PII family — paired with the smaller distilled nl-lokaal-klein for low-latency workloads.
## At a glance

| Property | Value |
|---|---|
| Base model | DTAI-KULeuven/robbert-2023-dutch-base |
| Parameters | 117M |
| Disk size | 473 MB (fp32) |
| Architecture | RoBERTa, 12 layers, hidden 768, 12 attn heads |
| Max sequence | 512 tokens (trained at 384) |
| Language | Dutch (nl) |
| Task | Token classification (BIO, 47 labels) |
| License | Apache-2.0 |
| Training data | ai4privacy/pii-masking-300k (Dutch subset) + Dutch open-source NER corpora |
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "LokaalHub/nl-lokaal-middel"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Mijn naam is Jan van der Berg, BSN 123456782. Bel me op 06-12345678."
for span in ner(text):
    print(f"{span['entity_group']:14} {span['word']!r:30} score={span['score']:.2f}")
```
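The pipeline output includes character offsets (`start`/`end`), so a redaction pass can be built directly on top of it. A minimal sketch: the `redact` helper is our own illustration, and the spans below are hand-written stand-ins for real pipeline output, not actual model predictions.

```python
def redact(text, spans, fmt="[{label}]"):
    """Replace detected PII spans with their category tag.

    `spans` is the shape produced by a transformers token-classification
    pipeline with aggregation_strategy="simple": dicts carrying
    'entity_group', 'start', 'end', and 'score'. Spans are applied
    right-to-left so earlier character offsets stay valid.
    """
    out = text
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[: s["start"]] + fmt.format(label=s["entity_group"]) + out[s["end"] :]
    return out

# Hand-written example spans for the quick-start sentence (illustrative):
text = "Mijn naam is Jan van der Berg, BSN 123456782."
spans = [
    {"entity_group": "PERSON", "start": 13, "end": 29, "score": 0.99},
    {"entity_group": "BSN", "start": 35, "end": 44, "score": 0.98},
]
print(redact(text, spans))
# → Mijn naam is [PERSON], BSN [BSN].
```

Replacing right-to-left avoids recomputing offsets after each substitution; any other order would shift the remaining spans.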
## When to choose nl-lokaal-middel vs nl-lokaal-klein

| | nl-lokaal-klein | nl-lokaal-middel (this model) |
|---|---|---|
| Parameters | 46M | 117M |
| Disk size | 177 MB | 473 MB |
| Typical throughput on CPU | ~200 tokens/s | ~70 tokens/s |
| Accuracy (Dutch 300k val, raw F1) | 0.7790 | 0.8435 |
| Use when | Edge / on-device / low latency required | Batch / server processing / accuracy-first |

Both share the same 14 output categories, so you can swap between them without changing downstream code.
## Design choices

### 1. 14 merged PII categories
nl-lokaal-middel predicts 23 entity types internally; 14 of them appear in the standard Dutch PII evaluation gold:
PERSON, BSN, IBAN, EMAIL, PHONE, USERNAME, PASSWORD, ADDRESS, CITY, POSTAL_CODE, PASSPORT, DRIVER_LICENSE, DATE_OF_BIRTH, CREDIT_CARD.
Additional categories trained but rarely annotated in public benchmarks: AGE, BTW (Dutch VAT), KVK (Chamber of Commerce), LICENSE_PLATE, ORGANIZATION, TECHNOLOGY, URL, DATE, IP_ADDRESS.
This merged scheme (e.g., combining FIRSTNAME + MIDDLENAME + LASTNAME + PREFIX → PERSON, STREET + BUILDINGNUMBER → ADDRESS) is chosen to match real redaction targets. See nl-lokaal-klein's card for full mapping tables — the schemes are identical.
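The merge step can be sketched as a tag-level mapping. This is a partial illustration covering only the two examples named above; the full mapping tables live in the nl-lokaal-klein card.

```python
# Partial label merge (only the two examples named above; the full
# mapping is documented in the nl-lokaal-klein model card).
MERGE = {
    "FIRSTNAME": "PERSON",
    "MIDDLENAME": "PERSON",
    "LASTNAME": "PERSON",
    "PREFIX": "PERSON",
    "STREET": "ADDRESS",
    "BUILDINGNUMBER": "ADDRESS",
}

def merge_bio(tag: str) -> str:
    """Map a fine-grained BIO tag to its merged category,
    e.g. B-FIRSTNAME -> B-PERSON. Unmapped labels pass through."""
    if tag == "O":
        return tag
    prefix, _, label = tag.partition("-")
    return f"{prefix}-{MERGE.get(label, label)}"

print(merge_bio("B-FIRSTNAME"))  # → B-PERSON
print(merge_bio("I-STREET"))     # → I-ADDRESS
```

Note that a real merge must also repair BIO continuity afterwards: a `B-` tag that now follows a same-category merged span (e.g. `B-BUILDINGNUMBER` right after `STREET`) should become `I-` so the merged span stays contiguous.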
### 2. RobBERT-2023 base, not multilingual
Built on RobBERT-2023, a Dutch-native RoBERTa trained on OSCAR-2023 (Dutch subset). This consistently outperforms multilingual bases on Dutch PII in our experiments. The BPE vocabulary is tuned for Dutch orthography, including compound words and diacritics.
### 3. Used as the distillation teacher for nl-lokaal-klein
Beyond its direct use, nl-lokaal-middel generated soft-label supervision for the smaller nl-lokaal-klein student.
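The exact distillation recipe is not published here; the following is a generic sketch of soft-label distillation for token classification, with the temperature `T` and mixing weight `alpha` as assumed values, not the released training configuration.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    temperature-softened distribution. T and alpha are illustrative
    assumptions. Logit shapes: (batch, seq_len, num_labels);
    labels: (batch, seq_len), with -100 marking ignored positions."""
    ce = F.cross_entropy(
        student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to match the hard-label term
    return alpha * ce + (1 - alpha) * kl

# Random tensors standing in for real model outputs:
student = torch.randn(2, 8, 47)   # 47 BIO labels (23 entity types + O)
teacher = torch.randn(2, 8, 47)
labels = torch.randint(0, 47, (2, 8))
loss = distill_loss(student, teacher, labels)
print(float(loss))
```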
## Evaluation

All numbers use seqeval strict (IOB2 scheme). Raw model predictions — no post-processing.

### Primary result — in-distribution (ai4privacy 300k validation, Dutch)
| Model | Params | F1 | Precision | Recall |
|---|---|---|---|---|
| nl-lokaal-middel (this model) | 117M | 0.8435 | 0.8070 | 0.8834 |
| nl-lokaal-middel + filenthropist rule layer | 117M | 0.8386 | 0.8058 | 0.8742 |
| nl-lokaal-klein (student) | 46M | 0.7790 | 0.7689 | 0.7895 |
In-distribution = trained on pii-masking-300k train, evaluated on validation (7,457 Dutch rows, 47,638 gold entities after 14-category merge — never seen during training).
### Related work
The closest comparable open Dutch PII model is OpenMed/OpenMed-PII-Dutch-BioClinicalBERT-Base-110M-v1 (110M params, Apache-2.0), trained on ai4privacy/pii-masking-400k with a 54-label fine-grained scheme. It reports F1 0.8401 on its own 400k held-out benchmark. A direct head-to-head isn't scientifically meaningful — different test sets, different label taxonomies (54 fine-grained vs our 14 merged), different boundary conventions — but nl-lokaal-middel reaches 0.8435 on a comparable Dutch PII held-out set, suggesting parity at this model size.
### Per-category breakdown (300k validation, raw model, nl-lokaal-middel)
| Category | Support | Precision | Recall | F1 |
|---|---|---|---|---|
| | 2,540 | 0.9356 | 0.9717 | 0.9533 |
| IP_ADDRESS | 2,199 | 0.8715 | 0.9345 | 0.9019 |
| DRIVER_LICENSE | 2,429 | 0.8670 | 0.9337 | 0.8991 |
| BSN | 2,439 | 0.8545 | 0.9463 | 0.8981 |
| CITY | 5,141 | 0.8496 | 0.9167 | 0.8819 |
| USERNAME | 2,571 | 0.8732 | 0.8891 | 0.8811 |
| POSTAL_CODE | 1,807 | 0.8337 | 0.9242 | 0.8766 |
| PHONE | 1,932 | 0.8245 | 0.9022 | 0.8616 |
| PASSPORT | 4,540 | 0.8107 | 0.9044 | 0.8550 |
| PASSWORD | 1,443 | 0.7970 | 0.8898 | 0.8409 |
| ADDRESS | 4,517 | 0.7761 | 0.8656 | 0.8184 |
| PERSON | 8,673 | 0.7811 | 0.8193 | 0.7997 |
| DATE | 5,242 | 0.7331 | 0.8615 | 0.7921 |
| DATE_OF_BIRTH | 2,165 | 0.6393 | 0.7630 | 0.6957 |
| micro avg | 47,638 | 0.8070 | 0.8834 | 0.8435 |
| macro avg | 47,638 | 0.8176 | 0.8944 | 0.8540 |
All 14 categories score F1 ≥ 0.69. The high-recall profile (0.89 macro recall) makes this a good choice as a first-pass PII detector where missing an entity is more costly than flagging an extra one.
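If precision matters more for a given deployment, one simple lever is a per-span confidence threshold on the pipeline output. A sketch: the threshold value and the always-keep set below are illustrative choices on our part, not part of the released model.

```python
def filter_spans(spans, min_score=0.5, keep_always=("BSN", "CREDIT_CARD")):
    """Drop low-confidence spans to trade recall for precision, but never
    drop high-risk categories. min_score and keep_always are illustrative
    deployment choices, not model defaults."""
    return [
        s for s in spans
        if s["score"] >= min_score or s["entity_group"] in keep_always
    ]

# Stand-in pipeline output (illustrative, not real predictions):
spans = [
    {"entity_group": "PERSON", "score": 0.91, "word": "Jan"},
    {"entity_group": "DATE", "score": 0.34, "word": "gisteren"},
    {"entity_group": "BSN", "score": 0.41, "word": "123456782"},
]
print([s["entity_group"] for s in filter_spans(spans)])
# → ['PERSON', 'BSN']
```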
### How to reproduce

```shell
pip install datasets transformers seqeval
python compare_models.py --dataset 300k
```

Label-mapping tables used by the script match the "Design choices" section above.
## Training procedure

### Data

Identical data mix to nl-lokaal-klein — see its card for the full table. Briefly:
- 20% ai4privacy Dutch real spans
- 28% teacher pseudo-labels (self-distillation on unlabeled Dutch)
- 32% synthetic + LLM-generated (structured forms, clean prose)
- ~20% Dutch open-source NER corpora (WikiNEuRal, MultiNERD, Gretel, Careons)
Total: ~40K samples, 25% entity-replacement augmentation.
### Hyperparameter search — our own autoresearch loop
Hyperparameters were not hand-tuned. We used an in-house autoresearch agent that iterates on the config, trains, evaluates on a held-out benchmark, and either keeps or reverts each change — all autonomously. Over 100+ experiments explored learning rate, epochs, sequence length, label smoothing, B-tag weight, data mix ratios, augmentation ratio, and loss variants.
The pattern is inspired by Andrej Karpathy's minimal-loop approach to ML research — small, readable code, fast iteration, measured decisions.
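In outline, the loop is a greedy keep-or-revert search. All names below are hypothetical (the in-house agent itself is not released), and the "training run" is a toy stand-in function.

```python
import copy
import random

def autoresearch(base_config, proposals, train_and_eval, n_steps=10):
    """Greedy keep-or-revert loop: apply one proposed config change,
    retrain, and keep the change only if held-out F1 improves;
    otherwise revert. A sketch of the pattern, not the actual agent."""
    best_cfg = copy.deepcopy(base_config)
    best_f1 = train_and_eval(best_cfg)
    for _ in range(n_steps):
        candidate = copy.deepcopy(best_cfg)
        random.choice(proposals)(candidate)      # mutate one hyperparameter
        f1 = train_and_eval(candidate)
        if f1 > best_f1:                         # keep the change...
            best_cfg, best_f1 = candidate, f1    # ...else implicitly revert
    return best_cfg, best_f1

# Toy stand-in for a real training run: score peaks at lr = 2e-5.
def fake_train_and_eval(cfg):
    return 1.0 - abs(cfg["lr"] - 2e-5) * 1e4

def lower_lr(cfg):
    cfg["lr"] *= 0.8

def raise_lr(cfg):
    cfg["lr"] *= 1.25

random.seed(0)
best_cfg, best_f1 = autoresearch({"lr": 5e-5}, [lower_lr, raise_lr], fake_train_and_eval)
print(round(best_f1, 3))
```

Because a change is only kept when the held-out score improves, the score is monotonically non-decreasing across experiments, which is what makes the fully autonomous loop safe to run unattended.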
### Hyperparameters

| Setting | Value |
|---|---|
| Optimizer | AdamW, weight decay 0.02 |
| Learning rate | 2.0e-5, cosine schedule, 10% warmup |
| Epochs | 3 |
| Batch size | 16 × 2 gradient accumulation = 32 effective |
| Max sequence length | 384 |
| Label smoothing | 0.0 |
| B-tag boundary weight | 2.0× |
| FP16 | Enabled |
| Seed | 42 |
| Hardware | Apple Silicon (MPS) |
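The table maps onto a training config roughly as follows. Field names follow Hugging Face `TrainingArguments` conventions where one exists; this is a sketch, not the released training script.

```python
# Hyperparameters from the table above as a plain config dict.
# Field names follow Hugging Face TrainingArguments conventions
# where applicable; this is a sketch, not the released script.
config = {
    "learning_rate": 2.0e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,   # 16 × 2 = 32 effective
    "weight_decay": 0.02,
    "max_seq_length": 384,
    "label_smoothing_factor": 0.0,
    "b_tag_boundary_weight": 2.0,       # custom weighted-CE term, not a
                                        # standard TrainingArguments field
    "fp16": True,
    "seed": 42,
}

effective_batch = (
    config["per_device_train_batch_size"] * config["gradient_accumulation_steps"]
)
print(effective_batch)  # → 32
```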
## Intended use

### In scope
- High-assurance GDPR redaction / pseudonymization of Dutch text where accuracy matters more than latency.
- Teacher model for further distillation or fine-tuning into smaller variants.
- Dutch PII research as a strong open baseline.
### Out of scope
Same as nl-lokaal-klein:
- Languages other than Dutch
- Legal anonymization (detection ≠ removal + k-anonymity)
- Fine-grained sub-type distinctions (first name vs last name) — intentionally merged
- PII categories not in the trained 23 (custom corporate IDs, biometric descriptors, etc.)
## Limitations & biases

- Boundary conventions follow ai4privacy 300k. Datasets with different entity-splitting conventions (e.g., separate `STREET` + `BUILDINGNUMBER` entities vs our merged `ADDRESS`) will score lower under strict evaluation even when the model is qualitatively correct.
- Dutch gazetteer coverage reflects CBS and open-source name lists; recall on immigrant-origin names may be below average.
- Synthetic-data bias in the training mix toward form-like text.
- Single-model caution: at 117M params this is still a moderately sized model. For mission-critical redaction, ensemble it with rule-based backstops (see filenthropist's production pipeline).
## Ethical and legal considerations
Same as nl-lokaal-klein:
- Detection ≠ removal or anonymization. Operator retains legal responsibility under GDPR (Reg. (EU) 2016/679), UAVG, and EU AI Act (Reg. (EU) 2024/1689).
- Keep human review in the loop for legally consequential redactions.
- No external data transmission when run locally.
## Attribution & citation

Base model: RobBERT-2023-dutch-base by DTAI-KULeuven (MIT license).
Training data: ai4privacy/pii-masking-300k (CC-BY-4.0) plus Dutch open-source NER corpora.

```bibtex
@misc{nl_lokaal_middel_2026,
  title     = {nl-lokaal-middel: A Dutch PII NER Teacher Model},
  author    = {LokaalHub},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/LokaalHub/nl-lokaal-middel}
}

@inproceedings{delobelle2020robbert,
  title     = {{RobBERT}: a {Dutch} {RoBERTa}-based Language Model},
  author    = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020}
}
```
## Changelog

- v1.0 — 2026-04-19 — Initial release. Teacher checkpoint used for nl-lokaal-klein distillation.

Built in the Netherlands — optimized for Dutch privacy law, trained on Dutch data, shipped under Apache-2.0.