nl-lokaal-middel

Medium-size Dutch PII NER model — 117M parameters, fine-tuned from RobBERT-2023 for high-quality redaction of personally identifiable information.

nl-lokaal-middel ("middel" = medium in Dutch) is a token-classification model that identifies personally identifiable information in Dutch text across 14 GDPR-relevant categories. It is the accuracy-first / teacher member of the LokaalHub Dutch PII family — paired with the smaller distilled nl-lokaal-klein for low-latency workloads.

At a glance

| Property | Value |
| --- | --- |
| Base model | DTAI-KULeuven/robbert-2023-dutch-base |
| Parameters | 117M |
| Disk size | 473 MB (fp32) |
| Architecture | RoBERTa, 12 layers, hidden size 768, 12 attention heads |
| Max sequence | 512 tokens (trained at 384) |
| Language | Dutch (nl) |
| Task | Token classification (BIO, 47 labels) |
| License | Apache-2.0 |
| Training data | ai4privacy/pii-masking-300k (Dutch subset) + Dutch open-source NER corpora |

Quick start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "LokaalHub/nl-lokaal-middel"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Mijn naam is Jan van der Berg, BSN 123456782. Bel me op 06-12345678."
for span in ner(text):
    print(f"{span['entity_group']:14} {span['word']!r:30} score={span['score']:.2f}")
```

When to choose nl-lokaal-middel vs nl-lokaal-klein

| | nl-lokaal-klein | nl-lokaal-middel (this model) |
| --- | --- | --- |
| Parameters | 46M | 117M |
| Disk size | 177 MB | 473 MB |
| Typical throughput on CPU | ~200 tokens/s | ~70 tokens/s |
| Accuracy (Dutch 300k val, raw F1) | 0.7790 | 0.8435 |
| Use when | Edge / on-device / low latency required | Batch / server processing / accuracy-first |

Both share the same 14 output categories, so you can swap between them without changing downstream code.
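Because both checkpoints emit the same 14 merged categories, switching between them is a one-string change. A minimal illustration (the `select_checkpoint` helper is hypothetical, not part of any released API):

```python
# Both family members share the 14-category output scheme, so downstream
# redaction code never changes -- only the checkpoint id does.
KLEIN = "LokaalHub/nl-lokaal-klein"    # 46M params, low-latency / edge
MIDDEL = "LokaalHub/nl-lokaal-middel"  # 117M params, accuracy-first

def select_checkpoint(accuracy_first: bool) -> str:
    """Hypothetical helper: pick a family member by workload profile."""
    return MIDDEL if accuracy_first else KLEIN
```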


Design choices

1. 14 merged PII categories

nl-lokaal-middel predicts 23 entity types internally; 14 of them appear in the gold annotations of the standard Dutch PII evaluation set:

PERSON, BSN, IBAN, EMAIL, PHONE, USERNAME, PASSWORD, ADDRESS, CITY, POSTAL_CODE, PASSPORT, DRIVER_LICENSE, DATE_OF_BIRTH, CREDIT_CARD.

Additional categories trained but rarely annotated in public benchmarks: AGE, BTW (Dutch VAT), KVK (Chamber of Commerce), LICENSE_PLATE, ORGANIZATION, TECHNOLOGY, URL, DATE, IP_ADDRESS.

This merged scheme (e.g., combining FIRSTNAME + MIDDLENAME + LASTNAME + PREFIX → PERSON, STREET + BUILDINGNUMBER → ADDRESS) is chosen to match real redaction targets. See nl-lokaal-klein's card for full mapping tables — the schemes are identical.
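A minimal sketch of the merge step, using an illustrative subset of fine-grained names (the authoritative mapping tables live on nl-lokaal-klein's card):

```python
# Illustrative subset of the fine-grained -> merged label mapping;
# see the nl-lokaal-klein card for the full tables.
MERGE_MAP = {
    "FIRSTNAME": "PERSON",
    "MIDDLENAME": "PERSON",
    "LASTNAME": "PERSON",
    "PREFIX": "PERSON",
    "STREET": "ADDRESS",
    "BUILDINGNUMBER": "ADDRESS",
}

def merge_label(bio_tag: str) -> str:
    """Map a fine-grained BIO tag (e.g. 'B-FIRSTNAME') to the merged scheme."""
    if bio_tag == "O":
        return "O"
    prefix, _, name = bio_tag.partition("-")
    return f"{prefix}-{MERGE_MAP.get(name, name)}"
```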

2. RobBERT-2023 base, not multilingual

Built on RobBERT-2023, a Dutch-native RoBERTa trained on OSCAR-2023 (Dutch subset). This consistently outperforms multilingual bases on Dutch PII in our experiments. The BPE vocabulary is tuned for Dutch orthography, including compound words and diacritics.

3. Used as the distillation teacher for nl-lokaal-klein

Beyond its direct use, nl-lokaal-middel generated soft-label supervision for the smaller nl-lokaal-klein student.


Evaluation

All numbers use seqeval strict (IOB2 scheme). Raw model predictions — no post-processing.
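Strict matching counts a prediction only when both the span boundaries and the entity type match exactly. A self-contained sketch of that criterion (the reported numbers come from the seqeval library itself, not this code):

```python
def bio_spans(tags):
    """Extract (type, start, end) spans from an IOB2 tag sequence.
    Stray I- tags without a matching B- are ignored, as in strict mode."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != etype
        ):
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

def strict_f1(gold, pred):
    """Exact-boundary, exact-type span F1 over two IOB2 sequences."""
    g, p = set(bio_spans(gold)), set(bio_spans(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```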

Primary result — in-distribution (ai4privacy 300k validation, Dutch)

| Model | Params | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| nl-lokaal-middel (this model) | 117M | 0.8435 | 0.8070 | 0.8834 |
| nl-lokaal-middel + filenthropist rule layer | 117M | 0.8386 | 0.8058 | 0.8742 |
| nl-lokaal-klein (student) | 46M | 0.7790 | 0.7689 | 0.7895 |

In-distribution = trained on pii-masking-300k train, evaluated on validation (7,457 Dutch rows, 47,638 gold entities after 14-category merge — never seen during training).

Related work

The closest comparable open Dutch PII model is OpenMed/OpenMed-PII-Dutch-BioClinicalBERT-Base-110M-v1 (110M params, Apache-2.0), trained on ai4privacy/pii-masking-400k with a 54-label fine-grained scheme. It reports F1 0.8401 on its own 400k held-out benchmark. A direct head-to-head isn't scientifically meaningful — different test sets, different label taxonomies (54 fine-grained vs our 14 merged), different boundary conventions — but nl-lokaal-middel reaches 0.8435 on a comparable Dutch PII held-out set, suggesting parity at this model size.

Per-category breakdown (300k validation, raw model, nl-lokaal-middel)

| Category | Support | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| EMAIL | 2,540 | 0.9356 | 0.9717 | 0.9533 |
| IP_ADDRESS | 2,199 | 0.8715 | 0.9345 | 0.9019 |
| DRIVER_LICENSE | 2,429 | 0.8670 | 0.9337 | 0.8991 |
| BSN | 2,439 | 0.8545 | 0.9463 | 0.8981 |
| CITY | 5,141 | 0.8496 | 0.9167 | 0.8819 |
| USERNAME | 2,571 | 0.8732 | 0.8891 | 0.8811 |
| POSTAL_CODE | 1,807 | 0.8337 | 0.9242 | 0.8766 |
| PHONE | 1,932 | 0.8245 | 0.9022 | 0.8616 |
| PASSPORT | 4,540 | 0.8107 | 0.9044 | 0.8550 |
| PASSWORD | 1,443 | 0.7970 | 0.8898 | 0.8409 |
| ADDRESS | 4,517 | 0.7761 | 0.8656 | 0.8184 |
| PERSON | 8,673 | 0.7811 | 0.8193 | 0.7997 |
| DATE | 5,242 | 0.7331 | 0.8615 | 0.7921 |
| DATE_OF_BIRTH | 2,165 | 0.6393 | 0.7630 | 0.6957 |
| micro avg | 47,638 | 0.8070 | 0.8834 | 0.8435 |
| macro avg | 47,638 | 0.8176 | 0.8944 | 0.8540 |

All 14 categories score F1 ≥ 0.69. The high-recall profile (0.88 micro / 0.89 macro recall) makes this a good choice as a first-pass PII detector where missing an entity is more costly than flagging an extra one.
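As a first-pass detector, the aggregated pipeline output (which carries `start`/`end` character offsets) can drive a simple mask-by-category redactor. A sketch, assuming spans shaped like the pipeline's `aggregation_strategy="simple"` output:

```python
def redact(text, spans, min_score=0.5):
    """Replace detected spans with [CATEGORY] placeholders, working right
    to left so earlier character offsets stay valid. `spans` are dicts
    with 'entity_group', 'score', 'start', 'end', as produced by the
    Hugging Face NER pipeline."""
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        if s["score"] >= min_score:
            text = text[:s["start"]] + f"[{s['entity_group']}]" + text[s["end"]:]
    return text
```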

How to reproduce

```bash
pip install datasets transformers seqeval
python compare_models.py --dataset 300k
```

Label-mapping tables used by the script match the "Design choices" section above.


Training procedure

Data

Same data mix as nl-lokaal-klein — see its card for the full table. Briefly:

  • 20% ai4privacy Dutch real spans
  • 28% teacher pseudo-labels (self-distillation on unlabeled Dutch)
  • 32% synthetic + LLM-generated (structured forms, clean prose)
  • ~20% Dutch open-source NER corpora (WikiNEuRal, MultiNERD, Gretel, Careons)

Total: ~40K samples, 25% entity-replacement augmentation.
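The entity-replacement augmentation can be sketched as swapping a gold span's surface form for a same-category alternative while re-emitting aligned BIO tags. The names below are illustrative, not the training script's API:

```python
import random

# Illustrative replacement pools; the real training mix draws from
# much larger Dutch gazetteers.
POOLS = {
    "PERSON": ["Sanne de Vries", "Pieter Bakker"],
    "CITY": ["Utrecht", "Eindhoven"],
}

def replace_entity(tokens, tags, rng=random):
    """Swap the first replaceable entity span for a pool sample,
    keeping the BIO tags aligned with the new surface form."""
    for i, tag in enumerate(tags):
        cat = tag[2:] if tag.startswith("B-") else None
        if cat in POOLS:
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{cat}":
                j += 1
            new = rng.choice(POOLS[cat]).split()
            new_tags = [f"B-{cat}"] + [f"I-{cat}"] * (len(new) - 1)
            return tokens[:i] + new + tokens[j:], tags[:i] + new_tags + tags[j:]
    return tokens, tags
```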

Hyperparameter search — our own autoresearch loop

Hyperparameters were not hand-tuned. We used an in-house autoresearch agent that iterates on the config, trains, evaluates on a held-out benchmark, and either keeps or reverts each change — all autonomously. Over 100+ experiments explored learning rate, epochs, sequence length, label smoothing, B-tag weight, data mix ratios, augmentation ratio, and loss variants.

The pattern is inspired by Andrej Karpathy's minimal-loop approach to ML research — small, readable code, fast iteration, measured decisions.
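At its core the keep-or-revert loop is greedy hill-climbing over the training config. A toy sketch, with an illustrative objective standing in for the real train-then-evaluate step:

```python
def autoresearch(config, proposals, evaluate):
    """Greedy keep/revert loop: apply each proposed config change,
    re-evaluate, and keep it only if the held-out score improves."""
    best = evaluate(config)
    for key, value in proposals:
        candidate = {**config, key: value}
        score = evaluate(candidate)
        if score > best:               # keep the change
            config, best = candidate, score
        # else: revert (the candidate is simply discarded)
    return config, best

# Toy objective standing in for "train + evaluate held-out F1".
def toy_eval(cfg):
    return -abs(cfg["lr"] - 2e-5) * 1e4 - abs(cfg["epochs"] - 3) * 0.1

cfg, score = autoresearch(
    {"lr": 5e-5, "epochs": 5},
    [("lr", 2e-5), ("epochs", 10), ("epochs", 3)],
    toy_eval,
)
```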

Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW, weight decay 0.02 |
| Learning rate | 2.0e-5, cosine schedule, 10% warmup |
| Epochs | 3 |
| Batch size | 16 × 2 gradient accumulation = 32 effective |
| Max sequence length | 384 |
| Label smoothing | 0.0 |
| B-tag boundary weight | 2.0× |
| FP16 | Enabled |
| Seed | 42 |
| Hardware | Apple Silicon (MPS) |
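The 2.0× B-tag boundary weight means beginning-of-entity tags contribute double to the token-classification loss. A sketch of the weight-vector construction, assuming it would feed a weighted cross-entropy (e.g. `torch.nn.CrossEntropyLoss(weight=...)`):

```python
def boundary_weights(labels, b_weight=2.0):
    """Per-class loss weights: B- tags get b_weight, everything else 1.0."""
    return [b_weight if lab.startswith("B-") else 1.0 for lab in labels]

labels = ["O", "B-PERSON", "I-PERSON", "B-BSN", "I-BSN"]
```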

Intended use

In scope

  • High-assurance GDPR redaction / pseudonymization of Dutch text where accuracy matters more than latency.
  • Teacher model for further distillation or fine-tuning into smaller variants.
  • Dutch PII research as a strong open baseline.

Out of scope

Same as nl-lokaal-klein:

  • Languages other than Dutch
  • Legal anonymization (detection ≠ removal + k-anonymity)
  • Fine-grained sub-type distinctions (first name vs last name) — intentionally merged
  • PII categories not in the trained 23 (custom corporate IDs, biometric descriptors, etc.)

Limitations & biases

  • Boundary conventions follow ai4privacy 300k. Datasets with different entity-splitting conventions (e.g., separate STREET + BUILDINGNUM entities vs our merged ADDRESS) will score lower under strict evaluation even when the model is qualitatively correct.
  • Dutch gazetteer coverage reflects CBS and open-source name lists — immigrant-origin names may recall below average.
  • Synthetic-data bias in the training mix toward form-like text.
  • Single-model caution: at 117M params this is still a moderately sized model. For mission-critical redaction, ensemble it with rule-based backstops (see filenthropist's production pipeline).

Ethical and legal considerations

Same as nl-lokaal-klein:

  • Detection ≠ removal or anonymization. Operator retains legal responsibility under GDPR (Reg. (EU) 2016/679), UAVG, and EU AI Act (Reg. (EU) 2024/1689).
  • Keep human review in the loop for legally consequential redactions.
  • No external data transmission when run locally.

Attribution & citation

Base model: RobBERT-2023-dutch-base by DTAI-KULeuven (MIT license).

Training data: ai4privacy/pii-masking-300k (CC-BY-4.0) plus Dutch open-source NER corpora.

```bibtex
@misc{nl_lokaal_middel_2026,
  title         = {nl-lokaal-middel: A Dutch PII NER Teacher Model},
  author        = {LokaalHub},
  year          = {2026},
  publisher     = {Hugging Face},
  url           = {https://huggingface.co/LokaalHub/nl-lokaal-middel}
}

@inproceedings{delobelle2020robbert,
  title     = {{RobBERT}: a {Dutch} {RoBERTa}-based Language Model},
  author    = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020}
}
```

Changelog

  • v1.0 — 2026-04-19 — Initial release. Teacher checkpoint used for nl-lokaal-klein distillation.

Built in the Netherlands — optimized for Dutch privacy law, trained on Dutch data, shipped under Apache-2.0.
