ko-pii-public-v1 (v4 update — long-document robust)
Korean PII (개인식별정보) detector across 23 categories spanning common /
public-sector / finance / medical domains. Fine-tuned from
openai/privacy-filter.
What changed from the previous (v2) release of this repo: training data now includes (a) synthetic long Korean documents (1,500–3,000 chars, multi-PII) targeting the long-document spillover failure mode reported by an external evaluator, (b) raw-numeric IDs with keyword grounding (e.g., "건강보험번호 1234567890", the keyword "health insurance number" followed by a raw 10-digit number), (c) hard-negative co-occurrence documents (card / health insurance / business reg / corporate together), and (d) KLUE NER real-world Korean sentences (CC BY-SA 4.0) for a natural person/address distribution.
See the License section below: this release inherits CC BY-SA 4.0 from the KLUE training data; the previous Apache 2.0 release of this repo did not include KLUE.
When to use vs. not use this update
✅ Use v4 if: you process Korean documents longer than ~150 chars (forms, emails, KYC, medical records, contracts, meeting minutes). v4 fixes the long-document failure mode of v2.
⚠️ Stick with v2 if: your only use case is short conversational Korean (KDPII-style chat), and the 0.877 → 0.858 F1 dip on KDPII test matters more to you than long-document correctness.
For most real-world deployments, v4 is the right choice. For an academic benchmark report against KDPII alone, v2 had a slightly better headline number.
Quick start

    from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
    import torch

    model_id = "ehd0309/ko-pii-public-v1"  # this repo, v4 weights
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id, dtype=torch.bfloat16,
    ).to("cuda").eval()

    nlp = pipeline("token-classification", model=model, tokenizer=tok,
                   aggregation_strategy="simple", device=0)

    # RECOMMENDED for v4: threshold 0.93 (slightly higher than v2's 0.9)
    # Reason: v4 has higher recall but lower precision on conversational Korean;
    # the +0.03 threshold bump restores precision parity with v2 at thr=0.9.
    preds = nlp("고객 김민수 (010-1234-5678) 계좌 110-998877-665544 로 송금")
    preds = [p for p in preds if p["score"] >= 0.93]
    print(preds)

For long documents (>500 chars) we still recommend operational chunking (150-char line packing) on top of the model — see the Recommended deployment section below.
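With aggregation_strategy="simple", each prediction from the pipeline is a dict carrying entity_group, score, start and end. As a minimal sketch of what a caller might do next, the hypothetical helper below (redact is not part of this repo) replaces each kept span with a label placeholder. The predictions in the example are hand-written for illustration, not real model output, and overlapping spans are not handled.

```python
def redact(text: str, preds: list[dict], threshold: float = 0.93) -> str:
    """Replace every span scoring at or above threshold with a [label] placeholder."""
    kept = [p for p in preds if p["score"] >= threshold]
    # Work from the end of the string so earlier offsets stay valid.
    for p in sorted(kept, key=lambda p: p["start"], reverse=True):
        text = text[:p["start"]] + f"[{p['entity_group']}]" + text[p["end"]:]
    return text


# Hand-written example predictions (assumed shape, NOT real model output).
sample = "고객 김민수 (010-1234-5678)"
fake_preds = [
    {"entity_group": "person_name", "score": 0.99, "start": 3, "end": 6},
    {"entity_group": "phone_number", "score": 0.98, "start": 8, "end": 21},
    {"entity_group": "address", "score": 0.50, "start": 0, "end": 2},  # below threshold
]
print(redact(sample, fake_preds))
```

The low-scoring address prediction is dropped by the threshold, so only the name and phone number are masked.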
Categories (23) — same as v2
| Group | Labels |
|---|---|
| Common (8) | person_name, phone_number, email, address, date_of_birth, ip_address, url, credential_secret |
| Public-sector (5) | rrn, foreigner_id, drivers_license, passport_number, vehicle_plate |
| Finance (6) | bank_account, card_number, card_cvc, card_expiry, business_reg_number, corporate_number |
| Medical (4) | patient_id, health_insurance_no, medical_license, prescription_id |
Training data
| Source | Records | Style | License |
|---|---|---|---|
| v3 synthetic generator (carried over) | 5,736 | Document/form style | Apache 2.0 (this repo) |
| KDPII train (carried over) | 11,919 | Real conversational Korean | CC BY 4.0 (Zenodo) |
| NEW v4 synthetic long-doc | 3,000 | Long Korean documents (1,500–3,000 chars) with 8–15 multi-category PII per doc | Apache 2.0 (this repo) |
| NEW v4 raw-keyword | 1,500 | Keyword + raw-numeric ID (건보번호 1234567890 등) | Apache 2.0 (this repo) |
| NEW v4 hard-negative co-occurrence | 1,000 | Card + health + biz reg + corporate distinct in same doc | Apache 2.0 (this repo) |
| NEW v4 OCR-noise | 500 | Em-dash, keyword typos, separator variants | Apache 2.0 (this repo) |
| NEW KLUE NER (CC BY-SA 4.0) | 5,000 | Real Korean (news/wiki) with PS→person_name, postal LC→address | CC BY-SA 4.0 (KAIST) |
| NEW v4 boundary-fix | 1,000 | Bank/hospital/company name outside the labelled span | Apache 2.0 (this repo) |
| NEW v4 defect-targeted | 1,500 | Long postal addr / raw 10-digit health / dense person | Apache 2.0 (this repo) |
| Combined train total | 31,155 | | mixed (most restrictive: CC BY-SA 4.0) |
Training details
- Base: openai/privacy-filter (1.5B params total, 50M active MoE, banded attention sliding_window=128)
- Trainer: Hugging Face Trainer, BIO labels (47 = 1 + 23 × 2)
- Hyperparameters: 4 epochs, batch 8 + grad-accum 2 (effective 16), AdamW (lr 1.5e-4, wd 0.01, warmup 10%, cosine schedule), max_length 384, seed 42
- Hardware: 1× NVIDIA GB10 (DGX Spark), bfloat16
- Training time: ~14 h
- Single seed. No multi-seed variance is reported. Treat numbers as a point estimate, not a confidence interval.
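The label count 47 = 1 + 23 × 2 follows from the BIO scheme: one O tag plus a B- and an I- tag for each of the 23 categories. The sketch below rebuilds such a label list from the category table. The B-/I- naming and the ordering are assumptions made for illustration; the authoritative mapping is whatever id2label the released model config contains.

```python
# The 23 categories from the "Categories" table, grouped as in that table.
CATEGORIES = [
    # Common (8)
    "person_name", "phone_number", "email", "address", "date_of_birth",
    "ip_address", "url", "credential_secret",
    # Public-sector (5)
    "rrn", "foreigner_id", "drivers_license", "passport_number", "vehicle_plate",
    # Finance (6)
    "bank_account", "card_number", "card_cvc", "card_expiry",
    "business_reg_number", "corporate_number",
    # Medical (4)
    "patient_id", "health_insurance_no", "medical_license", "prescription_id",
]

# Assumed naming/ordering: "O" first, then B-/I- pairs per category.
labels = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(CATEGORIES), len(labels))
```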
Evaluation
1. KDPII held-out test (carried over from v2 — direct anchor)
seqeval BIO + lenient span overlap, all 23 labels:
| Threshold | v2 (previous release) | v4 (this release) |
|---|---|---|
| 0.0 | F1 0.795 | F1 0.798 |
| 0.7 | F1 0.850 | F1 0.851 |
| 0.9 (v2 recommended) | F1 0.877 | F1 0.858 |
| 0.93 (v4 recommended) | F1 0.870 | F1 0.860 |
| 0.95 | F1 0.854 | F1 0.851 |
Honest disclosure: v4 is ~1.9p worse on KDPII test at the v2-recommended threshold of 0.9. The dip is precision-driven (P 0.941→0.859, R 0.821→0.857). Bumping threshold to 0.93 narrows the gap. KDPII is conversational Korean chat — see section 2 for the regime where v4 is better.
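The F1 values at threshold 0.9 can be checked against the quoted precision and recall, since F1 is their harmonic mean. A short worked check, using only the numbers stated above:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


v2 = f1(0.941, 0.821)  # v2 at threshold 0.9
v4 = f1(0.859, 0.857)  # v4 at threshold 0.9
print(round(v2, 3), round(v4, 3), round(v2 - v4, 3))
```

This reproduces 0.877 and 0.858, a gap of about 0.019: v4 gains recall but loses more precision than it gains.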
2. Length-stratified evaluation (new — fair comparison across input lengths)
50 fresh synthetic records per bucket (different RNG seed than training). Typed F1 at threshold 0.9:
| Length bucket | avg chars | v2 typed F1 | v4 typed F1 |
|---|---|---|---|
| short (≤150) | 25 | 0.659 | 0.990 |
| mid (151–400) | 152 | 0.905 | 1.000 |
| long (401–1,000) | 552 | 0.615 | 0.999 |
| xlong (1,001+) | 1,749 | 0.601 | 0.999 |
| σ (length-uniformity) | | 0.139 | 0.005 |
Caveat: this synthetic test set uses templates similar (not identical) to v4's training data. It is in-distribution leaning — treat absolute numbers with caution. The relative comparison vs. v2 is meaningful (same templates evaluated on both); the ΔF1 is reproducible.
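To repeat this kind of stratified check on your own holdout, records first need to be assigned to the same length buckets. A minimal sketch using the bucket boundaries from the table; the function names are hypothetical, and how scores are aggregated inside each bucket is left to the caller because the card does not specify it.

```python
from collections import defaultdict


def length_bucket(text: str) -> str:
    """Map a text to the length bucket used in the table (character counts)."""
    n = len(text)
    if n <= 150:
        return "short"
    if n <= 400:
        return "mid"
    if n <= 1000:
        return "long"
    return "xlong"


def group_by_length(texts: list[str]) -> dict[str, list[str]]:
    """Group texts by bucket so each bucket can be scored separately."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for text in texts:
        grouped[length_bucket(text)].append(text)
    return dict(grouped)
```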
3. Defect-targeted probes (new — driven by external evaluator's report)
8 single-document cases mirroring the production-user evaluation that motivated this update. PASS = all expected spans found, no spurious FPs:
| Probe | v2 | v4 |
|---|---|---|
| Long postal address (e.g., "서울특별시 강남구 ... 17층 1701호") | FAIL 0/1 | ✅ 1/1 |
| Raw 10-digit health insurance (건강보험번호 1234567890) | FAIL 0/1 | ✅ 1/1 |
| Standalone CVC (CVC는 123 이고) | FAIL 1/2 | ✅ 2/2 |
| Dense person names (5+ in one sentence) | FAIL 4/5 | ✅ 5/5 |
| Bank-name prefix excluded from bank_account span | PASS | PASS |
| Long-doc ID spillover (6 distinct ID classes co-occurring) | FAIL 3/6 | ✅ 6/6 |
| PII-free conversational text (must not predict) | PASS | PASS |
| PII-shaped non-PII (ISBN, version, track ID — must not predict) | PASS | PASS |
| Total | 3 PASS / 5 FAIL | 8 PASS / 0 FAIL |
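The PASS rule used in the table (all expected spans found, no spurious false positives) can be written as a small predicate. This is a sketch under an assumption: a prediction counts as finding an expected span when the labels match and the character ranges overlap. The card does not state the exact matching rule used for the probes.

```python
Span = tuple[int, int, str]  # (start, end, label), end exclusive


def _same_entity(a: Span, b: Span) -> bool:
    """Assumed matching rule: same label and overlapping character ranges."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def probe_passes(expected: list[Span], predicted: list[Span]) -> bool:
    """PASS = every expected span is found AND no prediction is spurious."""
    all_found = all(any(_same_entity(e, p) for p in predicted) for e in expected)
    no_spurious = all(any(_same_entity(p, e) for e in expected) for p in predicted)
    return all_found and no_spurious
```

For the two negative probes (PII-free text, PII-shaped non-PII) the expected list is empty, so the probe passes only if the model predicts nothing.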
4. Production-style aggregate (30 cases including the 583-char referral letter)
| Metric | v2 | v4 |
|---|---|---|
| Typed F1 across 30 user-style cases | 0.714 | 0.964 |
| Untyped F1 | 0.735 | 0.964 |
| Long-doc (583 chars, 21 expected PII) | 7/21 | 20/21 |
| FP on the long doc | 4 | 0 |
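The difference between typed and untyped F1 is whether a prediction must carry the correct label to count. The sketch below illustrates that distinction with lenient (overlap-based) matching and greedy one-to-one pairing. Both choices are assumptions for illustration; the exact scoring script behind the table is not shown in this card.

```python
Span = tuple[int, int, str]  # (start, end, label), end exclusive


def span_f1(gold: list[Span], pred: list[Span], typed: bool = True) -> float:
    """F1 over spans; a prediction is a true positive if it overlaps an
    unmatched gold span (and, when typed=True, has the same label)."""
    used: set[int] = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            overlap = g[0] < p[1] and p[0] < g[1]
            if i not in used and overlap and (not typed or g[2] == p[2]):
                used.add(i)  # each gold span can be matched only once
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 3, "person_name"), (10, 23, "phone_number")]
pred = [(0, 3, "person_name"), (10, 23, "bank_account")]  # right span, wrong label
print(span_f1(gold, pred, typed=True), span_f1(gold, pred, typed=False))
```

In this toy case the mislabelled span counts against typed F1 but not against untyped F1, which is why untyped is never lower than typed under this definition.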
Limitations (defensive — please read)
- KDPII conversational regression (-1.9p F1 @ thr=0.9). For pure short-chat Korean, v2 of this repo is marginally better. Bump threshold to 0.93 if you need to re-balance precision.
- Two short-context regressions identified:
  - naked passport "M12345678" alone (no surrounding keyword): v4 misses; v2 detected. Mitigation: regex post-processing.
  - naked vehicle plate "12가3456" alone: same. Mitigation: regex.
  - These do not affect documents with normal context ("여권번호 M12345678" works).
- Single training seed. No variance estimate. Real F1 could be ±1–2p.
- In-distribution evaluation bias. Sections 2 and 3 above use templates similar to v4's training data. Out-of-distribution natural Korean (KDPII section 1) shows the trade-off honestly: v4 trades a small KDPII regression for large gains on document-style Korean.
- Inherits banded-attention architectural limit. The base openai/privacy-filter uses sliding_window=128. Even with v4's training improvements, very long documents (>2,000 chars) at inference still benefit from operational chunking. Long-doc training mitigates but does not eliminate the architectural constraint.
- No labelled real-world production data was used. KDPII (CC BY 4.0) and KLUE NER (CC BY-SA 4.0) are the closest natural-distribution sources. Domain-specific (e.g., your own medical/legal documents) performance must be re-measured on your own holdout.
- No OCR / dialect / heavy noise evaluation reported. The ocr_noise training bucket (em-dash, keyword typos) is mitigation only, not a benchmark.
- address label is broad. It will flag short geographic terms (e.g., country names, city districts) in addition to full postal addresses. For strict postal-only filtering, post-process by length / regex.
- person_name includes nicknames. KDPII labels nicknames as person; we inherit this. A chat handle like "토깽이" will be flagged as person_name. This is intentional for compliance use but may surprise users expecting strict legal-name detection.
- Personal-account model. No SLA, no patch guarantee, no incident response. For production-critical workloads, fork and self-host.
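The regex mitigation suggested for naked passport numbers and vehicle plates could look roughly like the sketch below. The two patterns are illustrative assumptions derived only from the examples in this card ("M12345678", "12가3456"); they are not the patterns from the reference post-processor, they do not cover every real-world format, and the score of 1.0 assigned to regex hits is an arbitrary placeholder.

```python
import re

# Illustrative patterns only (assumptions based on the examples in this card).
FALLBACK_PATTERNS = {
    # One uppercase letter followed by 8 digits, not embedded in a longer token.
    "passport_number": re.compile(r"(?<![A-Za-z0-9])[A-Z]\d{8}(?![A-Za-z0-9])"),
    # 2-3 digits, one Hangul syllable, 4 digits.
    "vehicle_plate": re.compile(r"(?<!\d)\d{2,3}[가-힣]\d{4}(?!\d)"),
}


def regex_fallback_sketch(text: str, preds: list[dict]) -> list[dict]:
    """Add regex hits that no existing prediction already overlaps."""
    out = list(preds)
    for label, pattern in FALLBACK_PATTERNS.items():
        for m in pattern.finditer(text):
            covered = any(p["start"] < m.end() and m.start() < p["end"] for p in out)
            if not covered:
                out.append({
                    "entity_group": label,
                    "score": 1.0,  # placeholder score for regex hits
                    "word": m.group(),
                    "start": m.start(),
                    "end": m.end(),
                })
    return sorted(out, key=lambda p: p["start"])
```

Model predictions take priority here: a regex hit is added only when nothing already covers that stretch of text, so the fallback raises recall without relabelling spans the model has found.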
Recommended deployment

    def detect_pii(text: str) -> list[dict]:
        # nlp is the token-classification pipeline built in Quick start.
        if len(text) <= 200:
            preds = [p for p in nlp(text) if p["score"] >= 0.93]  # v4 threshold
        else:
            # Operational chunking: paranoid even though v4 handles long docs.
            # Note: nlp(c) reports start/end relative to the chunk c, so shift
            # them to document offsets before any step that indexes into text.
            chunks = chunk_by_lines(text, max_chars=150)
            preds = []
            for c in chunks:
                preds += [p for p in nlp(c) if p["score"] >= 0.93]

        # Regex post-processing for the two known v4 short-context regressions
        # AND for absolute recall on canonical formats:
        preds = merge_adjacent_address(text, preds)  # address fragmentation
        preds = strip_bank_prefix(preds)             # bank_account boundary
        preds = regex_fallback(text, preds, patterns=[
            "rrn", "foreigner_id", "card_number", "phone_number",
            "email", "ip_address", "url", "passport_number",
            "business_reg_number", "vehicle_plate",
        ])
        return preds

A reference implementation of the post-processor is at dgx-guard/app/postprocess.py.
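The 150-char line packing used in detect_pii is not shown in this card. As a rough sketch of one way it could work (not the reference implementation), the hypothetical pack_lines below greedily packs consecutive lines into chunks and also returns each chunk's start offset. The offset is needed because pipeline predictions carry start/end positions relative to the chunk they were run on, not to the whole document.

```python
def pack_lines(text: str, max_chars: int = 150) -> list[tuple[int, str]]:
    """Greedily pack consecutive lines into chunks of at most max_chars.

    Returns (start_offset, chunk_text) pairs. A single line longer than
    max_chars is kept whole rather than split mid-line.
    """
    chunks: list[tuple[int, str]] = []
    start, buf, pos = 0, "", 0
    for line in text.splitlines(keepends=True):
        if buf and len(buf) + len(line) > max_chars:
            chunks.append((start, buf))
            start, buf = pos, ""  # next chunk begins at the current line
        buf += line
        pos += len(line)
    if buf:
        chunks.append((start, buf))
    return chunks


def shift_to_document(preds: list[dict], offset: int) -> list[dict]:
    """Convert chunk-relative start/end into document-relative offsets."""
    return [{**p, "start": p["start"] + offset, "end": p["end"] + offset} for p in preds]


# Intended use with the Quick start pipeline (nlp is not defined in this sketch):
# preds = []
# for offset, chunk in pack_lines(text):
#     preds += shift_to_document(nlp(chunk), offset)
```

Because the chunks are contiguous slices of the input, concatenating them gives back the original text, which makes the offset arithmetic easy to verify.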
Operational checklist
- Threshold 0.93 (not 0.9) for v4
- Chunking (≤200 chars) for inputs over ~500 chars
- Regex fallback for naked passport / vehicle / RRN / card / phone
- Domain holdout set of 200+ records, measured on your own data
- Audit log (input hash, predictions, timestamp)
- Pseudonymisation mapping kept in a separate KMS-protected store
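For the audit-log item, one possible record shape is sketched below. It stores a SHA-256 hash of the input, a UTC timestamp, and the predictions reduced to label, offsets and score. The field names are hypothetical, and leaving the matched text out of the log is a design assumption of this sketch (so that the log does not itself store PII), not a requirement stated by this card.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(text: str, preds: list[dict]) -> dict:
    """Build a JSON-serialisable audit entry without the raw input or matched text."""
    return {
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "predictions": [
            {
                "label": p["entity_group"],
                "start": p["start"],
                "end": p["end"],
                "score": round(float(p["score"]), 4),  # float() also handles numpy scalars
            }
            for p in preds
        ],
    }


# Hand-written example prediction (assumed shape, not real model output).
record = audit_record(
    "고객 김민수",
    [{"entity_group": "person_name", "score": 0.99, "word": "김민수", "start": 3, "end": 6}],
)
print(json.dumps(record, ensure_ascii=False))
```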
License & attribution
This release is CC BY-SA 4.0. The previous v2 release of this repo was Apache 2.0; v4 is BY-SA because we trained on KLUE NER, which is CC BY-SA 4.0. Under ShareAlike, the model derived from this data is offered under the same license. We accept this as the most defensible interpretation; if you have a different legal opinion, please flag it.
Attribution required when redistributing or building on this model:
- Base model: openai/privacy-filter (Apache 2.0)
- Training data:
If you need an Apache-2.0-licensed Korean PII model, the previous v2 commit of this repo is reachable via Hugging Face commit history (commit 1e029083). It does not include KLUE training data.
Citation
@misc{ko-pii-public-v4,
title = {ko-pii-public-v1 (v4): Korean PII Detection — Long-Document Robust},
author = {ehd0309},
year = {2026},
note = {Fine-tuned from openai/privacy-filter on synthetic Korean +
KDPII + KLUE NER, targeting long-document spillover and raw-numeric
ID detection. CC BY-SA 4.0.},
url = {https://huggingface.co/ehd0309/ko-pii-public-v1},
}
KDPII benchmark used in training and evaluation:
@misc{kdpii2024,
title = {KDPII: A New Korean Dialogic Dataset for the Deidentification
of Personally Identifiable Information},
year = {2024},
doi = {10.5281/zenodo.10968609},
note = {CC BY 4.0},
}
KLUE NER:
@misc{klue,
title = {KLUE: Korean Language Understanding Evaluation},
author = {Park, Sungjoon and others},
year = {2021},
url = {https://klue-benchmark.com/},
note = {CC BY-SA 4.0},
}
Changelog
- v4 (this commit): long-document training, KLUE NER inclusion, defect-fix buckets, threshold raised to 0.93. License changed to CC BY-SA 4.0.
- v2 (commit 1e029083): synthetic + KDPII augmentation. KDPII test F1 0.44 → 0.88. Apache 2.0.
- v1 (commit 8791afd7): initial release, synthetic-only. Apache 2.0.