# Model Card
`bardsai/eu-pii-anonimization-multilang` is a token-classification model fine-tuned for detecting personally identifiable information (PII) in Polish text, aligned with the definitions of the EU GDPR.
## Model Details
| Property | Value |
|---|---|
| Base model | XLM-RoBERTa |
| Task | Token classification (NER) |
| Labels | 35 entity types, BIO tagging (71 labels) |
| Language | Polish |
| License | Apache 2.0 |
| Repository | bardsai/eu-pii-anonimization-multilang |
## Architecture
The model is based on XLM-RoBERTa with a token classification head. It uses the BIO tagging scheme: each entity type has a B- (beginning) and an I- (inside) label, plus a shared O (outside) label for non-entity tokens.
- Parameters: ~278M (XLM-RoBERTa base)
- Max sequence length: 512 tokens
- Tokenizer: SentencePiece (XLM-RoBERTa)
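To make the BIO scheme concrete, here is a minimal sketch of decoding a tagged token sequence into entity spans. The label names (`B-PERSON_NAME`, `B-POSTAL_ADDRESS`, …) are illustrative; the model's actual 71-label set may use different identifiers.

```python
# Sketch: decoding BIO tags into entity spans. Label names here are
# illustrative, not the model's exact label set.

def decode_bio(tokens, tags):
    """Group (token, tag) pairs into (entity_type, token_list) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open span
            if current_type:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type:
        spans.append((current_type, current_tokens))
    return spans

tokens = ["Jan", "Kowalski", "mieszka", "w", "Warszawie"]
tags = ["B-PERSON_NAME", "I-PERSON_NAME", "O", "O", "B-POSTAL_ADDRESS"]
print(decode_bio(tokens, tags))
# → [('PERSON_NAME', ['Jan', 'Kowalski']), ('POSTAL_ADDRESS', ['Warszawie'])]
```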
## Performance
The model achieves strong aggregate performance, though scores vary by entity category (see below):
| Metric | Score |
|---|---|
| Overall F1 | ~95% |
| Precision | ~94% |
| Recall | ~95% |
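F1 is the harmonic mean of precision and recall; the approximate figures in the table are mutually consistent, as a quick check shows:

```python
# F1 as the harmonic mean of precision and recall, using the
# approximate figures from the table above.
precision, recall = 0.94, 0.95
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # → 0.945, i.e. ~95%
```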
### Per-Category Performance
Performance varies by entity type. Categories with more training examples (like PERSON_NAME, POSTAL_ADDRESS) tend to achieve higher F1 scores, while rarer categories (like GENETIC_DATA, BIOMETRIC_DATA) may have lower recall.
- Strongest categories: Personal Identity, Contact & Location, Government IDs
- More challenging categories: Health & Biometric (less training data), Special Categories (context-dependent)
## Training
- Training data: Curated dataset of Polish text with PII annotations
- Annotation: Expert-annotated following GDPR Article 4 and Article 9 definitions
- Approach: Fine-tuning XLM-RoBERTa with token-level classification
## ONNX Support
The model is available in ONNX format for efficient inference:
- Standard ONNX: Full precision, highest accuracy
- Quantized ONNX: INT8 quantized for faster inference with minimal accuracy loss
This demo uses the ONNX quantized version via Transformers.js for browser-based inference.
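The real model is quantized with ONNX Runtime tooling, not hand-rolled code; the following is only a minimal sketch of symmetric per-tensor INT8 quantization, to illustrate why the quantized variant loses little accuracy: weights are stored as int8 plus a scale, and dequantization recovers them to within one quantization step.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization.
# Illustrative only: the actual model is quantized via ONNX Runtime.

def quantize_int8(values):
    # One scale for the whole tensor, mapping the largest magnitude to 127.
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.033, 0.25, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err < scale)  # → True: error bounded by one quantization step
```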
## Limitations
- Language: Optimized for Polish text. Performance on other languages is not guaranteed.
- Context window: Limited to 512 tokens per inference. Longer texts are split into chunks.
- Entity boundaries: Compound entities (e.g., multi-part addresses) may occasionally be split into separate spans.
- Domain specificity: Trained primarily on general-purpose text; accuracy may drop in specialized domains such as legal or medical documents.
- False positives: Common Polish names that also serve as regular words may be flagged incorrectly.
- Evolving PII: New forms of PII (e.g., biometric identifiers) may not be fully captured.
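The 512-token context window means long inputs must be chunked. A simple sketch of chunking with overlap follows; it splits on words for illustration, whereas in practice the SentencePiece tokenizer's token count decides the boundaries. Overlapping chunks reduce the chance of an entity being cut at a chunk edge.

```python
# Sketch: splitting a long input into overlapping chunks so each fits
# the 512-token window. Word-level split is illustrative only; the
# model actually counts SentencePiece tokens.

def chunk_text(words, max_len=512, overlap=32):
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_len, len(words))
        chunks.append(words[start:end])
        if end == len(words):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks

words = [f"w{i}" for i in range(1000)]
chunks = chunk_text(words)
print(len(chunks), len(chunks[0]))  # → 3 512
```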
## Ethical Considerations
- This model is a detection tool, not a guarantee of full anonymization
- Always review detected entities before relying on them for compliance purposes
- The model should be used as part of a broader data protection strategy
- No data is transmitted externally — all inference runs locally in the browser