Model Card

bardsai/eu-pii-anonimization-multilang is a token classification model fine-tuned for detecting personally identifiable information (PII) in Polish text, aligned with EU GDPR definitions.

Model Details

Property | Value
Base model | XLM-RoBERTa
Task | Token classification (NER)
Labels | 35 entity types, BIO tagging (71 labels)
Language | Polish
License | Apache 2.0
Repository | bardsai/eu-pii-anonimization-multilang

Architecture

The model is based on XLM-RoBERTa with a token classification head. It uses the BIO tagging scheme: each entity type has both a B- (beginning) and an I- (inside) label, and all non-entity tokens share a single O (outside) label. With 35 entity types this gives 2 × 35 + 1 = 71 labels.
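As a minimal sketch of the tagging scheme described above, the snippet below builds a BIO label set from a list of entity types and groups a tagged token sequence into entity spans. The entity names are taken from this card; the example sentence, its hand-written tags, and the helper names `build_labels` and `decode_bio` are illustrative assumptions, not the model's actual configuration or output.

```python
from typing import List, Tuple


def build_labels(entity_types: List[str]) -> List[str]:
    """Return the BIO label set: O plus a B-/I- pair for every entity type."""
    labels = ["O"]
    for ent in entity_types:
        labels.extend([f"B-{ent}", f"I-{ent}"])
    return labels


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        # A span starts on B-, or on an I- whose type differs from the open span.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if tag == "O" or starts_new:
            # Close the span that was open, if any.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            if current_type is None:
                current_type = tag[2:]
            current_tokens.append(token)
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Three example types give 2 * 3 + 1 = 7 labels; 35 types give 71 by the same rule.
labels = build_labels(["PERSON_NAME", "POSTAL_ADDRESS", "GENETIC_DATA"])
print(len(labels))  # 7

# Hypothetical tokens and tags, written by hand for illustration.
tokens = ["Jan", "Kowalski", "mieszka", "w", "Krakowie"]
tags = ["B-PERSON_NAME", "I-PERSON_NAME", "O", "O", "B-POSTAL_ADDRESS"]
print(decode_bio(tokens, tags))
```

In practice the model tags subword tokens rather than whole words, so a real decoder would also need to merge subword pieces before or while grouping spans.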

Performance

Aggregate scores across all entity categories (approximate values):

Metric | Score
Overall F1 | ~95%
Precision | ~94%
Recall | ~95%
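For reference, F1 is the harmonic mean of precision and recall. The short sketch below applies that formula to the rounded precision and recall reported above. It assumes all three figures come from the same averaging scheme, which this card does not state, so the result is only expected to agree with the reported F1 up to rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values from the table above.
f1 = f1_score(0.94, 0.95)
print(f"{f1:.3f}")  # 0.945
```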

Per-Category Performance

Performance varies by entity type. Categories with more training examples (like PERSON_NAME, POSTAL_ADDRESS) tend to achieve higher F1 scores, while rarer categories (like GENETIC_DATA, BIOMETRIC_DATA) may have lower recall.

Strongest categories: Personal Identity, Contact & Location, Government IDs

More challenging categories: Health & Biometric (less training data), Special Categories (context-dependent)

Training

ONNX Support

The model is available in ONNX format for efficient inference:
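The original usage snippet is not reproduced here. As a hedged sketch only, one common way to run an ONNX token classification model from Python is through Optimum's ONNX Runtime integration together with the Transformers pipeline. The helper name `load_onnx_pipeline` is hypothetical, and the layout of the ONNX files in the repository is an assumption: if the ONNX file is not at the repository root, the `subfolder` and `file_name` arguments would need to point at it.

```python
from typing import Any, Optional

MODEL_ID = "bardsai/eu-pii-anonimization-multilang"


def load_onnx_pipeline(
    model_id: str = MODEL_ID,
    file_name: Optional[str] = None,
    subfolder: Optional[str] = None,
) -> Any:
    """Build a token-classification pipeline backed by ONNX Runtime."""
    # Imported lazily so this module loads without the optional dependencies
    # (pip install "optimum[onnxruntime]" transformers).
    from optimum.onnxruntime import ORTModelForTokenClassification
    from transformers import AutoTokenizer, pipeline

    # Only pass the location arguments the caller actually supplied.
    location = {}
    if file_name is not None:
        location["file_name"] = file_name
    if subfolder is not None:
        location["subfolder"] = subfolder

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = ORTModelForTokenClassification.from_pretrained(model_id, **location)
    # aggregation_strategy="simple" merges B-/I- pieces into whole entities.
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )
```

Calling `load_onnx_pipeline()` downloads the model, and the returned pipeline can then be applied to a string such as `"Jan Kowalski mieszka w Krakowie."`. With this aggregation strategy each result is a dict with `entity_group`, `word`, `score`, `start`, and `end` keys.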

This demo uses the ONNX quantized version via Transformers.js for browser-based inference.

Limitations

Ethical Considerations