Context#
Standard cloud solutions for PII removal, AWS Comprehend and Azure AI Language, show 14% and 37% accuracy respectively on Ukrainian texts.
The reason is simple: both services lack Cyrillic training data, support for Ukrainian transliteration rules, and knowledge of country-specific identifiers (RNOKPP, EDRPOU, passport series). AWS Comprehend is trained primarily on English, Latin-script texts; Azure AI Language has basic Cyrillic support for Russian but doesn't understand Ukrainian specifics. Neither service accounts for the historical change in transliteration: one system was used before 2010, another after resolution №55 of the Cabinet of Ministers of Ukraine (CMU). Volodimir became Volodymyr and Irina became Iryna, but older documents kept the previous transliteration. Cloud services also don't understand the formats of Ukrainian identifiers: RNOKPP with 10 digits, EDRPOU with 8 digits, passport series like AA №123456. To them, these are just numbers without context.
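The identifier formats above are regular enough to sketch with simple patterns. The regexes below are illustrative assumptions for demonstration, not the production model's rules:

```python
import re

# Illustrative regexes for the Ukrainian identifier formats described above.
# These are assumptions for demonstration, not the production model's rules.
PATTERNS = {
    # RNOKPP (individual taxpayer number): exactly 10 digits
    "RNOKPP": re.compile(r"(?<!\d)\d{10}(?!\d)"),
    # EDRPOU (legal-entity registry code): exactly 8 digits
    "EDRPOU": re.compile(r"(?<!\d)\d{8}(?!\d)"),
    # Old-style passport: two Cyrillic letters, optional №, six digits
    "PASSPORT": re.compile(r"[А-ЯІЇЄҐ]{2}\s?№?\s?\d{6}"),
}

def find_identifiers(text):
    """Return (label, match) pairs for every identifier-like span."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits
```

Patterns like these give context-free matches only; the point of the trained model is to disambiguate the cases where "just ten digits" is not actually an RNOKPP.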
Task: build a model that works with real Ukrainian data: Cyrillic, transliteration (modern and pre-2010), mixed texts, legal entities, and medical records. The model must recognize:

- names in Cyrillic and Latin script;
- phone numbers in different formats (+380, 380, 0);
- emails with Ukrainian domains;
- addresses with abbreviations for street, lane, and boulevard;
- financial data: IBAN and card numbers;
- documents: passport, RNOKPP, EDRPOU;
- medical records with diagnoses and medications;
- legal entities with director names.

Critical constraints: the model must run on CPU without a GPU, keep inference under 100 ms, and be deterministic for compliance.
My Role#
I worked as CTO and tech lead of a team with two interns.
60% — Management & Architecture
Defined the technical strategy: knowledge distillation from OpenAI instead of training from scratch, the pipeline architecture, and development priorities. Assigned tasks to the interns, reviewed code and training results, and made architectural decisions. Responsible for keeping the team moving in the right direction and not wasting time on dead-end approaches. Planned sprints, distributed tasks between the interns, ran daily standups, reviewed pull requests, analyzed training metrics, and decided on hyperparameter changes. Chose between competing approaches: fine-tuning vs. training from scratch, RoBERTa vs. BERT vs. DistilBERT, knowledge distillation vs. manual annotation. Justified each decision with metrics and development-time estimates.
30% — Hands-on
Personally responsible for the MLOps infrastructure and production deployment: training environment, CI/CD for models, monitoring, and inference in prod. Also personally developed the transliteration rules, a critical component the interns couldn't build independently. Configured Docker containers for training and inference, wrote model-versioning scripts, integrated Weights & Biases for experiment tracking, deployed a FastAPI service as the production API, and set up latency and throughput monitoring. The transliteration work meant analyzing CMU resolution №55, comparing it with previous standards, testing on real examples from data leaks, and validating edge cases.
10% — Mentoring
Explained transformer fine-tuning principles to the interns, reviewed their code, and analyzed training errors with them. Taught them how to read loss and accuracy metrics, interpret a confusion matrix, debug overfitting and underfitting, and tune hyperparameters (learning rate, batch size, epochs). Showed them how to work with PyTorch and Hugging Face Transformers, write custom datasets and dataloaders, and profile code for speed.
Technical Solution#
Knowledge distillation from OpenAI
Instead of manually annotating the entire dataset, we used OpenAI as a "teacher" to generate labeled examples. This accelerated training-data preparation and improved annotation quality compared to a manual process.
Dataset — real leaks
The model was trained on a real dataset of leaked Ukrainian personal data. This is critical: synthetic data doesn't reflect real-world variability, such as name typos, non-standard phone formats, and mixed Cyrillic-Latin texts.
Transliteration rules by year
A separate component handles transliteration. The rules differ before and after 2010 (CMU resolution №55): Volodimir/Volodymyr, Irina/Iryna. The model recognizes both variants. I developed this component personally.
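A minimal sketch of how such a component can map either transliteration variant back to a canonical name. The mapping contains only the example pairs from this write-up; a real rule set would cover the full alphabet per resolution №55:

```python
# Illustrative lookup of post-2010 vs. pre-2010 transliteration variants.
# Only the example pairs from this write-up; the production rule set is larger.
TRANSLIT_VARIANTS = {
    "Володимир": {"Volodymyr", "Volodimir"},  # post-2010 / pre-2010
    "Ірина": {"Iryna", "Irina"},
}

# Reverse index: any Latin spelling -> canonical Cyrillic name
LATIN_TO_CYRILLIC = {
    latin: cyrillic
    for cyrillic, variants in TRANSLIT_VARIANTS.items()
    for latin in variants
}

def normalize_name(token):
    """Map a Latin-script name in either transliteration back to Cyrillic."""
    return LATIN_TO_CYRILLIC.get(token)
```

Keying both variants to one canonical form means the NER model and any downstream deduplication see "Volodimir" and "Volodymyr" as the same person.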
RoBERTa fine-tuned on PyTorch
The base architecture is roberta-base, fine-tuned for NER with the classes Person, PhoneNumber, Location, SocialMedia, and DocumentID.
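For token classification, the five classes expand into a BIO tag set, and word-level labels must be aligned to RoBERTa's subword tokens. A hedged sketch of that preparation step, where `word_ids` mimics the list Hugging Face tokenizers expose (one entry per subword, `None` for special tokens); this is illustrative, not the exact production code:

```python
# BIO tag set for the five entity classes listed above.
CLASSES = ["Person", "PhoneNumber", "Location", "SocialMedia", "DocumentID"]
LABELS = ["O"] + [f"{p}-{c}" for c in CLASSES for p in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}

def align_labels(word_labels, word_ids):
    """Map word-level BIO labels onto subword tokens.

    Only the first subword of a word keeps its B- tag; continuation
    subwords become I- tags; special tokens get -100 so the loss
    function ignores them.
    """
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)          # special token: <s>, </s>, pad
        elif wid != prev:
            aligned.append(label2id[word_labels[wid]])
        else:
            label = word_labels[wid]
            if label.startswith("B-"):    # continuation subword of an entity
                label = "I-" + label[2:]
            aligned.append(label2id[label])
        prev = wid
    return aligned
```

Getting this alignment right matters directly for the boundary-clipping issue discussed later: if continuation subwords are mislabeled, the model learns to drop entity edges.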
MLOps & deployment — personally
Training infrastructure, model versioning, production API deployment — my responsibility from start to finish.
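Part of that deployment work was enforcing the sub-100 ms CPU latency requirement. A minimal sketch of such a latency gate, with a stub callable standing in for the real model (the function and thresholds are illustrative, not the shipped monitoring code):

```python
import time

def latency_report(predict, samples, runs=50):
    """Measure per-request latency of a model callable on CPU.

    `predict` is any text -> entities callable; here a stub would stand
    in for the real model. Returns (p50, p95) in milliseconds.
    """
    timings = []
    for _ in range(runs):
        for text in samples:
            t0 = time.perf_counter()
            predict(text)
            timings.append((time.perf_counter() - t0) * 1000.0)
    timings.sort()
    p50 = timings[len(timings) // 2]
    p95 = timings[int(len(timings) * 0.95)]
    return p50, p95
```

A check like `assert p95 < 100` in CI turns the "inference under 100 ms" constraint from a slide bullet into a regression test.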
Benchmark Results#
Tested on our own dataset: 10 categories, 64 PII entities, covering Cyrillic, transliteration, mixed texts, financial data, medical records, and legal entities.
| Category | Accuracy |
|---|---|
| Names in Cyrillic | 100% |
| Modern transliteration (post-2010) | 100% |
| Old transliteration (pre-2010) | 100% |
| Phones + email | 90% |
| Complex mixed text | 90% |
| Financial data (IBAN, cards) | 80% |
| Legal entities + director | 80% |
| Medical data | 80% |
| Documents (passport, RNOKPP) | 86% |
| Addresses | 67% |
Overall accuracy: ~76–87%
Comparison with competitors on Ukrainian data
| Tool | Accuracy |
|---|---|
| AWS Comprehend | 14% |
| Azure AI Language | 37% |
| PII Removal (our model) | 76–87% |
Result: +105% vs Azure, +450% vs AWS on Ukrainian texts.
Known Gaps#
Boundary clipping
The model sometimes clips tokens at the beginning or end of an entity. This shows up in short texts without context. In PDF tests with longer fragments the problem shrinks: context helps.
Addresses (67%)
Street names without the word "street" are missed. This requires a separate class and additional training examples.
EDRPOU and passport numbers
Under the Law of Ukraine "On Personal Data Protection", an individual's passport data is personal data, while EDRPOU codes belong to a public state registry. The model doesn't yet distinguish these two classes explicitly; that's the next iteration.
With the boundary-clipping fix and an expanded document classifier, the model should reach 90%+.
What's Next#
- Fix boundary clipping through retraining on edge cases
- Separate DocumentID class with subtypes: passport, RNOKPP, EDRPOU, IPN
- Expand the address dataset with examples lacking explicit markers like "street"
- Support PDFs with Cyrillic without encoding loss (current artifact nnn)
Stack#
- Python: primary language
- PyTorch: ML framework
- RoBERTa: transformer model
- Knowledge Distillation: model optimization
- OpenAI API: data generation
- FastAPI: API framework
- Docker: containerization
- MLOps: deployment
FAQ#
Why not use ready-made AWS or Azure solutions?
AWS Comprehend shows 14% accuracy on Ukrainian texts, Azure AI Language — 37%. Both lack training data with Cyrillic and don't understand Ukrainian identifiers RNOKPP, EDRPOU. Our model achieves 76-87% accuracy thanks to training on real Ukrainian data leaks with support for Cyrillic, transliteration per pre/post-2010 rules, recognition of specific document formats and identifiers. Cloud services cannot process mixed Cyrillic-Latin texts, don't understand variability of Ukrainian names in transliteration, and don't account for peculiarities of personal data formatting in local documents and databases. Additionally, cloud solutions have data privacy limitations: transferring personal data to external servers may violate GDPR and Ukrainian personal data protection legislation.
Why RoBERTa and not GPT?
RoBERTa is a specialized encoder well suited to NER, with fast inference (50–100 ms on CPU). GPT is a generative model that needs more resources and is less predictable for classification. For production PII removal we need speed and accuracy, not text generation. RoBERTa is trained with masked language modeling, which makes it well suited to understanding context and classifying tokens. It consumes less memory, runs faster on CPU without a GPU, and produces deterministic output for a given input, which is critical for compliance and audit. GPT needs a GPU for acceptable speed, can vary its output even at temperature=0, and consumes 3–5x more memory. For classification and NER tasks, RoBERTa delivers better results at lower resource cost.
What is knowledge distillation from OpenAI?
Instead of manually annotating thousands of examples, we used OpenAI as a teacher to generate labeled data, which accelerated dataset preparation and improved annotation quality over a manual process. The process: take unlabeled text from real leaks, send it to the OpenAI API with a PII-recognition prompt, receive labeled entities, validate the result, and add it to the training dataset. This let us create over 10,000 labeled examples in a week instead of months of manual work. Annotation quality came out higher than manual labeling because the teacher model handles context better than a non-specialist human annotator. The knowledge-distillation approach yields a compact model whose quality is close to the large teacher model, at much lower resource cost.
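The steps above can be sketched as a loop with a pluggable teacher. The prompt, response schema, and validation rule below are illustrative assumptions, and the stub stands in for a real OpenAI API call:

```python
import json

# Illustrative prompt and JSON schema; the production prompt differs.
PROMPT = (
    "Mark every PII entity in the text below. "
    'Reply as JSON: [{"text": ..., "label": ...}]\n\n'
)

def distill(texts, teacher):
    """Build a training set by asking a teacher model to label raw texts.

    `teacher` is any callable prompt -> JSON string (e.g. a thin wrapper
    around the OpenAI chat API). Entities whose text does not literally
    occur in the source are dropped: a simple validation step.
    """
    dataset = []
    for text in texts:
        entities = json.loads(teacher(PROMPT + text))
        valid = [e for e in entities if e["text"] in text]
        dataset.append({"text": text, "entities": valid})
    return dataset

# Stub teacher standing in for the API call, for illustration only.
def fake_teacher(prompt):
    return json.dumps([
        {"text": "Iryna", "label": "Person"},
        {"text": "NOT-IN-TEXT", "label": "Person"},  # hallucinated; filtered out
    ])
```

Keeping the teacher behind a plain callable makes the pipeline testable offline and lets the real API client be swapped in at deploy time.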
Why train on real data leaks?
Synthetic data doesn't reflect real variability: name typos, non-standard phone formats, mixed Cyrillic-Latin texts. Real leaks give the model understanding of what data looks like in production. Real data contains typos in names, incomplete addresses, phones without country codes, emails with errors, mixed date formats, transliteration per different standards, outdated document formats. Synthetic datasets generate perfect examples that don't prepare the model for the real world. Training on leaks gives the model robustness to noise, context understanding, and ability to recognize PII even in incorrectly formatted texts. This is critical for working with legacy systems and databases with inconsistent formatting.
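When real examples of a pattern are scarce, that variability can be approximated with noise augmentation. A seeded sketch, where the two noise operations (stripped country code, adjacent-character typo) are illustrative assumptions, not the actual augmentation set:

```python
import random

def add_noise(text, seed=0):
    """Inject leak-style noise: a stripped country code and a character typo."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    # Drop the +380 country code, as often seen in real records.
    text = text.replace("+380", "0", 1)
    # Swap two adjacent characters somewhere to simulate a typo.
    chars = list(text)
    if len(chars) > 2:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

Augmented copies are added alongside, never instead of, the real leaked examples, so the model sees both clean and noisy variants of the same entity.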
What about boundary clipping?
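Independent of retraining, a cheap post-processing mitigation is to snap predicted character spans outward to word boundaries. A hedged sketch (the function is illustrative, not the shipped fix):

```python
def snap_to_word_boundaries(text, start, end):
    """Expand a predicted entity span so it never splits a word.

    If the model tagged only 'chen' inside 'Tkachenko', the span grows
    left and right until it hits whitespace or the edge of the text.
    """
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end
```

For PII removal specifically, over-redacting a few extra characters is the safe failure mode, which makes boundary snapping a reasonable guard even after the model is retrained.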
Model sometimes clips token edges at the beginning and end of entities. This appears in short texts without context. Fix — retraining on edge cases. In longer PDF texts, the problem is less noticeable. The reason is tokenization: RoBERTa splits text into subword tokens, and if an entity starts or ends with a rare subword, the model may miss it. For example, Tkachenko may tokenize as Tka-chen-ko, and the model may miss Tka or ko. Solution: expand training dataset with edge token examples, add augmentation with context clipping, use CRF layer to account for dependencies between tokens. Also planning experiment with different tokenizers for better processing of Ukrainian words.