Context#
Standard cloud solutions for PII removal, AWS Comprehend and Azure AI Language, show 14% and 37% accuracy respectively on Ukrainian texts.
The reason is simple: both services lack Cyrillic training data, support for Ukrainian transliteration rules, and knowledge of country-specific identifiers (RNOKPP, EDRPOU, passport series). AWS Comprehend is trained primarily on English, Latin-script texts; Azure AI Language has basic Cyrillic support for Russian but doesn't understand Ukrainian specifics. Neither service accounts for the historical change in transliteration: one system was used before 2010, another after resolution №55 of the Cabinet of Ministers of Ukraine (CMU). Volodimir became Volodymyr and Irina became Iryna, but older documents kept the previous transliteration. Cloud services also don't understand the formats of Ukrainian identifiers: RNOKPP with 10 digits, EDRPOU with 8 digits, passport series like AA №123456. To them, these are just numbers without context.
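The identifier formats above are regular enough to sketch with simple patterns. The regexes below are illustrative assumptions for demonstration, not the production model's rules:

```python
import re

# Illustrative regexes for the Ukrainian identifier formats described above.
# These are assumptions for demonstration, not the production model's rules.
PATTERNS = {
    # RNOKPP (individual taxpayer number): exactly 10 digits
    "RNOKPP": re.compile(r"(?<!\d)\d{10}(?!\d)"),
    # EDRPOU (legal-entity registry code): exactly 8 digits
    "EDRPOU": re.compile(r"(?<!\d)\d{8}(?!\d)"),
    # Old-style passport: two Cyrillic letters, optional №, six digits
    "PASSPORT": re.compile(r"[А-ЯІЇЄҐ]{2}\s?№?\s?\d{6}"),
}

def find_identifiers(text):
    """Return (label, match) pairs for every identifier-like span."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits
```

Patterns like these give context-free matches only; the point of the trained model is to disambiguate the cases where "just ten digits" is not actually an RNOKPP.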
Task: build a model that works with real Ukrainian data: Cyrillic, transliteration (modern and pre-2010), mixed texts, legal entities, and medical records. The model must recognize:

- names in Cyrillic and Latin script;
- phone numbers in different formats (+380, 380, 0);
- emails with Ukrainian domains;
- addresses with abbreviations for street, lane, and boulevard;
- financial data: IBAN and card numbers;
- documents: passport, RNOKPP, EDRPOU;
- medical records with diagnoses and medications;
- legal entities with director names.

Critical constraints: the model must run on CPU without a GPU, keep inference under 100 ms, and be deterministic for compliance.
My Role#
I worked as CTO and tech lead of a team with two interns.
60% — Management & Architecture
Defined the technical strategy: knowledge distillation from OpenAI instead of training from scratch, the pipeline architecture, and development priorities. Assigned tasks to the interns, reviewed code and training results, and made architectural decisions. Responsible for keeping the team moving in the right direction and not wasting time on dead-end approaches. Planned sprints, distributed tasks between the interns, ran daily standups, reviewed pull requests, analyzed training metrics, and decided on hyperparameter changes. Chose between competing approaches: fine-tuning vs. training from scratch, RoBERTa vs. BERT vs. DistilBERT, knowledge distillation vs. manual annotation. Justified each decision with metrics and development-time estimates.
30% — Hands-on
Personally responsible for the MLOps infrastructure and production deployment: training environment, CI/CD for models, monitoring, and inference in prod. Also personally developed the transliteration rules, a critical component the interns couldn't build independently. Configured Docker containers for training and inference, wrote model-versioning scripts, integrated Weights & Biases for experiment tracking, deployed a FastAPI service as the production API, and set up latency and throughput monitoring. The transliteration work meant analyzing CMU resolution №55, comparing it with previous standards, testing on real examples from data leaks, and validating edge cases.
10% — Mentoring
Explained transformer fine-tuning principles to the interns, reviewed their code, and analyzed training errors with them. Taught them how to read loss and accuracy metrics, interpret a confusion matrix, debug overfitting and underfitting, and tune hyperparameters (learning rate, batch size, epochs). Showed them how to work with PyTorch and Hugging Face Transformers, write custom datasets and dataloaders, and profile code for speed.
Technical Solution#
Knowledge distillation from OpenAI
Instead of manually annotating the entire dataset, we used OpenAI as a "teacher" to generate labeled examples. This accelerated training-data preparation and improved annotation quality compared to a manual process.
Dataset — real leaks
The model was trained on a real dataset of leaked Ukrainian personal data. This is critical: synthetic data doesn't reflect real-world variability, such as name typos, non-standard phone formats, and mixed Cyrillic-Latin texts.
Transliteration rules by year
A separate component handles transliteration. The rules differ before and after 2010 (CMU resolution №55): Volodimir/Volodymyr, Irina/Iryna. The model recognizes both variants. I developed this component personally.
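A minimal sketch of how such a component can map either transliteration variant back to a canonical name. The mapping contains only the example pairs from this write-up; a real rule set would cover the full alphabet per resolution №55:

```python
# Illustrative lookup of post-2010 vs. pre-2010 transliteration variants.
# Only the example pairs from this write-up; the production rule set is larger.
TRANSLIT_VARIANTS = {
    "Володимир": {"Volodymyr", "Volodimir"},  # post-2010 / pre-2010
    "Ірина": {"Iryna", "Irina"},
}

# Reverse index: any Latin spelling -> canonical Cyrillic name
LATIN_TO_CYRILLIC = {
    latin: cyrillic
    for cyrillic, variants in TRANSLIT_VARIANTS.items()
    for latin in variants
}

def normalize_name(token):
    """Map a Latin-script name in either transliteration back to Cyrillic."""
    return LATIN_TO_CYRILLIC.get(token)
```

Keying both variants to one canonical form means the NER model and any downstream deduplication see "Volodimir" and "Volodymyr" as the same person.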
RoBERTa fine-tuned on PyTorch
The base architecture is roberta-base, fine-tuned for NER with the classes Person, PhoneNumber, Location, SocialMedia, and DocumentID.
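For token classification, the five classes expand into a BIO tag set, and word-level labels must be aligned to RoBERTa's subword tokens. A hedged sketch of that preparation step, where `word_ids` mimics the list Hugging Face tokenizers expose (one entry per subword, `None` for special tokens); this is illustrative, not the exact production code:

```python
# BIO tag set for the five entity classes listed above.
CLASSES = ["Person", "PhoneNumber", "Location", "SocialMedia", "DocumentID"]
LABELS = ["O"] + [f"{p}-{c}" for c in CLASSES for p in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}

def align_labels(word_labels, word_ids):
    """Map word-level BIO labels onto subword tokens.

    Only the first subword of a word keeps its B- tag; continuation
    subwords become I- tags; special tokens get -100 so the loss
    function ignores them.
    """
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)          # special token: <s>, </s>, pad
        elif wid != prev:
            aligned.append(label2id[word_labels[wid]])
        else:
            label = word_labels[wid]
            if label.startswith("B-"):    # continuation subword of an entity
                label = "I-" + label[2:]
            aligned.append(label2id[label])
        prev = wid
    return aligned
```

Getting this alignment right matters directly for the boundary-clipping issue discussed later: if continuation subwords are mislabeled, the model learns to drop entity edges.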
MLOps & deployment — personally
Training infrastructure, model versioning, production API deployment — my responsibility from start to finish.
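Part of that deployment work was enforcing the sub-100 ms CPU latency requirement. A minimal sketch of such a latency gate, with a stub callable standing in for the real model (the function and thresholds are illustrative, not the shipped monitoring code):

```python
import time

def latency_report(predict, samples, runs=50):
    """Measure per-request latency of a model callable on CPU.

    `predict` is any text -> entities callable; here a stub would stand
    in for the real model. Returns (p50, p95) in milliseconds.
    """
    timings = []
    for _ in range(runs):
        for text in samples:
            t0 = time.perf_counter()
            predict(text)
            timings.append((time.perf_counter() - t0) * 1000.0)
    timings.sort()
    p50 = timings[len(timings) // 2]
    p95 = timings[int(len(timings) * 0.95)]
    return p50, p95
```

A check like `assert p95 < 100` in CI turns the "inference under 100 ms" constraint from a slide bullet into a regression test.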
Benchmark Results#
Tested on our own dataset: 10 categories, 64 PII entities, covering Cyrillic, transliteration, mixed texts, financial data, medical records, and legal entities.
| Category | Accuracy |
|---|---|
| Names in Cyrillic | 100% |
| Modern transliteration (post-2010) | 100% |
| Old transliteration (pre-2010) | 100% |
| Phones + email | 90% |
| Complex mixed text | 90% |
| Financial data (IBAN, cards) | 80% |
| Legal entities + director | 80% |
| Medical data | 80% |
| Documents (passport, RNOKPP) | 86% |
| Addresses | 67% |
Overall accuracy: ~76–87%
Comparison with competitors on Ukrainian data
| Tool | Accuracy |
|---|---|
| AWS Comprehend | 14% |
| Azure AI Language | 37% |
| PII Removal (our model) | 76–87% |
Result: +105% vs Azure, +450% vs AWS on Ukrainian texts.
Known Gaps#
Boundary clipping
The model sometimes clips tokens at the beginning or end of an entity. This shows up in short texts without context. In PDF tests with longer fragments the problem shrinks: context helps.
Addresses (67%)
Street names without the word "street" are missed. This requires a separate class and additional training examples.
EDRPOU and passport numbers
Under the Law of Ukraine "On Personal Data Protection", an individual's passport data is personal data, while EDRPOU codes belong to a public state registry. The model doesn't yet distinguish these two classes explicitly; that's the next iteration.
With the boundary-clipping fix and an expanded document classifier, the model should reach 90%+.
What's Next#
- Fix boundary clipping through retraining on edge cases
- Separate DocumentID class with subtypes: passport, RNOKPP, EDRPOU, IPN
- Expand the address dataset with examples lacking explicit markers like "street"
- Support PDFs with Cyrillic without encoding loss (current artifact nnn)
Stack#
- Python: primary language
- PyTorch: ML framework
- RoBERTa: transformer model
- Knowledge Distillation: model optimization
- OpenAI API: data generation
- FastAPI: API framework
- Docker: containerization
- MLOps: deployment
FAQ#
Why not use ready-made AWS or Azure solutions?
AWS Comprehend shows 14% accuracy on Ukrainian texts, Azure AI Language — 37%. Both lack training data with Cyrillic and don't understand Ukrainian identifiers RNOKPP, EDRPOU. Our model achieves 76-87% accuracy thanks to training on real Ukrainian data leaks with support for Cyrillic, transliteration per pre/post-2010 rules, recognition of specific document formats and identifiers. Cloud services cannot process mixed Cyrillic-Latin texts, don't understand variability of Ukrainian names in transliteration, and don't account for peculiarities of personal data formatting in local documents and databases. Additionally, cloud solutions have data privacy limitations: transferring personal data to external servers may violate GDPR and Ukrainian personal data protection legislation.
Why RoBERTa and not GPT?
RoBERTa is a specialized encoder well suited to NER, with fast inference (50–100 ms on CPU). GPT is a generative model that needs more resources and is less predictable for classification. For production PII removal we need speed and accuracy, not text generation. RoBERTa is trained with masked language modeling, which makes it well suited to understanding context and classifying tokens. It consumes less memory, runs faster on CPU without a GPU, and produces deterministic output for a given input, which is critical for compliance and audit. GPT needs a GPU for acceptable speed, can vary its output even at temperature=0, and consumes 3–5x more memory. For classification and NER tasks, RoBERTa delivers better results at lower resource cost.
What is knowledge distillation from OpenAI?
Instead of manually annotating thousands of examples, we used OpenAI as a teacher to generate labeled data, which accelerated dataset preparation and improved annotation quality over a manual process. The process: take unlabeled text from real leaks, send it to the OpenAI API with a PII-recognition prompt, receive labeled entities, validate the result, and add it to the training dataset. This let us create over 10,000 labeled examples in a week instead of months of manual work. Annotation quality came out higher than manual labeling because the teacher model handles context better than a non-specialist human annotator. The knowledge-distillation approach yields a compact model whose quality is close to the large teacher model, at much lower resource cost.
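The steps above can be sketched as a loop with a pluggable teacher. The prompt, response schema, and validation rule below are illustrative assumptions, and the stub stands in for a real OpenAI API call:

```python
import json

# Illustrative prompt and JSON schema; the production prompt differs.
PROMPT = (
    "Mark every PII entity in the text below. "
    'Reply as JSON: [{"text": ..., "label": ...}]\n\n'
)

def distill(texts, teacher):
    """Build a training set by asking a teacher model to label raw texts.

    `teacher` is any callable prompt -> JSON string (e.g. a thin wrapper
    around the OpenAI chat API). Entities whose text does not literally
    occur in the source are dropped: a simple validation step.
    """
    dataset = []
    for text in texts:
        entities = json.loads(teacher(PROMPT + text))
        valid = [e for e in entities if e["text"] in text]
        dataset.append({"text": text, "entities": valid})
    return dataset

# Stub teacher standing in for the API call, for illustration only.
def fake_teacher(prompt):
    return json.dumps([
        {"text": "Iryna", "label": "Person"},
        {"text": "NOT-IN-TEXT", "label": "Person"},  # hallucinated; filtered out
    ])
```

Keeping the teacher behind a plain callable makes the pipeline testable offline and lets the real API client be swapped in at deploy time.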
Why train on real data leaks?
Synthetic data doesn't reflect real variability: name typos, non-standard phone formats, mixed Cyrillic-Latin texts. Real leaks give the model understanding of what data looks like in production. Real data contains typos in names, incomplete addresses, phones without country codes, emails with errors, mixed date formats, transliteration per different standards, outdated document formats. Synthetic datasets generate perfect examples that don't prepare the model for the real world. Training on leaks gives the model robustness to noise, context understanding, and ability to recognize PII even in incorrectly formatted texts. This is critical for working with legacy systems and databases with inconsistent formatting.
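When real examples of a pattern are scarce, that variability can be approximated with noise augmentation. A seeded sketch, where the two noise operations (stripped country code, adjacent-character typo) are illustrative assumptions, not the actual augmentation set:

```python
import random

def add_noise(text, seed=0):
    """Inject leak-style noise: a stripped country code and a character typo."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    # Drop the +380 country code, as often seen in real records.
    text = text.replace("+380", "0", 1)
    # Swap two adjacent characters somewhere to simulate a typo.
    chars = list(text)
    if len(chars) > 2:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

Augmented copies are added alongside, never instead of, the real leaked examples, so the model sees both clean and noisy variants of the same entity.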
What about boundary clipping?
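Independent of retraining, a cheap post-processing mitigation is to snap predicted character spans outward to word boundaries. A hedged sketch (the function is illustrative, not the shipped fix):

```python
def snap_to_word_boundaries(text, start, end):
    """Expand a predicted entity span so it never splits a word.

    If the model tagged only 'chen' inside 'Tkachenko', the span grows
    left and right until it hits whitespace or the edge of the text.
    """
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end
```

For PII removal specifically, over-redacting a few extra characters is the safe failure mode, which makes boundary snapping a reasonable guard even after the model is retrained.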
Model sometimes clips token edges at the beginning and end of entities. This appears in short texts without context. Fix — retraining on edge cases. In longer PDF texts, the problem is less noticeable. The reason is tokenization: RoBERTa splits text into subword tokens, and if an entity starts or ends with a rare subword, the model may miss it. For example, Tkachenko may tokenize as Tka-chen-ko, and the model may miss Tka or ko. Solution: expand training dataset with edge token examples, add augmentation with context clipping, use CRF layer to account for dependencies between tokens. Also planning experiment with different tokenizers for better processing of Ukrainian words.