PII (Personally Identifiable Information) refers to any data that can directly or indirectly identify a natural person: name, address, date of birth, social security number, email address, bank details, biometric data, IP addresses.

Can personal data be sent to an LLM?

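Only with a legal basis and in compliance with GDPR principles – particularly data minimization (Art. 5(1)(c) GDPR). Roundtrip pseudonymization supports this: the model sees pseudonyms in place of the detected identifiers, not the real values. Pseudonymized data still counts as personal data under the GDPR, so the legal basis remains necessary.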

What is the difference between anonymization and pseudonymization?

Anonymization irreversibly removes the personal reference. Pseudonymization replaces identifiers with pseudonyms, while a separately stored mapping table keeps re-identification possible. For LLM processing, pseudonymization followed by re-identification is the appropriate approach: the model sees only pseudonyms, and the real data is restored in the result.

Does this work with self-hosted models?

Yes – and it is advisable even there. Self-hosted environments may have tenant-specific separation requirements: different departments, different clients, different privacy levels. Pseudonymization is model-agnostic.

PII Anonymization for Enterprise AI

Why Personal Data Is a Problem for AI Processing

When an AI agent analyzes an employment contract, reviews a payroll statement, or processes a sick note, it works with personal data. Name, address, date of birth, social security number, salary, diagnosis.

Sending this data to a language model – even a self-hosted one – creates GDPR compliance risk. The regulation requires data minimization (Art. 5(1)(c)): only data necessary for the purpose may be processed. Classifying a document type does not require an employee’s name. Checking salary band compliance does not require a date of birth.

Yet the model needs context. A contract stripped of all personal information is useless for AI analysis – the references, relations, and connections are missing.

The solution is not redaction, but pseudonymization.

Roundtrip Pseudonymization: The Principle

Roundtrip pseudonymization is a three-step process:

Step 1: Detect and Replace. The pre-processing layer identifies the personal data in the document. Each PII instance is replaced with a consistent pseudonym: “John Smith” becomes “Person_A”, “£85,000” becomes “Salary_A”, “10 Downing Street” becomes “Address_A”. Critically, pseudonyms are consistent – if “John Smith” appears again on page 3, he remains “Person_A”. This keeps the references and relations within the document intact, so the model can still follow who is connected to what.

Step 2: Process. The pseudonymized document is sent to the language model. The model sees: “Person_A has Salary_A at Address_A. The contract runs until 2027.” It can perform contract analysis, salary band checks, clause classification – without ever seeing a real name or salary.

Step 3: Re-identify. The model’s output contains pseudonyms: “Person_A falls within salary band E3.” The re-identification layer replaces the pseudonyms with the real data: “John Smith falls within salary band E3.” The mapping table is deleted after processing.

What the Decision Layer Controls

Not every data field requires pseudonymization. The Decision Layer defines which PII categories are detected and replaced – governed by versioned rulesets:

For an HR process: pseudonymize names, salaries, addresses, social security numbers. Job titles and departments can remain – they are relevant for analysis and usually do not identify a person on their own.

For a finance process: company names remain, contact persons are pseudonymized, amounts remain (required for booking decisions), bank details are pseudonymized.

For a compliance process: pseudonymize everything – including company names, if the analysis should be cross-organizational.

These rules are tenant-specific and versioned. When a works council agreement (Betriebsvereinbarung) changes, a new rule version is created. During an audit, it is traceable which PII rule in which version was applied at the time of processing.
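One way to picture tenant-specific, versioned rulesets is as immutable records plus an audit entry that stores the version applied. The sketch below is illustrative only: the class PiiRuleset, the registry RULESETS, the tenant name "acme", and the category names are all assumptions, not the actual Decision Layer schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PiiRuleset:
    """One immutable version of a tenant's PII rules for a given process."""
    tenant: str
    process: str
    version: int
    pseudonymize: frozenset  # categories replaced with pseudonyms
    keep: frozenset          # categories left in clear text


# Hypothetical registry keyed by (tenant, process); older versions are kept for audits.
RULESETS = {
    ("acme", "hr"): [
        PiiRuleset("acme", "hr", 1,
                   frozenset({"name", "salary", "address"}),
                   frozenset({"job_title", "department"})),
        PiiRuleset("acme", "hr", 2,
                   frozenset({"name", "salary", "address", "social_security_number"}),
                   frozenset({"job_title", "department"})),
    ],
    ("acme", "finance"): [
        PiiRuleset("acme", "finance", 1,
                   frozenset({"contact_person", "bank_details"}),
                   frozenset({"company_name", "amount"})),
    ],
}


def current_ruleset(tenant: str, process: str) -> PiiRuleset:
    """Return the highest version registered for this tenant and process."""
    return max(RULESETS[(tenant, process)], key=lambda r: r.version)


def audit_record(document_id: str, ruleset: PiiRuleset, processed_at=None) -> dict:
    """Record which ruleset version was applied to a document, and when."""
    timestamp = processed_at or datetime.now(timezone.utc)
    return {
        "document_id": document_id,
        "tenant": ruleset.tenant,
        "process": ruleset.process,
        "ruleset_version": ruleset.version,
        "processed_at": timestamp.isoformat(),
    }


rules = current_ruleset("acme", "hr")
entry = audit_record("contract-0001", rules)
```

Because old versions are never overwritten, an auditor can look up the exact rules named in an audit entry even after the ruleset has changed.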

Limitations and Honest Assessment

PII detection is not perfect. Named Entity Recognition (NER) makes errors – particularly with:

Ambiguous names: “Baker” can be a surname or an occupation. “London” can be a city or a surname. The Decision Layer addresses this through confidence routing: high confidence triggers automatic pseudonymization. Low confidence escalates to a human.

Implicit identifiers: “The only female developer in the Hamburg office” contains no explicit PII but identifies a person. Such indirect identifiers are difficult to detect automatically. The approach: context rules in the ruleset define which combinations of attributes enable identification.

New document types: When a new document type enters processing, the PII ruleset must be reviewed and potentially extended. This is not a one-time setup but an ongoing process.
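Confidence routing and context rules can be expressed as two small checks. The following is a sketch under assumptions: the 0.90 threshold, the Detection record, and the attribute names in CONTEXT_RULES are invented for illustration, and in practice the attributes present in a passage would have to come from an upstream extraction step.

```python
from dataclasses import dataclass

AUTO_THRESHOLD = 0.90  # assumed cut-off; in practice it would be tuned per ruleset


@dataclass
class Detection:
    text: str
    category: str
    confidence: float  # detector score between 0 and 1


def route(detection: Detection, threshold: float = AUTO_THRESHOLD) -> str:
    """High confidence is pseudonymized automatically; low confidence goes to a human."""
    if detection.confidence >= threshold:
        return "pseudonymize"
    return "human_review"


# Hypothetical context rules: attribute combinations that may single out one person
# even though no single attribute is explicit PII.
CONTEXT_RULES = [
    frozenset({"gender", "job_title", "office_location"}),
]


def matched_context_rules(attributes_present) -> list:
    """Return every rule whose attributes all occur together in the passage."""
    present = set(attributes_present)
    return [rule for rule in CONTEXT_RULES if rule <= present]


clear_case = route(Detection("John Smith", "Person", 0.98))
ambiguous_case = route(Detection("Baker", "Person", 0.55))
flagged = matched_context_rules({"gender", "job_title", "office_location"})
```

Here "The only female developer in the Hamburg office" would carry gender, job title, and office location together, so the single rule matches and the passage can be flagged for review.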
