
Unstructured text data is a goldmine for businesses – customer feedback, survey responses, support tickets, internal reports. But these texts frequently contain personally identifiable information (PII): names, email addresses, phone numbers, account numbers, sometimes even health information. Anyone wanting to analyze these texts faces a dilemma: How can analytical value be preserved without violating the GDPR?
The answer is: automated anonymization. This article explains how it works, why manual approaches fail, and what you need to consider to be on the safe side both legally and analytically.
Most organizations have structured data under control: databases with customer master data, CRM systems with clearly defined fields for name, address, email. Deletion or pseudonymization is relatively straightforward here.
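Text data is different. A free-text comment like "I spoke with Mr. Mueller at the Hannover branch yesterday, but my IBAN DE89 3704 0044 0532 0130 00 was still entered incorrectly" contains at least four personal data points: a name (Mr. Mueller), a location (the Hannover branch), a time reference (yesterday), and a bank account number (the IBAN).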
Such texts exist in massive quantities in every organization: in NPS comments, complaint forms, chat logs, internal notes. And they are increasingly being analyzed – with AI tools for text analysis, sentiment analysis, or topic extraction.
Without prior anonymization, this analysis is a GDPR risk. Processing personal data requires a legal basis (Art. 6 GDPR), purpose limitation (Art. 5(1)(b) GDPR), and – for special categories of data such as health information (Art. 9 GDPR) – additional safeguards.
The GDPR clearly distinguishes between anonymization and pseudonymization – and the difference is legally significant:
Pseudonymization: Personal data is replaced by identifiers (e.g., "Mr. Mueller" becomes "Person_42"). The link to the real person remains recoverable in principle – via a mapping table. Pseudonymized data is still considered personal data and falls fully under the GDPR.
Anonymization: Data is modified so that a personal reference can no longer be established with reasonable effort. Anonymized data no longer falls under the GDPR (Recital 26). This makes anonymization the preferred method when text data is to be used for analysis, research, or AI training.
Practical tip: For most text analysis applications, true anonymization is the safer and more practical path. Analysis results (topics, sentiments, trends) don't require a personal reference.
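The difference between the two approaches can be illustrated with a minimal Python sketch. It assumes the names to be replaced are already known; the function names `pseudonymize`, `reidentify`, and `anonymize` are hypothetical and chosen only for this illustration. Pseudonymization keeps a mapping table and is therefore reversible. The anonymization variant keeps no mapping, so the replaced name cannot be restored from the output.

```python
def pseudonymize(text: str, names: list[str], mapping: dict[str, str]) -> str:
    """Replace each name with a stable identifier and record it in `mapping`."""
    for name in names:
        if name not in mapping:
            mapping[name] = f"Person_{len(mapping) + 1}"
        text = text.replace(name, mapping[name])
    return text


def reidentify(text: str, mapping: dict[str, str]) -> str:
    """Reverse pseudonymization using the mapping table."""
    # Longest pseudonyms first, so "Person_10" is not partially hit by "Person_1".
    pairs = sorted(mapping.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, pseudonym in pairs:
        text = text.replace(pseudonym, name)
    return text


def anonymize(text: str, names: list[str], placeholder: str = "[PERSON]") -> str:
    """Replace each name with a generic placeholder; no mapping is kept."""
    for name in names:
        text = text.replace(name, placeholder)
    return text


original = "Mr. Mueller called twice. Mr. Mueller asked for a callback."
names = ["Mr. Mueller"]

mapping: dict[str, str] = {}
pseudonymized = pseudonymize(original, names, mapping)
restored = reidentify(pseudonymized, mapping)
anonymized = anonymize(original, names)

print(pseudonymized)
print(restored)
print(anonymized)
```

Whoever holds `mapping` can restore the original text, which is why pseudonymized data remains personal data. Replacing direct identifiers is only the mechanical part: whether a text counts as anonymous in the legal sense also depends on what the remaining context reveals.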
Many organizations first try to anonymize text data manually – having employees read texts and redact personal data. This approach has severe drawbacks:
Regular expressions (regex) go a step further: patterns like "[A-Z][a-z]+ [A-Z][a-z]+" for names or "DE\d{20}" for IBANs. This works for highly structured PII (email addresses, phone numbers) but fails at:
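As a minimal sketch using Python's standard `re` module, the two patterns quoted above can be run against the sample comment from earlier in this article. The `EMAIL` and `IBAN_SPACED` patterns and the `find_pii` helper are assumptions added for illustration, not part of any particular product.

```python
import re

# Patterns quoted in the article text
NAME_NAIVE = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")
IBAN_STRICT = re.compile(r"DE\d{20}")

# Hypothetical additions for illustration
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# German IBAN written either compactly or in space-separated groups of four
IBAN_SPACED = re.compile(r"\bDE\d{2}(?: ?\d{4}){4} ?\d{2}\b")


def find_pii(text: str) -> dict[str, list[str]]:
    """Return all matches per pattern."""
    return {
        "email": EMAIL.findall(text),
        "iban_strict": IBAN_STRICT.findall(text),
        "iban_spaced": IBAN_SPACED.findall(text),
        "name_naive": NAME_NAIVE.findall(text),
    }


comment = (
    "I spoke with Mr. Mueller at the Hannover branch yesterday, but my IBAN "
    "DE89 3704 0044 0532 0130 00 was still entered incorrectly"
)
hits = find_pii(comment)
print(hits)

# The naive name pattern also fires on capitalized word pairs that are not people
false_positives = NAME_NAIVE.findall("Many thanks to Deutsche Bank")
print(false_positives)
```

On this comment, the strict IBAN pattern finds nothing because the IBAN is written with spaces, and the naive name pattern misses "Mr. Mueller" while matching "Deutsche Bank" in the second sentence. A space-tolerant pattern fixes the IBAN case, but no pattern of this kind can tell a person from a company name.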
The state of the art in automated text anonymization is Named Entity Recognition (NER) – an NLP method that identifies and classifies entities in texts. Modern NER systems are based on transformer models and understand the context of a word:
The anonymization process runs in three steps:
The result: The text remains readable and analyzable, but the personal reference is removed.
Before: "Mrs. Schmidt from Hamburg complained on March 15th that her contract number VN-2024-8837 is not registered in the system."
After: "[PERSON] from [LOCATION] complained on [DATE] that her contract number [CONTRACT_NO] is not registered in the system."
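The replacement step behind this before/after example can be sketched in a few lines of Python. The sketch assumes that an NER model has already returned character spans with labels; here the spans are located by hand with a hypothetical `span` helper, and the `Entity` class and `replace_entities` function are likewise illustrative names rather than any specific library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int   # index of the first character of the entity
    end: int     # index one past the last character
    label: str   # entity type, e.g. "PERSON"


def replace_entities(text: str, entities: list[Entity]) -> str:
    """Replace each detected span with a typed placeholder like [PERSON]."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:ent.start] + f"[{ent.label}]" + text[ent.end:]
    return text


def span(text: str, surface: str, label: str) -> Entity:
    """Stand-in for an NER model: locate a known surface string in the text."""
    start = text.index(surface)
    return Entity(start, start + len(surface), label)


before = (
    "Mrs. Schmidt from Hamburg complained on March 15th that her contract "
    "number VN-2024-8837 is not registered in the system."
)
entities = [
    span(before, "Mrs. Schmidt", "PERSON"),
    span(before, "Hamburg", "LOCATION"),
    span(before, "March 15th", "DATE"),
    span(before, "VN-2024-8837", "CONTRACT_NO"),
]
after = replace_entities(before, entities)
print(after)
```

Typed placeholders keep the sentence structure intact, so downstream topic or sentiment analysis still sees that a person from some location complained about a contract.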
The following personal data regularly appears in free-text fields – and must be removed before analysis:
The Anonymization module in deepsight Cloud was developed specifically for the requirements of German-language text analysis:
Learn more about the Anonymization module and how it fits into your analysis pipeline.
For organizations analyzing text data, the following GDPR articles are particularly relevant:
For market research institutes, the ICC/ESOMAR Code additionally applies, mandating the anonymity of survey participants. AI-based anonymization helps meet this requirement reliably even with large data volumes.
Fear of GDPR violations should not lead to valuable text feedback going unused. With modern NER-based anonymization, personal data can be reliably removed – without destroying the analytical value of the texts.
The sentiments, topics, and patterns in the texts remain fully preserved. Only the personal reference is removed. This allows you to analyze customer feedback, survey responses, and support data in a GDPR-compliant manner – while still gaining full insights.
Learn more about data protection and security at deepsight – or test the anonymization directly.
Try it free now and experience how automated anonymization works in practice.
