November 27, 2025
LLMs and privacy: protecting user data when you use AI
You've integrated an LLM into your software, or you're about to, and at some point the uncomfortable question comes up: what happens to the user data that flows through those prompts? It's the right question, because with LLMs and privacy the line between correct use and a GDPR violation runs through specific technical choices, not statements of principle. Let's look at which ones.
The base principle: minimize what you send to the model
The most effective rule is also the simplest: the model should receive only the data needed for the task. Before building the prompt, ask yourself what the model needs to answer well. In most cases the answer does not include the user's name, email, phone number or tax code.
In practice, in the software we develop we apply a few recurring measures:
- pseudonymization before the call: identifying data is replaced with placeholders ("CLIENTE_1") before the text is sent to the model, and the original values are restored in the response;
- field filters: sensitive database fields are excluded by construction from the context passed to the model, so human error alone isn't enough to leak them;
- watch out for free text: user notes, emails and messages can contain personal data anywhere; if the use case requires them, you need detection filters or explicit warnings for whoever writes them.
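As an illustration of the first measure, here is a minimal sketch of pseudonymization around a model call. Everything in it is an assumption made for the example: the function names `pseudonymize` and `restore`, the `EMAIL_RE` pattern, and the idea that you already have a list of known customer names from your database. It covers names and email addresses only; real free text needs broader detection.

```python
import re

# Hypothetical pattern that catches common email shapes only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(text, known_names):
    """Mask emails and known names; return the masked text and the mapping."""
    mapping = {}

    def placeholder_for(value, prefix):
        # Reuse the same placeholder when a value appears more than once.
        for key, original in mapping.items():
            if original == value:
                return key
        count = sum(k.startswith(prefix) for k in mapping) + 1
        key = f"{prefix}_{count}"
        mapping[key] = value
        return key

    # Emails first, so a name embedded in an address is masked with it.
    text = EMAIL_RE.sub(lambda m: placeholder_for(m.group(0), "EMAIL"), text)
    # Plain substring match: a simplification, short names may hit other words.
    for name in known_names:
        if name in text:
            text = text.replace(name, placeholder_for(name, "CLIENTE"))
    return text, mapping


def restore(text, mapping):
    """Put the original values back into the model's response."""
    # Longer placeholders first, so CLIENTE_10 is not clobbered by CLIENTE_1.
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text
```

In use, only the masked text goes into the prompt, and `restore` runs on whatever the model returns. The mapping stays in your own process and is never sent to the provider.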
Minimization pays twice: it reduces legal risk and it reduces tokens, which means costs.
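The field-filter measure can be sketched as an allowlist: the context for the model is built from an explicit list of permitted fields, so a field added to the database later stays out of the prompt until someone deliberately adds it. The `Customer` record, its fields and the `PROMPT_FIELDS` list are hypothetical, chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Customer:  # hypothetical record, for illustration only
    id: int
    full_name: str
    email: str
    tax_code: str
    plan: str
    open_tickets: int


# Allowlist: only these fields may ever reach the prompt.
PROMPT_FIELDS = ("plan", "open_tickets")


def prompt_context(customer):
    """Build the model context from allowlisted fields only."""
    return {name: getattr(customer, name) for name in PROMPT_FIELDS}
```

An allowlist fails safe where a blocklist does not: forgetting to update it hides a field from the model instead of leaking it.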
Cloud or on-premise: the criteria for choosing
The choice between cloud APIs and a model installed on your own servers isn't ideological; it depends on the data being processed and the resources available.
The APIs of the major providers are the fastest route and give access to the best models. On the privacy front, the points to verify before signing are concrete: whether your data is used to train the models (business plans usually exclude it, but you have to read the contract), where it is processed and stored, how long the provider retains request logs, and whether a data processing agreement (DPA) is available to attach to your compliance records.
An open model running on your own infrastructure gives you full control: the data never leaves your systems. The price is complexity: servers with adequate GPUs, the skills to manage them, and model quality that is often below the best cloud services. It makes sense when you handle particularly delicate data, such as health or judicial records, when the volumes justify the investment, or when internal or industry policies forbid data from leaving your infrastructure.
There is also a middle road: use the cloud for tasks on non-sensitive data and keep in-house only the processing that touches critical data. It's often the best compromise for an SME.
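One way to picture the middle road is a small router that decides, per task, which backend handles the request. This is only a sketch: the task names, the two sensitivity levels and the backend labels are invented for the example, and a real system would call the actual clients behind those labels.

```python
# Hypothetical task classification, maintained by whoever owns compliance.
TASK_SENSITIVITY = {
    "summarize_public_docs": "low",
    "draft_marketing_copy": "low",
    "triage_medical_notes": "critical",
}


def choose_backend(task):
    """Return "cloud" for low-sensitivity tasks, "on_premise" otherwise."""
    # Unknown tasks count as critical, so forgetting to classify
    # a new task keeps its data in-house instead of sending it out.
    level = TASK_SENSITIVITY.get(task, "critical")
    return "cloud" if level == "low" else "on_premise"
```

Defaulting unknown tasks to the in-house model is the design choice that matters here: the mistake you are most likely to make (not classifying a new feature) costs you quality, not a data transfer.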
Consent, privacy notices and legal bases
Integrating an LLM that processes personal data also needs to be framed on the documentation side, working with whoever handles your compliance. The points to line up:
- updated privacy notice: if user data passes through an AI vendor, the vendor must be listed among the recipients or data processors;
- legal basis: clarify on what basis you process data for the AI feature, and if specific consent is needed, collect it separately, without hiding it in the general terms;
- automated decisions: if the model's output affects decisions that matter for the person, the GDPR (Article 22) provides specific safeguards, starting with the possibility of human intervention;
- record of processing activities: the AI feature is a processing activity like any other and must be recorded.
A tip from practice: write down in black and white, even on an internal page, which data enters the model, with which vendor and with which filters. When a client or the data protection authority asks questions, having the answer ready changes the tone of the conversation.
Logs, retention and the dark side of debugging
There's a point that almost always slips through: logs. For debugging it's natural to save prompts and responses, but those logs contain exactly the data you're trying to protect. Define from the start what you log (metadata and request identifiers are better than full texts), how long you keep the logs and who can read them. The same goes for third-party monitoring tools: every service that sees the prompts is one more data recipient to record.
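A minimal sketch of metadata-only logging, assuming Python's standard `logging` module and a hypothetical logger name `llm_calls`. The field names are illustrative; the point is that the entry describes the call without containing the prompt or the response.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_calls")  # hypothetical logger name


def build_log_entry(prompt, response, model, started_at):
    """Describe an LLM call without storing its text."""
    return {
        "request_id": str(uuid.uuid4()),
        "model": model,
        # Sizes help with debugging and cost tracking; the text itself is dropped.
        "prompt_chars": len(prompt),
        "response_chars": len(response),
        "latency_ms": round((time.monotonic() - started_at) * 1000),
    }


def log_call(prompt, response, model, started_at):
    """Write the metadata entry as one JSON line and return it."""
    entry = build_log_entry(prompt, response, model, started_at)
    logger.info(json.dumps(entry))
    return entry
```

The `request_id` lets you match a user report to a specific call. If you ever need full texts for a hard bug, enable them for a limited time and a limited audience, and write down the retention period.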
Finally, test the system from the attacker's point of view too: prompt injection can lead a model to reveal context data it wasn't supposed to expose. Limit what the model can see by construction, because defensive instructions in the prompt are not enough.
Integrating AI without risky shortcuts
We develop custom software with integrated AI features, and data protection is part of the project from the architecture up: minimization, infrastructure choice, log and consent management. If you want to bring an LLM into your product or your business software without exposing your users' data, book a free call: we'll analyze your case and propose a sustainable architecture.
