AI Data Privacy: PII Detection, Data Anonymization, and Local Processing
Introduction
AI applications process vast amounts of data, much of it containing personally identifiable information (PII). Sending raw PII to LLM APIs creates compliance risks under GDPR, CCPA, and other regulations. This article covers practical techniques for detecting and redacting PII, anonymizing training data, and processing sensitive information locally.
PII Detection
Automated detection identifies sensitive data before it reaches an LLM API:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize Presidio engines (Presidio loads its spaCy NLP engine internally)
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def detect_pii(text: str) -> list[dict]:
    results = analyzer.analyze(
        text=text,
        entities=[
            "PHONE_NUMBER", "EMAIL_ADDRESS",
            "CREDIT_CARD", "US_SSN", "PERSON",
            "LOCATION", "DATE_TIME", "NRP",
            "US_BANK_NUMBER", "IP_ADDRESS",
        ],
        language="en",
    )
    return [
        {"entity": r.entity_type, "start": r.start, "end": r.end,
         "score": r.score, "text": text[r.start:r.end]}
        for r in results
    ]

def redact_pii(text: str) -> str:
    analyzer_results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=analyzer_results).text
Presidio combines pattern-based detection (regex for credit cards, SSNs, phone numbers) with NLP-based detection (spaCy for person names, locations, organizations). This dual approach catches both structured and unstructured PII.
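When Presidio is unavailable or too heavyweight, the pattern-based half of this dual approach can be approximated with standard-library regexes alone. The patterns below are illustrative, not production-grade (a real detector would add validation such as Luhn checks for card numbers):

```python
import re

# Illustrative patterns only; real detectors need broader formats and validation
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def detect_pii_regex(text: str) -> list[dict]:
    """Return findings in the same shape as detect_pii()."""
    findings = []
    for entity, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append({"entity": entity, "start": m.start(),
                             "end": m.end(), "text": m.group()})
    return findings
```

Regex-only detection misses unstructured PII such as names and locations, which is exactly the gap the NLP-based half fills.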
Data Anonymization
For training data or analytics, full removal may be too destructive. Anonymization preserves utility while protecting privacy:
from faker import Faker
import hashlib

fake = Faker()

class DataAnonymizer:
    def __init__(self):
        # Cache real -> fake mappings so the same value always gets
        # the same pseudonym across records
        self.mapping_cache: dict[tuple[str, str], str] = {}

    def anonymize_record(self, record: dict, pii_fields: list[str]) -> dict:
        anonymized = record.copy()
        for field in pii_fields:
            if field in anonymized and anonymized[field]:
                anonymized[field] = self._replace_value(field, anonymized[field])
        return anonymized

    def _replace_value(self, field: str, value: str) -> str:
        key = (field, value)
        if key in self.mapping_cache:
            return self.mapping_cache[key]
        generators = {
            "email": fake.email,
            "phone": fake.phone_number,
            "name": fake.name,
            "address": fake.address,
            "ssn": fake.ssn,
        }
        if field in generators:
            replacement = generators[field]()
        else:
            # Tokenization: stable pseudonym via hashing
            hashed = hashlib.sha256(value.encode()).hexdigest()[:16]
            replacement = f"USER_{hashed}"
        self.mapping_cache[key] = replacement
        return replacement

# Differential privacy: add calibrated noise
import numpy as np

def add_laplace_noise(true_value: float, epsilon: float = 1.0) -> float:
    """Add Laplace noise for differential privacy (assumes sensitivity 1).
    Lower epsilon = more privacy, less accuracy."""
    scale = 1.0 / epsilon
    return true_value + np.random.laplace(0, scale)
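To make the epsilon trade-off concrete, a quick simulation (a sketch using NumPy's default generator; the `noisy_count` helper is illustrative) compares the mean absolute error of a noisy count query at several epsilon values:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_count(true_count: int, epsilon: float) -> float:
    # Laplace mechanism for a count query (sensitivity 1)
    return true_count + rng.laplace(0, 1.0 / epsilon)

# Smaller epsilon -> wider noise -> more privacy, less accuracy
errors = {eps: np.mean([abs(noisy_count(100, eps) - 100) for _ in range(1000)])
          for eps in (0.1, 1.0, 10.0)}
```

The expected absolute error of the Laplace mechanism equals its scale 1/epsilon, so the simulated error drops roughly tenfold at each step from 0.1 to 1.0 to 10.0.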
Anonymization Strategies
| Technique | Privacy Level | Utility | Use Case |
|-----------|--------------|---------|----------|
| Removal | High | Low | Irreversible redaction |
| Masking | Medium | Medium | Partial visibility (e.g. "****-1234") |
| Pseudonymization | Medium | High | Replace with fake equivalent |
| Generalization | Medium | Medium | ZIP 94301 -> 9430x |
| Differential Privacy | High | Medium | Statistical queries |
| Tokenization | High | High | Deterministic replacement |
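Two of the techniques in the table, masking and generalization, fit in a few lines. This is a minimal sketch; the helper names and formats are illustrative:

```python
def mask_card(number: str, visible: int = 4) -> str:
    """Mask all but the last `visible` digits: '4111 1111 1111 1234' -> '****-1234'."""
    digits = [c for c in number if c.isdigit()]
    return "****-" + "".join(digits[-visible:])

def generalize_zip(zip_code: str, keep: int = 4) -> str:
    """Wildcard trailing digits: '94301' -> '9430x'."""
    return zip_code[:keep] + "x" * (len(zip_code) - keep)
```

Note that generalization trades precision for anonymity: the fewer digits kept, the larger the group of people each generalized value could refer to.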
Local Processing
For maximum privacy, process sensitive data locally without sending it to external APIs:
from transformers import pipeline

class LocalTextProcessor:
    def __init__(self):
        # Load small models for local inference
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=-1,  # CPU
        )
        self.ner = pipeline(
            "ner",
            model="dslim/bert-base-NER",
            device=-1,
        )
        self.summarizer = pipeline(
            "summarization",
            model="facebook/bart-large-cnn",
            device=-1,
        )

    def process_sensitive_data(self, text: str, task: str) -> dict:
        # All processing happens locally; nothing leaves this machine
        if task == "classify":
            return {"label": self.classifier(text)[0]["label"]}
        elif task == "extract_entities":
            return {"entities": self.ner(text)}
        elif task == "summarize":
            return {"summary": self.summarizer(
                text, max_length=130, min_length=30)[0]["summary_text"]}
        else:
            raise ValueError(f"Unknown task: {task}")
Hybrid Approach
For complex tasks requiring powerful LLMs, strip PII before sending, then re-integrate after:
def redact_with_map(user_text: str) -> tuple[str, dict[str, str]]:
    """Replace each detected PII span with a placeholder and return
    the redacted text plus a placeholder -> original mapping."""
    entities = detect_pii(user_text)
    pii_map = {}
    redacted = user_text
    # Replace from the end of the string so earlier offsets stay valid
    for i, entity in enumerate(
            sorted(entities, key=lambda e: e["start"], reverse=True)):
        placeholder = f"[{entity['entity']}_{i}]"
        pii_map[placeholder] = entity["text"]
        redacted = redacted[:entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted, pii_map

def safe_llm_call(user_text: str) -> str:
    # Step 1: Detect and redact PII; keep the mapping local
    redacted_text, pii_map = redact_with_map(user_text)
    # Step 2: Send only the redacted text to the LLM
    response = call_llm(f"Process this text: {redacted_text}")
    # Step 3: Restore PII locally if the response echoes placeholders.
    # Skip this step when the LLM only needs to process the structure.
    for placeholder, original in pii_map.items():
        response = response.replace(placeholder, original)
    return response
Compliance Checklist
data_privacy_audit:
  pre_processing:
    - Scan all inputs for PII before API calls
    - Implement automatic redaction
    - Log all detected PII types (not values)
  in_transit:
    - Use TLS 1.3 for all API calls
    - Never log raw API payloads
    - Implement data retention limits
  storage:
    - Encrypt all stored data at rest
    - Never store raw PII in logs
    - Implement automatic purging schedules
  user_rights:
    - Support data deletion requests
    - Provide data portability exports
    - Maintain processing records for audits
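The pre_processing items can be enforced at the API boundary. Below is a hypothetical gate (the function name and callable parameters are illustrative; the detector and redactor are stand-ins for functions like those earlier in this article) that logs only detected PII entity types, never the values:

```python
import logging
from typing import Callable

logger = logging.getLogger("privacy_audit")

def audit_and_redact(text: str,
                     detector: Callable[[str], list[dict]],
                     redactor: Callable[[str], str]) -> str:
    """Gate an outbound API call: log detected PII *types* only,
    never the raw values, then return the redacted text."""
    findings = detector(text)
    if findings:
        types = sorted({f["entity"] for f in findings})
        logger.info("PII detected: %s", ", ".join(types))  # types, not values
    return redactor(text)
```

Routing every outbound LLM call through one gate like this makes the "scan before API calls" and "log types, not values" items auditable in a single place.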
Conclusion
AI data privacy requires proactive protection rather than reactive compliance. Detect and redact PII before any external API call. Use local models for sensitive processing when possible. Implement a hybrid approach for complex tasks: strip PII before cloud LLM inference, and never log raw user data. Regular privacy audits ensure that your protection measures stay effective as your application evolves.