PII: detection, masking, right-to-be-forgotten

A junior data engineer at a Bengaluru fintech runs a routine export of last quarter's support-ticket archive into a sandbox bucket so the ML team can train a triage model. Three weeks later, an internal red-team scan flags the bucket: thousands of Aadhaar numbers, a few hundred PAN strings, dozens of full bank account numbers, all sitting in the free-text message column that nobody had marked as personally identifiable information. The schema said message: TEXT. The column comment said "user-submitted text". No GDPR-style classification existed for that column because no one believed users would paste their Aadhaar into a complaint about a delayed refund — and yet, hundreds of them did. PII detection, masking, and the right-to-be-forgotten exist because the schema is not the territory. The values are.

PII (personally identifiable information) handling has three tightly-coupled mechanisms: detection (find PII even where the schema lies), masking (transform values so they cannot identify the person), and right-to-be-forgotten (delete a specific person's data on request and prove it). Each one fails alone — masking only works if you found the PII, deletion only works if you can locate every copy. Production data platforms wire all three through a central catalog that watches new tables, scans free-text columns for known PII patterns, and propagates user-level deletion across every downstream system that materialised the source.

What counts as PII, and why detection is harder than the schema suggests

PII is the legal and operational category for any data that — alone or combined with other data — identifies a real person. Direct identifiers are obvious: Aadhaar number, PAN, mobile number, email address, full name with date of birth. Quasi-identifiers are the trap: Latanya Sweeney famously showed that ZIP code + date of birth + gender uniquely identifies 87% of the US population, and the Indian analogue (pin code + date of birth + gender) narrows 1.4 billion people almost as sharply; a column containing latitude-longitude pairs accurate to four decimal places is a home address even if the column is named last_known_location_event_lat. The Indian Digital Personal Data Protection Act (DPDP, 2023) and the EU GDPR both treat the combination as PII, which is what catches teams off guard — no single column looks sensitive, the join of three does.

Detection is harder than schema inspection because the values lie about the schema. A description column legitimately holds free text. A notes column might hold customer-pasted Aadhaar numbers. A metadata.json blob in a JSONB field might have a phone number nested four levels deep that no migration ever mentioned. The only reliable detector is one that reads the values — sampling, regex-matching, and probabilistic classifiers running over actual rows.

[Diagram: PII detection, schema vs values — why schema inspection misses the PII that actually leaks. Schema-only discovery reads CREATE TABLE and column comments: pan VARCHAR(10) and email VARCHAR(255) are flagged, but message TEXT, metadata JSONB, and notes TEXT are missed, because the schema lies and free-text columns hide pasted PANs. A value-based scan samples 10k rows with regex + ML detectors: pan 100% match, email 100% match, message 3.2% Aadhaar, metadata.phone 71% mobile, notes 0.4% PAN. Even 0.4% means thousands of leaked rows on a 1M-row table.]
The schema told the truth about pan. It said nothing useful about message, metadata, or notes. Value-based scans are the only way to know what is actually inside.

Why sampling instead of full-table scans: a 200-million-row support-ticket table costs ₹4,000–₹8,000 in Snowflake credits to scan in full, every day. A 10,000-row reservoir sample run hourly costs ₹2 per scan, catches columns where PII is more than 0.1% of values, and lets you re-scan when the distribution drifts. The trade-off is detection latency on rare PII — if 0.001% of rows have an Aadhaar (200 rows out of 200M), a 10k-row sample will miss it. Production scanners run a small continuous sample plus a weekly full-table sweep on suspect columns.
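The reservoir sample mentioned above is a few lines of code. A minimal sketch using Algorithm R, which draws a uniform k-row sample from a stream without knowing the table's length up front (the 10,000-row default mirrors the sample size in the text):

```python
import random

def reservoir_sample(rows, k=10_000, seed=42):
    """Algorithm R: uniform k-row sample from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # row i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = row
    return reservoir

# Probability that a PII type present in fraction p of rows shows up in the
# sample at all: 1 - (1 - p)**k. For p = 0.001 and k = 10_000 that is ~0.99995;
# for p = 0.00001 (200 rows in 200M) it is only ~0.095 — hence the weekly sweep.
sample = reservoir_sample(range(2_000_000), k=10_000)
```

Because each row's survival probability is k/(i+1) at step i, the final reservoir is a uniform sample regardless of table size, which is what makes the hourly ₹2 scan statistically honest.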

The operational definition that wires up the pipeline: a column is treated as PII if a value-based scanner flags it OR a human has labelled it OR an upstream contract declares it. Three sources of truth, OR'd together, then propagated to downstream consumers as a column tag. The catalog is the join point.
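The OR can be made concrete in a few lines (function and label names here are illustrative, not a specific catalog's API):

```python
def pii_tag(scanner_labels, human_labels, contract_labels):
    """A column is PII if ANY of the three sources of truth says so.
    Union the labels and emit the flat catalog tag."""
    labels = set(scanner_labels) | set(human_labels) | set(contract_labels)
    return ("PII:" + ",".join(sorted(labels))) if labels else "clean"

pii_tag(["MOBILE"], [], ["EMAIL"])   # -> "PII:EMAIL,MOBILE"
pii_tag([], [], [])                  # -> "clean"
```

The union semantics matter: a human label survives even when a later scan finds nothing, so a scanner regression cannot silently untag a column.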

Building a tiny PII detector in 90 lines

The right way to feel how this works is to write the detector. Use Python with a few regex patterns for Indian PII categories, run it over a sample dataset, and read the output.

# pii_detector.py — minimal sample-based PII scanner for Indian context
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

# Validated Indian PII patterns
PII_PATTERNS = {
    "PAN":     re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "AADHAAR": re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b"),
    "MOBILE":  re.compile(r"(?<![0-9])([6-9][0-9]{9})(?![0-9])"),
    "EMAIL":   re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "GSTIN":   re.compile(r"\b[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][0-9][Z][0-9A-Z]\b"),
    "IFSC":    re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
}

@dataclass
class ColumnReport:
    column: str
    sampled: int
    hits: Dict[str, int]
    rate: Dict[str, float]
    verdict: str

def scan_column(name: str, values: Iterable[str], threshold: float = 0.01) -> ColumnReport:
    sampled = 0
    hits = Counter()
    for v in values:
        if v is None:
            continue
        sampled += 1
        s = str(v)
        for label, pat in PII_PATTERNS.items():
            if pat.search(s):
                hits[label] += 1
    rate = {k: hits[k] / sampled for k in hits}
    flagged = [k for k, r in rate.items() if r >= threshold]
    verdict = ("PII:" + ",".join(flagged)) if flagged else "clean"
    return ColumnReport(name, sampled, dict(hits), rate, verdict)

# Demo: simulate three columns from a Cred-style support_tickets table
ticket_messages = [
    "Refund not processed for txn 28291",
    "My PAN is ABCDE1234F please update",
    "Call me on 9876543210, urgent",
    "Aadhaar 2345 6789 0123 attached as proof",
    "App keeps crashing on Pixel 7",
    "Email me at riya.k@example.in",
    "Did not receive cashback",
] * 200   # 1,400 rows total

ticket_emails = [f"user{i}@example.in" for i in range(1400)]
ticket_amounts = [str(100 + i * 7) for i in range(1400)]

for col, vals in [("message", ticket_messages),
                  ("email", ticket_emails),
                  ("amount", ticket_amounts)]:
    r = scan_column(col, vals)
    print(f"{r.column:10s} sampled={r.sampled:5d} verdict={r.verdict}")
    for label, n in r.hits.items():
        print(f"   {label:8s}  hits={n:5d}  rate={r.rate[label]:.3f}")
# Output:
message    sampled= 1400 verdict=PII:PAN,MOBILE,AADHAAR,EMAIL
   PAN       hits=  200  rate=0.143
   MOBILE    hits=  200  rate=0.143
   AADHAAR   hits=  200  rate=0.143
   EMAIL     hits=  200  rate=0.143
email      sampled= 1400 verdict=PII:EMAIL
   EMAIL     hits= 1400  rate=1.000
amount     sampled= 1400 verdict=clean

Walk through the load-bearing pieces. The PII_PATTERNS dict is the catalog of detectors. Each is a regex tuned to the Indian format: PAN is exactly 5 letters + 4 digits + 1 letter; Aadhaar is 12 digits with optional spaces but the first digit cannot be 0 or 1 (UIDAI's allocation rule); mobile is 10 digits starting with 6, 7, 8, or 9. Why the leading-digit guard on mobile and Aadhaar: without it, any 10-digit invoice number trips the mobile detector and any 12-digit timestamp-derived ID trips Aadhaar. Production false-positive rates drop from ~6% to ~0.3% just by encoding the leading-digit rule.

The threshold: float = 0.01 parameter on scan_column is the policy lever: a column is flagged if at least 1% of sampled values match a PII pattern. Why a threshold and not "any single match": every long free-text column will have at least one regex coincidence (a string that happens to look like a PAN). Requiring 1% of rows to match drowns out the noise without missing systemic PII contamination.

The dataclass ColumnReport is the structured output the catalog ingests: sampled count, per-label hits, per-label rate, and a single verdict string the downstream policy engine can branch on. The verdict format PII:PAN,MOBILE,AADHAAR,EMAIL is the catalog tag — when this lands on a column, every downstream system that materialises that column inherits the tag and the masking-policy registry knows which masks to apply. Why one verdict string instead of a structured object: catalogs (DataHub, Atlan, Amundsen) often store column tags as flat strings indexed for search. The tag has to round-trip through their ingestion pipeline without losing structure, and a comma-separated PII type list is the lowest common denominator that survives.

The amount column proves the negative case: pure numeric data, no PII pattern matches, verdict is clean.
Production scanners extend this with named-entity recognition (spaCy, Presidio) for unstructured Indian-language text, plus context windows so a digit string near the word "Aadhaar" gets a higher score than the same string in isolation.
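The context-window idea can be prototyped cheaply. A minimal sketch, assuming a flat score boost when a trigger word appears within 30 characters of the digit match (the scores, window size, and word list are invented for illustration; Presidio's actual scoring model differs):

```python
import re

AADHAAR_RE = re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b")
CONTEXT_WORDS = ("aadhaar", "uidai", "uid no")

def aadhaar_score(text: str, window: int = 30) -> float:
    """Base score for a regex hit, boosted when a context word is nearby."""
    m = AADHAAR_RE.search(text)
    if not m:
        return 0.0
    score = 0.4                      # a bare 12-digit string could be a coincidence
    lo = max(0, m.start() - window)
    hi = m.end() + window
    if any(w in text[lo:hi].lower() for w in CONTEXT_WORDS):
        score += 0.5                 # "Aadhaar" near the digits: almost certainly real
    return score

aadhaar_score("Aadhaar 2345 6789 0123 attached as proof")   # 0.9
aadhaar_score("order ref 2345 6789 0123")                   # 0.4, below a 0.8 cut-off
```

The same digit string scores differently depending on its neighbourhood, which is exactly the lever that keeps invoice numbers out of the PII report.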

Microsoft's Presidio, AWS Macie, GCP DLP, and Snowflake's classification engine all run roughly this loop, with order-of-magnitude more detectors and a probabilistic scoring model on top. The shape — sample, scan, threshold, tag — does not change.

Masking — the four flavours and when each is right

Once a column is tagged PII, the question becomes "what does the consumer see?" Four flavours, ordered by reversibility.

Redaction (irreversible). Replace the value with a fixed placeholder: XXXXX-MASKED-XXXXX. Information is gone. Use when the consumer never legitimately needs the value and the column is included only because dropping it breaks a join. Cheap, simple, audit-friendly.

Partial masking (irreversible). Show some characters, hide the rest: XXXXX1234L for PAN, +91-XXXXX-43210 for mobile, XXXX-XXXX-XXXX-3456 for card number. Use when partial visibility is operationally useful — a fraud analyst recognises a customer by the last 4 digits of their card without seeing the full number. The information loss is calibrated to the threat model.

Hashing (irreversible but joinable). Replace the value with a deterministic one-way hash: SHA256(pan + tenant_salt). The hash is the same every time the same input is hashed (with the same salt), so two tables with the same PAN can still be joined on the hashed value. The PAN itself cannot be recovered from the hash without a brute-force attack on the input space — but that space is small: a PAN has at most 26^6 × 10^4 ≈ 3 × 10^12 naive combinations, fewer once its internal structure is exploited, so a determined attacker with a GPU can exhaust it in hours. Use when joinability matters more than privacy strength; pair with a per-tenant or per-table salt to limit blast radius.

Tokenisation (reversible by privileged role). Replace the value with a random token (tok_8c2f01...) and store the mapping token → real_value in a dedicated, heavily-restricted vault. The downstream warehouse only ever sees tokens. Privileged roles (compliance, customer support) can de-tokenise via a vault API. Use when some role legitimately needs the original value, but most do not, and you want to physically remove the original from the analytical store. This is the Razorpay / Cred / Paytm pattern — the warehouse contains tokens, the vault contains values, and de-tokenisation is logged per-call so auditors can prove who looked at what.
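All four flavours fit in a short sketch. The vault below is an in-memory dict where a real deployment uses an HSM-backed service; ToyVault, the tok_ prefix, and the "compliance" role name are invented for the example:

```python
import hashlib
import secrets

def redact(value: str) -> str:
    return "XXXXX-MASKED-XXXXX"                 # irreversible, information gone

def partial_mask_pan(pan: str) -> str:
    return "X" * 5 + pan[5:]                    # keep trailing 4 digits + check letter

def hash_pan(pan: str, tenant_salt: str) -> str:
    return hashlib.sha256((pan + tenant_salt).encode()).hexdigest()  # joinable

class ToyVault:
    """Tokenisation: random token in the warehouse, real value in the vault."""
    def __init__(self):
        self._store = {}
    def tokenise(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(6)
        self._store[token] = value
        return token
    def detokenise(self, token: str, role: str) -> str:
        if role != "compliance":                # de-tokenisation is privileged + logged
            raise PermissionError("role not permitted to de-tokenise")
        return self._store[token]

vault = ToyVault()
t = vault.tokenise("ABCDE1234F")
# vault.detokenise(t, role="analyst")     -> PermissionError
# vault.detokenise(t, role="compliance")  -> "ABCDE1234F"
```

Note the asymmetry: redact and partial_mask_pan destroy information locally, hash_pan preserves joinability via determinism, and only ToyVault can hand the original back.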

[Diagram: four masking flavours arranged left-to-right by reversibility, from irreversible (safer) to reversible (more useful). Redaction: value → fixed string (XXXXX-MASKED); use when the column is kept only to preserve the schema; not joinable, not reversible. Partial mask: keep last N characters (XXXXX1234L); use when an analyst needs to recognise the entity; not joinable, not reversible. Hashing: SHA256(value + salt) → a3f9...e7b2; use when a cross-table join on identity is needed; joinable, reversible only by brute force. Tokenisation: value → token plus vault entry (tok_8c2f01); use when privileged roles need the original back; joinable, reversible via the vault API.]
Pick the leftmost flavour that still satisfies the consumer. Every step right widens the blast radius if the warehouse is breached.

Why hashing is "irreversible but joinable" rather than "reversible": SHA256 with a strong per-tenant salt is computationally hard to brute-force at population scale, but the input spaces are small enough to exhaust with serious hardware: roughly 3 × 10^12 candidates for a 10-character PAN, ~10^12 for a 12-digit Aadhaar, ~10^10 for a mobile number. Treat hashing as "reversible against a determined nation-state, irreversible against everyone else", and never rely on it as the sole control for the most sensitive identifiers; pair it with role-based access on the hashed column.
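A back-of-envelope calculator makes the asymmetry concrete. It assumes a single modern GPU sustains on the order of 10^10 SHA-256 hashes per second, which is a rough, hardware-dependent figure, and the candidate-space sizes are the naive estimates from the text:

```python
def brute_force_hours(input_space: float, hashes_per_sec: float = 1e10) -> float:
    """Worst-case hours to exhaust an input space at a given hash rate."""
    return input_space / hashes_per_sec / 3600

spaces = {
    "mobile":  4e9,            # 10 digits, first digit in 6-9
    "aadhaar": 8e11,           # 12 digits, first digit in 2-9 (ignoring checksum)
    "pan":     26**6 * 1e4,    # 5+1 letters, 4 digits, naive space
}
for label, space in spaces.items():
    print(f"{label:8s} ~{brute_force_hours(space):8.4f} h to exhaust")
```

Every line comes out well under a day, which is the whole point: an unsalted or weakly-salted hash over a structured Indian identifier is a speed bump, not a wall.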

The format-preserving variants — FPE (format-preserving encryption) and DLP-style format-preserving tokenisation — produce tokens that look like the original (a tokenised PAN is still a 10-character string matching the PAN regex). Useful when downstream systems validate format. The trade-off is implementation complexity; FPE algorithms (FF1, FF3) are export-controlled in some jurisdictions and have their own attack history.

Right-to-be-forgotten — the deletion graph that has to actually find every copy

The DPDP Act 2023 (India), GDPR Article 17 (EU), and California's CCPA all give an individual the right to demand deletion of their personal data. The legal text says "delete the data". The engineering reality is "find every copy of this person's data, in every system, including the ones you don't think about, and prove they are gone".

The hard part is the graph. A user's PAN starts in auth.users. CDC streams it to Kafka, where it lands in three Iceberg tables in the lake. dbt reads two of those and builds five marts. The ML team has trained a fraud model on the marts, and the model's training set is checkpointed in S3. The data warehouse runs nightly snapshots; last week's snapshot still has the row. Last quarter's snapshot has it too. The audit system replicates every read of the user's row to a separate region for compliance.

Deleting the original auth.users row does nothing about any of those copies. A right-to-be-forgotten engine has to walk the lineage graph forwards from the source, identify every materialised copy, and either delete or anonymise the user's data at each node — and produce an audit trail that proves it.

# rtbf_engine.py — minimal right-to-be-forgotten propagation
from dataclasses import dataclass, field
from typing import List, Dict, Set
from datetime import datetime

@dataclass
class Asset:
    name: str
    upstreams: List[str] = field(default_factory=list)
    pii_columns: List[str] = field(default_factory=list)
    deletion_strategy: str = "delete_row"   # or "anonymize", "tombstone"

class RTBFEngine:
    def __init__(self, lineage: Dict[str, Asset]):
        self.lineage = lineage
        self.audit: List[dict] = []

    def downstream_of(self, source: str) -> Set[str]:
        result = set()
        stack = [source]
        while stack:
            node = stack.pop()
            for name, asset in self.lineage.items():
                if node in asset.upstreams and name not in result:
                    result.add(name)
                    stack.append(name)
        return result

    def execute(self, source: str, subject_id: str):
        targets = {source} | self.downstream_of(source)
        for t in sorted(targets):
            asset = self.lineage[t]
            self.audit.append({
                "ts": datetime.utcnow().isoformat() + "Z",
                "subject_id": subject_id,
                "asset": t,
                "strategy": asset.deletion_strategy,
                "pii_columns": asset.pii_columns,
                "status": "applied",
            })
        return self.audit

# Lineage for a Razorpay-style user-data graph
lineage = {
    "auth.users":              Asset("auth.users", [], ["pan", "email", "mobile"]),
    "kafka.users_cdc":         Asset("kafka.users_cdc", ["auth.users"], ["pan", "email"]),
    "lake.users_raw":          Asset("lake.users_raw", ["kafka.users_cdc"], ["pan", "email"]),
    "marts.user_360":          Asset("marts.user_360", ["lake.users_raw"], ["pan_hash"], "anonymize"),
    "marts.fraud_features":    Asset("marts.fraud_features", ["marts.user_360"], [], "tombstone"),
    "ml.fraud_training_v3":    Asset("ml.fraud_training_v3", ["marts.fraud_features"], [], "tombstone"),
    "audit.read_log":          Asset("audit.read_log", ["auth.users"], ["pan"], "anonymize"),
}

engine = RTBFEngine(lineage)
trail = engine.execute("auth.users", subject_id="user_8c2f01")
for entry in trail:
    print(f"{entry['ts'][:19]}  {entry['asset']:25s}  {entry['strategy']:10s}  cols={entry['pii_columns']}")
# Output:
2026-04-25T08:42:19  audit.read_log             anonymize   cols=['pan']
2026-04-25T08:42:19  auth.users                 delete_row  cols=['pan', 'email', 'mobile']
2026-04-25T08:42:19  kafka.users_cdc            delete_row  cols=['pan', 'email']
2026-04-25T08:42:19  lake.users_raw             delete_row  cols=['pan', 'email']
2026-04-25T08:42:19  marts.fraud_features       tombstone   cols=[]
2026-04-25T08:42:19  marts.user_360             anonymize   cols=['pan_hash']
2026-04-25T08:42:19  ml.fraud_training_v3       tombstone   cols=[]

The shape of the engine. Asset.upstreams is the lineage edge: every materialised table records the source it was built from. The catalog, dbt, or a column-level lineage tracker (OpenLineage, Marquez) populates this. downstream_of walks forwards from the source, transitive-closure-style, to find every node that derives from auth.users.

Why a forward graph walk and not a reverse one: the deletion request names the source ("the user table") and propagates outwards. A reverse walk from "every table" looking for the user's data would scan the entire warehouse on every request; that is what teams without lineage end up doing, and why their RtBF SLAs miss the 30-day GDPR deadline.

The deletion_strategy per asset is the policy choice. For the source row, delete it. For derived ML training sets, you cannot delete a single row from a frozen training set without retraining the model, so you tombstone the training set (mark it as containing a deletion request, retrain on a fresh extract). For aggregated marts where the user is one row in a million, anonymise their identifier rather than deleting the row (the aggregate stays correct).

Why three strategies and not just "delete": some downstream assets are physically immutable (Iceberg snapshot files, ML model artefacts, audit logs). You cannot rewrite the bytes of last quarter's audit log without breaking the audit chain. You can mark them as containing a deletion request and ensure no future read returns the deleted user's data; that is what "tombstone" means in this context.

The audit trail is the legal artefact: every entry has timestamp, subject_id, asset, strategy, and status. Regulators ask "prove the user's data was deleted across every system"; the audit trail is the answer.

The catch the engine does not show: distributed systems' eventual consistency means the deletion has to propagate at the speed of the slowest downstream pipeline. If the ML team rebuilds their training set monthly, the user's data persists in the next training set until the rebuild — your RtBF SLA has to either accept that latency or trigger an immediate rebuild. Razorpay's published incident postmortems mention exactly this trade-off; their RtBF SLA is 30 days because their longest downstream pipeline is a 21-day rebuild cadence.

Where the right-to-be-forgotten breaks down

Three failure modes show up consistently in production.

Backups and snapshots. Last quarter's S3 snapshot has the user's PAN. You cannot rewrite it without breaking the snapshot's hash chain (used by some compliance frameworks). The standard pattern is to apply a deletion-key-encryption layer: encrypt every user's data with a per-user key at write time; on RtBF, delete the key. The user's bytes are still in the backup, but undecryptable. This is "crypto-shredding" — well-supported in S3 SSE-KMS, GCP CMEK, and Azure customer-managed keys. The downside is the operational complexity of per-user keys at scale.
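A toy sketch of crypto-shredding. The keystream is built from SHA-256 purely for illustration; production systems use AES-GCM behind a real KMS (S3 SSE-KMS, GCP CMEK), and the class and method names here are invented:

```python
import hashlib
import os

class CryptoShredder:
    """Per-user keys; deleting the key makes the user's bytes undecryptable.
    TOY keystream cipher — not real cryptography, illustration only."""
    def __init__(self):
        self._keys = {}                          # user_id -> key (the KMS tier)

    def _keystream(self, key: bytes, n: int) -> bytes:
        out, counter = b"", 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def encrypt(self, user_id: str, plaintext: bytes) -> bytes:
        key = self._keys.setdefault(user_id, os.urandom(32))
        ks = self._keystream(key, len(plaintext))
        return bytes(a ^ b for a, b in zip(plaintext, ks))

    def decrypt(self, user_id: str, ciphertext: bytes) -> bytes:
        if user_id not in self._keys:
            raise KeyError("key shredded; data is unrecoverable")
        ks = self._keystream(self._keys[user_id], len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, ks))

    def shred(self, user_id: str):
        self._keys.pop(user_id, None)            # the RtBF action: delete the key

shredder = CryptoShredder()
blob = shredder.encrypt("user_8c2f01", b"PAN=ABCDE1234F")
assert shredder.decrypt("user_8c2f01", blob) == b"PAN=ABCDE1234F"
shredder.shred("user_8c2f01")                    # backup bytes persist; key is gone
```

After shred, every backup and snapshot still physically contains the ciphertext, but no read path can turn it back into a PAN, which is the property the regulator actually cares about.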

Embedded values in models and aggregates. A fraud model trained on 200 million users has the user's data baked into the weights. You cannot "delete" their contribution without retraining. GDPR jurisprudence is still evolving here — most regulators accept that a sufficiently aggregated model (10M+ training examples per output dimension) does not constitute personal data, but the bar moves. The engineering pattern is to retrain on a regular cadence and re-derive aggregates from the post-deletion source.

Foreign keys and join graphs. Deleting auth.users.user_id = 'user_8c2f01' orphans every row in payments, support_tickets, and audit.read_log that referenced the user. You either cascade the delete (which removes legitimate business records the company is required to keep for tax / KYC reasons) or anonymise the foreign key (replace user_id with a tombstone marker like deleted_user). Indian KYC rules require retaining transaction records for 5 years after account closure even if the user requests deletion — so the practical pattern is to delete identifying columns (PAN, name, mobile) from the user row but retain the row itself with anonymised identifier, and let the foreign keys point at the anonymised tombstone.
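The retain-but-anonymise pattern is a small transformation. A sketch, where the column names follow the examples above and the tombstoned flag is an assumed convention rather than a standard:

```python
def anonymise_user_row(row: dict,
                       identifying=("pan", "name", "mobile", "email")) -> dict:
    """Keep the row (its user_id stays a valid foreign-key target for
    KYC-mandated child records) but null every identifying column."""
    out = dict(row)
    for col in identifying:
        if col in out:
            out[col] = None
    out["tombstoned"] = True
    return out

user = {"user_id": "user_8c2f01", "pan": "ABCDE1234F",
        "name": "Riya K", "mobile": "9876543210", "created": "2021-03-04"}
anonymise_user_row(user)
# {'user_id': 'user_8c2f01', 'pan': None, 'name': None, 'mobile': None,
#  'created': '2021-03-04', 'tombstoned': True}
```

payments and support_tickets rows keep a working join key, the 5-year retention obligation is met, and the person is no longer identifiable from the row.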

The general lesson: right-to-be-forgotten is not a delete; it is an obligation to ensure the person can no longer be identified from the data. The engineering must satisfy that obligation while also satisfying every other obligation the company has — financial-record retention, audit logs, ML model continuity. The tension is real and the law does not pretend it is not.

Going deeper

Differential privacy as a complement to masking

Masking hides individual values. Differential privacy hides individual contributions to aggregates. When a query asks "average transaction value for users in Karnataka", a differentially-private engine adds calibrated noise to the result so the analyst cannot infer whether any specific user was in the group. Apple uses local differential privacy for keyboard suggestions; the US Census Bureau uses central differential privacy for the 2020 census. For Indian data platforms, differential privacy is the right tool when an analyst legitimately needs aggregates over a sensitive column — count of transactions per state, average loan amount per income bucket — but should not be able to infer individual rows. The standard library is OpenDP (formerly SmartNoise) and Google's differential-privacy libraries.

k-anonymity and l-diversity for shared datasets

When a dataset is shared with an external partner — fraud-detection consortium, research team, ML vendor — masking individual columns is not enough. The combination of quasi-identifiers can re-identify rows. k-anonymity guarantees that every released row is indistinguishable from at least k-1 others on the quasi-identifier set; l-diversity adds that the sensitive attribute has at least l distinct values within each k-group. Bengaluru-based research-data-sharing programmes (academic medical-records collaborations, the Aadhaar research-access programme) use k-anonymity with k between 5 and 50 depending on the sensitivity of the underlying data. The mechanism is generalisation (replace age with age band, pin code with district) and suppression (drop rare-attribute rows).
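Generalisation and suppression can be sketched directly, using age bands and pin-code prefixes as the generalised quasi-identifiers (the band width, prefix length, and column names are illustrative):

```python
from collections import Counter

def generalise(row: dict) -> tuple:
    """Generalisation: age -> 10-year band, 6-digit pin -> 3-digit prefix."""
    return (row["age"] // 10 * 10, row["pin"][:3])

def is_k_anonymous(rows, k: int) -> bool:
    groups = Counter(generalise(r) for r in rows)
    return all(n >= k for n in groups.values())

def suppress_rare(rows, k: int):
    """Suppression: drop rows whose generalised group is smaller than k."""
    groups = Counter(generalise(r) for r in rows)
    return [r for r in rows if groups[generalise(r)] >= k]

rows = [{"age": 34, "pin": "560034"}, {"age": 38, "pin": "560095"},
        {"age": 31, "pin": "560001"}, {"age": 52, "pin": "560078"}]
is_k_anonymous(rows, k=3)                    # False: the 50s band has one row
is_k_anonymous(suppress_rare(rows, 3), k=3)  # True after suppressing it
```

The released dataset trades one suppressed row and coarser columns for the guarantee that no released row is unique on the quasi-identifier set.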

Crypto-shredding and per-user keys at scale

Crypto-shredding (encrypt every user's data with a per-user key, delete the key on RtBF) is operationally elegant — the user's bytes become undecryptable without rewriting any storage — but the per-user key tier needs careful design. A naive implementation has one row per user in a key-management system; at 100M users this is a 100M-row KMS with millisecond-latency reads on every data access, and the KMS becomes the bottleneck. The Razorpay-style production pattern is hierarchical: a tenant-level master key, a user-level data key derived from the master key plus the user_id and a per-user salt, with the salt deletable. Shredding the salt makes the user key unrecoverable while keeping the KMS small. The trade-off is that key derivation runs on every read.
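A sketch of the hierarchical pattern, using HMAC-SHA256 as the derivation function; the class and field names are invented, and a real design would use HKDF with the master key held in an HSM:

```python
import hashlib
import hmac
import os

class HierarchicalKeys:
    """Tenant master key + small per-user salt table.
    user_key = HMAC(master, user_id || salt): shredding the salt makes the
    user key underivable while the KMS stores only the master + salts."""
    def __init__(self, master_key: bytes):
        self.master = master_key
        self.salts = {}                           # user_id -> deletable salt

    def user_key(self, user_id: str) -> bytes:
        salt = self.salts.setdefault(user_id, os.urandom(16))
        return hmac.new(self.master, user_id.encode() + salt,
                        hashlib.sha256).digest()

    def shred(self, user_id: str):
        self.salts.pop(user_id, None)             # RtBF: old key unrecoverable

keys = HierarchicalKeys(os.urandom(32))
k1 = keys.user_key("user_8c2f01")
assert keys.user_key("user_8c2f01") == k1         # stable while salt exists
keys.shred("user_8c2f01")
assert keys.user_key("user_8c2f01") != k1         # fresh salt, different key
```

The cost the text mentions shows up on the user_key line: derivation runs on every read, but the KMS never grows to one row per user's data key.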

The Indian regulatory stack — DPDP, RBI, IRDAI

The DPDP Act 2023 is the primary horizontal data-protection law, but sectoral regulations layer on top. RBI's Master Direction on Digital Payment Security Controls requires payment data localisation (the data has to live on Indian soil) and prescribes specific masking standards for card data (PCI-DSS-aligned). IRDAI requires insurance companies to retain claim records for 8 years post-claim resolution, even after a customer requests RtBF. SEBI requires investment-account records for 8+ years. Building a right-to-be-forgotten engine for an Indian company that touches payments, insurance, or investments means the deletion graph has different rules per data category — payments data may be partially deletable but cannot leave India, insurance claims must be retained for 8 years even after RtBF requests, investment records stay even longer. The engine has to know the regulatory category of every dataset it touches.
