Usage tracking and lineage for access decisions

A data-platform engineer at a Bengaluru fintech opens a Jira ticket on Monday morning: "Riya from growth-analytics needs access to payments.transactions for a churn model." The default move — grant SELECT on the whole table — adds another row to the access list, another auditor question next quarter, and another column the security team has to assume Riya might exfiltrate. The better move is to ask the warehouse two questions: which columns of transactions does anyone in growth-analytics actually query, and which downstream tables does growth-analytics already reach transactions through? Usage tracking answers the first; column-level lineage answers the second. Together they turn access decisions from "who asked?" into "what is the smallest grant that answers their actual question?".

Access decisions are usually made on the request, not the evidence. Usage tracking (warehouse query logs aggregated per user, table, and column) and column-level lineage (the graph of which columns derive from which) provide the evidence — they show what columns people actually read, through which downstream tables, on what cadence. The combination shrinks grants to the minimum that answers the question and surfaces dormant permissions for revocation before the next audit.

Why the default grant is a leak

The shape of the typical access flow at a 300-engineer Indian fintech: an analyst asks for "access to payments data", a manager approves, the platform engineer grants SELECT on payments.transactions, and the table has 47 columns including merchant_pan, card_last4, customer_aadhaar_hash, and customer_mobile. The analyst needed three columns — txn_id, amount_paise, merchant_id — to build a churn segmentation. They got 47. The other 44 sit in their query surface for 18 months, until the auditor asks who has access to customer_mobile and the security team reads back a list of 280 names that includes Riya from growth-analytics, who never queried customer_mobile once.

This is not a malice failure; it is a measurement failure. The grant decision was made when nobody had data on what would actually be queried. Usage tracking inverts the order — make the grant tentative, observe the actual queries for two weeks, then narrow the grant to the columns the user actually touched.

Figure: default grant vs evidence-based grant. Two stacked flow diagrams. Top, the default flow (the leak): request "need payments" → manager approval → grant the whole table; 47 columns granted, 3 used, 44 exposed as audit risk for 18 months. Bottom, the evidence-based flow (the fix): request → tentative grant with a 14-day audit window → observe every column read → narrow to the 3 actual columns; PII never enters the query surface. A lineage side-note shows a request closed with zero new columns because the user already reads analytics.churn_features, which derives from txn_id and amount, so the downstream view already covers it.
The default flow grants surface area before observing need. The evidence-based flow inverts the order — grant tentatively, observe the queries, narrow.

Why "44 columns exposed for 18 months" is structurally worse than "44 columns exposed for one query": exposure surface is integrated over time. A column the user can read today and never touches is still in their attack surface — if their laptop is compromised next year, the attacker queries it. SOC2, ISO 27001, and India's DPDP 2023 all evaluate access on a "principle of least privilege" basis, which is operational shorthand for "what was actually used, not what was theoretically allowed".

The principle is older than the cloud. Saltzer and Schroeder's 1975 paper on protection in operating systems named "least privilege" as one of eight design principles — the user should have exactly the minimum authority needed to do the job. The principle has been stable for 50 years; what changed in the last decade is that warehouses now log every query at column granularity, so the evidence to implement least privilege is finally available.

Two signals: usage logs and column-level lineage

The platform needs two raw inputs.

Usage logs. Snowflake ships account_usage.access_history (every query with the columns it touched, since 2021), BigQuery has INFORMATION_SCHEMA.JOBS_BY_PROJECT plus the audit log, Databricks has system.access.audit, and Trino emits per-query event listener payloads. The shape across all of them: one row per query, with the user, the timestamp, the queried tables, and (importantly) the columns referenced. A row from Snowflake's access_history carries query_start_time, user_name, direct_objects_accessed[].columns[], and base_objects_accessed[].columns[] — the columns referenced in the SQL and the columns the query actually had to read after view expansion.
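The vendor payloads are JSON; a minimal flattening step (a sketch, with field names following Snowflake's documented access_history layout, and the row itself invented for illustration) turns one row into the per-column tuples the rest of the pipeline consumes:

```python
import json

# One access_history-shaped row; base_objects_accessed arrives as a JSON
# string (a VARIANT column) listing the columns the query actually read.
raw_row = {
    "query_start_time": "2026-04-22T10:14:00Z",
    "user_name": "riya@razorpay.com",
    "base_objects_accessed": json.dumps([
        {"objectName": "payments.transactions",
         "columns": [{"columnName": "txn_id"},
                     {"columnName": "amount_paise"}]},
    ]),
}

def flatten_access_row(row: dict):
    """Yield one (user, timestamp, table, column) tuple per column read."""
    for obj in json.loads(row["base_objects_accessed"]):
        for col in obj["columns"]:
            yield (row["user_name"], row["query_start_time"],
                   obj["objectName"], col["columnName"])

events = list(flatten_access_row(raw_row))
# events[0] == ("riya@razorpay.com", "2026-04-22T10:14:00Z",
#               "payments.transactions", "txn_id")
```

Landing this flattened stream into a queryable table is the only vendor-specific work; everything downstream operates on the tuples.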

Column-level lineage. A directed acyclic graph where nodes are (table, column) pairs and edges are "column A is derived from column B" relationships extracted from SQL ASTs. Built by parsing every CREATE TABLE AS, INSERT INTO ... SELECT, and dbt model and recording which input columns flow into which output columns. Tools like OpenLineage, dbt-column-lineage, sqllineage, and DataHub's lineage extractor produce this graph. (See /wiki/column-level-lineage-why-its-hard-and-why-it-matters for the mechanics.)

The two signals answer different questions. Usage tells you "what does the user touch?". Lineage tells you "what does the user touch transitively?" — including downstream tables they read that secretly join back to PII columns the user did not realise they were ever exposed to.

Figure: two signals join into the access-decision matrix. Left feed, the warehouse query log (Snowflake access_history, BigQuery audit log, Trino events) yielding (user, table, column, timestamp). Right feed, the column lineage graph (parsed SQL ASTs from dbt models, CTAS statements, view definitions) yielding target.col ← {source.cols}. They join into a user × column touch matrix (direct + transitive reads, last-touched timestamp) that drives grant minimisation ("narrow to 3 columns"), dormant revocation ("unused for 90 days"), and audit evidence ("who touched PAN this quarter?").
Usage and lineage feed a single matrix. Three downstream uses follow naturally: minimise grants, revoke dormant access, answer audit questions with data instead of guesses.

A worked example of the lineage half mattering: a marketing analyst has SELECT on analytics.daily_active_users. They never directly query the payments tables. But analytics.daily_active_users is materialised from a join that pulls customer_mobile from customers.profile to deduplicate users. Through the lineage graph, the analyst's read of daily_active_users is a transitive read of customer_mobile — even though customer_mobile is not a column in daily_active_users. When the DPO asks "list every user who has read customer_mobile in the last 90 days", usage logs alone answer "nobody", and that answer is wrong. Lineage-resolved usage answers correctly.

Building the matrix

Here is the smallest realistic implementation — a Python module that ingests Snowflake-shape access events and a column-lineage map, and produces the user×column touch matrix that drives access decisions.

# access_evidence.py — usage + lineage → grant minimisation
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional, Set, Tuple

# 1) Raw access events from Snowflake access_history (one per query)
access_events = [
    # (user, query_ts, columns_touched [(table, col)])
    ("riya@razorpay.com",   "2026-04-22T10:14:00Z",
     [("payments.transactions","txn_id"),
      ("payments.transactions","amount_paise"),
      ("payments.transactions","merchant_id")]),
    ("riya@razorpay.com",   "2026-04-23T11:02:00Z",
     [("payments.transactions","txn_id"),
      ("payments.transactions","merchant_id")]),
    ("rahul@razorpay.com",  "2026-04-22T09:30:00Z",
     [("analytics.daily_active_users","dau_count"),
      ("analytics.daily_active_users","date")]),
    ("asha@razorpay.com",   "2026-04-21T15:45:00Z",
     [("payments.transactions","customer_aadhaar_hash")]),  # rare query
]

# 2) Column-level lineage map: target_column ← {source_columns}
lineage: Dict[Tuple[str,str], Set[Tuple[str,str]]] = {
    ("analytics.daily_active_users","dau_count"): {
        ("customers.profile","customer_id"),
        ("customers.profile","customer_mobile"),  # used for dedup
    },
    ("analytics.daily_active_users","date"): {
        ("events.session_start","event_ts"),
    },
}

def transitive_columns(table: str, col: str,
                       seen: Optional[Set[Tuple[str,str]]] = None) -> Set[Tuple[str,str]]:
    """Walk lineage upstream to find every source column the read touches."""
    seen = seen or set()
    here = (table, col)
    if here in seen: return set()
    seen.add(here)
    out = {here}
    for src in lineage.get(here, set()):
        out |= transitive_columns(*src, seen)
    return out

# 3) Build the user × column matrix with last-touched timestamps
matrix: Dict[str, Dict[Tuple[str,str], dict]] = defaultdict(dict)
for user, ts, cols in access_events:
    for tbl, col in cols:
        for src_tbl, src_col in transitive_columns(tbl, col):
            entry = matrix[user].setdefault((src_tbl, src_col),
                                            {"count": 0, "last": ts})
            entry["count"] += 1
            if ts > entry["last"]: entry["last"] = ts

# 4) Decisions
PII_COLUMNS = {("customers.profile","customer_mobile"),
               ("payments.transactions","customer_aadhaar_hash"),
               ("payments.transactions","merchant_pan")}

# Grant minimisation for riya
print("riya — actually-touched columns:")
for (t,c), e in sorted(matrix["riya@razorpay.com"].items()):
    print(f"  {t}.{c:30s} count={e['count']} last={e['last']}")

# Dormant detection — 90-day cutoff
cutoff = (datetime(2026,4,25) - timedelta(days=90)).isoformat()+"Z"
print("\nDormant grants to revoke:")
dormant = [(user, t, c, e["last"]) for user, cols in matrix.items()
           for (t,c), e in cols.items() if e["last"] < cutoff]
for user, t, c, last in dormant:
    print(f"  REVOKE {user} on {t}.{c} (last touched {last})")
if not dormant:
    print("  (none in this window)")

# PII access audit
print("\nPII column reads in window:")
for user, cols in matrix.items():
    for tc in cols:
        if tc in PII_COLUMNS:
            print(f"  {user} -> {tc[0]}.{tc[1]} count={cols[tc]['count']}")
# Output:
riya — actually-touched columns:
  payments.transactions.amount_paise           count=1 last=2026-04-23T11:02:00Z
  payments.transactions.merchant_id            count=2 last=2026-04-23T11:02:00Z
  payments.transactions.txn_id                 count=2 last=2026-04-23T11:02:00Z

Dormant grants to revoke:
  (none in this window)

PII column reads in window:
  asha@razorpay.com -> payments.transactions.customer_aadhaar_hash count=1
  rahul@razorpay.com -> customers.profile.customer_mobile count=1

Walk through the load-bearing pieces.

The access_events list is the warehouse's gift: Snowflake's access_history, BigQuery's audit log, and Trino's event listener all produce this exact shape (user, timestamp, columns referenced). The platform's job is to land this stream into a queryable table; everything else is SQL on top. Why warehouse logs and not application logs: warehouse-native logs see the post-expansion columns. If the user runs SELECT * FROM payments.transactions, the application sees one query against one table; the warehouse sees 47 columns touched. Only the warehouse-side log gives you column granularity.

The lineage map is the second input: ("analytics.daily_active_users","dau_count") derives from ("customers.profile","customer_mobile") because the materialisation joined on mobile to dedupe users. This map is computed offline by parsing every CTAS, view, and dbt model.

The transitive_columns walker does the upstream traversal. When Rahul queries analytics.daily_active_users.dau_count, the function returns {(daily_active_users, dau_count), (customers.profile, customer_id), (customers.profile, customer_mobile)} — Rahul has effectively read customer_mobile even though his SQL never named it. Why a transitive walk and not just a one-hop check: lineage chains can be 5–10 hops long in mature warehouses. A semantic-layer table joins a feature table that joins a staging table that pulls from raw; the PII column is at hop 4. One-hop checks miss it; the recursive walker catches it.

The matrix aggregation produces the user × column matrix: rows are users, columns are (table, column) pairs, cells carry a count and a last-touched timestamp. Three reports fall out of this matrix without further machinery: grant minimisation per user (show only actually-touched columns), dormant detection (last-touched older than 90 days), and PII audit (which users have read which sensitive columns, directly or transitively).
The PII_COLUMNS set is the policy input — typically populated from the catalog's column tags rather than hardcoded, but inline here for clarity. The output shows Asha read customer_aadhaar_hash once and Rahul transitively read customer_mobile once — the audit answer is concrete, not a guess.

In production this matrix is rebuilt nightly, persisted to a governance.access_evidence table, and queried by the access-management UI. When the platform engineer opens Riya's grant request, the UI shows "you are about to grant 47 columns; in the last 90 days users in growth-analytics have touched 6 of them — recommended grant: those 6". The decision becomes one click, with evidence.

How this plugs into the access-management workflow

Building the matrix is the easy half. Wiring it into actual grant decisions is where most platforms stall. Three integration points matter.

The grant-request UI shows evidence inline. When an engineer reviews "grant SELECT on payments.transactions to riya@razorpay.com", the UI shows: "team growth-analytics has 14 members; in the last 90 days they have queried this table 312 times across 6 distinct columns: txn_id, amount_paise, merchant_id, created_at, payment_method, status. Suggested grant: SELECT on those 6 columns. Full-table grant adds 41 columns of unused exposure." The engineer can override, but the default is the minimum.
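The recommendation behind that UI copy can be sketched as a fold over the team's touch matrices (a sketch assuming the per-user matrix shape built earlier; names, counts, and the 47-column total are illustrative):

```python
from collections import Counter

# Hypothetical per-user touch matrices for a team: (table, col) -> read count.
team_matrices = {
    "a@example.com": {("payments.transactions", "txn_id"): 10,
                      ("payments.transactions", "amount_paise"): 7},
    "b@example.com": {("payments.transactions", "txn_id"): 4,
                      ("payments.transactions", "merchant_id"): 2},
}
ALL_COLUMNS = 47  # total columns on the table, from the catalog

def suggest_grant(table: str, matrices: dict) -> dict:
    """Union of columns the team actually touched on `table`, plus the
    exposure a full-table grant would add on top of that."""
    touched = Counter()
    for cols in matrices.values():
        for (tbl, col), count in cols.items():
            if tbl == table:
                touched[col] += count
    return {"suggested_columns": sorted(touched),
            "unused_exposure": ALL_COLUMNS - len(touched)}

rec = suggest_grant("payments.transactions", team_matrices)
# rec["suggested_columns"] == ["amount_paise", "merchant_id", "txn_id"]
# rec["unused_exposure"] == 44
```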

Dormant grants get auto-revoked. A nightly job walks matrix[user] for every user; any column not touched in 90 days (configurable; some compliance regimes want 30 days) is queued for revocation. The user gets an email: "your access to payments.transactions.customer_aadhaar_hash will be revoked on Friday because you have not used it since January 14th. Reply to keep it." The default is revoke; explicit reply keeps it. Why an opt-out window and not silent revocation: revoking permission someone is about to use breaks their workload at 9 a.m. on Monday. The 7-day notice trades a small amount of audit-period exposure for huge reductions in operational disruption — and tracks better against ISO 27001's "review access at least annually" requirement, which a continuous-revocation system trivially satisfies.
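The revoke-with-notice policy reduces to a pure function over one user's matrix row; a sketch, where the 90-day cutoff and 7-day notice window are the configurable knobs and all names are illustrative:

```python
from datetime import datetime, timedelta, timezone

DORMANT_AFTER = timedelta(days=90)   # configurable; some regimes want 30
NOTICE_PERIOD = timedelta(days=7)    # opt-out window before revocation

def plan_revocations(matrix_row: dict, now: datetime, kept: set) -> list:
    """Return (column, revoke_on) for grants dormant past the cutoff,
    skipping any the user explicitly replied to keep."""
    plans = []
    for col, last_touched in matrix_row.items():
        if col in kept:
            continue  # user replied to the notice email
        if now - last_touched > DORMANT_AFTER:
            plans.append((col, now + NOTICE_PERIOD))
    return plans

now = datetime(2026, 4, 25, tzinfo=timezone.utc)
row = {
    ("payments.transactions", "customer_aadhaar_hash"):
        datetime(2026, 1, 14, tzinfo=timezone.utc),   # dormant since January
    ("payments.transactions", "txn_id"):
        datetime(2026, 4, 23, tzinfo=timezone.utc),   # recently used
}
plans = plan_revocations(row, now, kept=set())
# one revocation planned for the aadhaar_hash grant, effective 2026-05-02
```

Keeping the policy a pure function makes the "reply to keep it" path trivial: the reply just adds the column to `kept` before the next nightly run.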

The PII access audit trail is queryable. When the DPO needs to answer "who has read customer_mobile in the last quarter?" — for a DPDP 2023 regulator request, an internal incident, or a board reporting cycle — the answer is one query against governance.access_evidence, not a six-week reconstruction effort across raw audit logs.
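With the matrix persisted, the DPO question really is one SELECT. A self-contained sketch using an in-memory SQLite stand-in for governance.access_evidence (the schema here is assumed from the matrix fields, not a documented table):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE access_evidence (
    user TEXT, tbl TEXT, col TEXT, touch_count INT, last_touched TEXT)""")
db.executemany("INSERT INTO access_evidence VALUES (?,?,?,?,?)", [
    ("rahul@razorpay.com", "customers.profile", "customer_mobile",
     1, "2026-04-22T09:30:00Z"),
    ("riya@razorpay.com", "payments.transactions", "txn_id",
     2, "2026-04-23T11:02:00Z"),
])

# "Who has read customer_mobile in the last quarter?" — one query,
# because transitive reads were already resolved at matrix-build time.
readers = db.execute("""
    SELECT user, touch_count FROM access_evidence
    WHERE tbl = 'customers.profile' AND col = 'customer_mobile'
      AND last_touched >= '2026-01-25'""").fetchall()
# readers == [("rahul@razorpay.com", 1)]
```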

Edge cases and adversarial usage

The matrix is only useful if the underlying signals are trustworthy. Three failure modes recur.

SELECT * defeats column granularity if you trust the SQL text alone. A user who runs SELECT * FROM payments.transactions looks like they touched every column. Snowflake's access_history distinguishes direct_objects_accessed (what the SQL named) from base_objects_accessed (the columns actually read after view expansion) — but for SELECT * they collapse to "all columns". The mitigation is to prefer narrow grants in the first place; once narrowed, SELECT * only sees the granted columns and the touch signal is honest.

Stale lineage causes false negatives. If the lineage graph was last rebuilt last Tuesday and a new dbt model was deployed Thursday, a user reading the new model has transitive reads that the matrix misses. Lineage refresh should run on every dbt deployment, not on a daily cadence.

Service accounts and shared dashboards skew the signal. A Looker service account that renders dashboards for 200 users looks in the matrix like one heavy user. The fix is to attribute the column reads to the dashboard-viewing end user via Looker's user impersonation or via the BI tool's per-user query log. Without this, dormant detection over-revokes (the service account's heavy reads keep the columns "live") or under-revokes (each individual user's matrix shows zero touches because they consume through the dashboard).
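A sketch of the re-attribution step, with the BI tool's per-user render log modelled as a simple lookup (the event and log shapes are assumptions for illustration, not a real Looker API):

```python
# Column reads logged under the BI service account's identity.
service_events = [
    ("svc-looker", "2026-04-22T09:00:00Z", "dash-churn",
     [("payments.transactions", "amount_paise")]),
]
# The BI tool's render log: (dashboard, timestamp) -> viewing end user.
viewer_log = {
    ("dash-churn", "2026-04-22T09:00:00Z"): "riya@razorpay.com",
}

def reattribute(events: list, viewers: dict) -> list:
    """Replace the service-account identity with the viewing end user
    when the render log identifies one; otherwise keep the svc account."""
    out = []
    for user, ts, dash, cols in events:
        end_user = viewers.get((dash, ts), user)
        out.append((end_user, ts, cols))
    return out

attributed = reattribute(service_events, viewer_log)
# attributed[0][0] == "riya@razorpay.com"  — the read now counts for Riya
```

With reads attributed to viewers, dormant detection sees each individual's real cadence instead of one synthetic heavy user.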

Going deeper

The access_history schema across vendors

Snowflake's account_usage.access_history (GA since 2021) is the most mature implementation: every query gets one row with direct_objects_accessed (a JSON array of {objectName, columns: [{columnName}]} for what the SQL referenced) and base_objects_accessed (after view expansion). BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT provides per-job referenced tables but not column granularity; for columns you need the audit log's protoPayload.metadata.tableDataRead.fields. Databricks Unity Catalog's system.access.audit has both. Trino's event listener API requires a custom plugin to extract column references from the SQL AST, but the open-source lineage-event-listener plugin does this. The shape across all of them is convergent — the platform code under "Building the matrix" treats them as one upstream after a small adapter per vendor.
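Two such adapters sketched side by side, converging on the (user, timestamp, columns) shape; the field paths follow each vendor's documented layout loosely and should be treated as assumptions:

```python
def from_snowflake(row: dict):
    """access_history row -> (user, ts, [(table, column), ...])."""
    return (row["user_name"], row["query_start_time"],
            [(obj["objectName"], col["columnName"])
             for obj in row["base_objects_accessed"]
             for col in obj["columns"]])

def from_bigquery(entry: dict):
    """Audit-log tableDataRead entry -> same shape. The resource-label
    path for the table name is an assumption for illustration."""
    read = entry["protoPayload"]["metadata"]["tableDataRead"]
    table = entry["resource"]["labels"]["table_id"]
    return (entry["protoPayload"]["authenticationInfo"]["principalEmail"],
            entry["timestamp"],
            [(table, field) for field in read["fields"]])

snow = from_snowflake({
    "user_name": "riya@razorpay.com",
    "query_start_time": "2026-04-22T10:14:00Z",
    "base_objects_accessed": [
        {"objectName": "payments.transactions",
         "columns": [{"columnName": "txn_id"}]}],
})
bq = from_bigquery({
    "timestamp": "2026-04-22T10:15:00Z",
    "resource": {"labels": {"table_id": "payments.transactions"}},
    "protoPayload": {
        "authenticationInfo": {"principalEmail": "riya@razorpay.com"},
        "metadata": {"tableDataRead": {"fields": ["txn_id", "amount_paise"]}},
    },
})
```

Everything past the adapter boundary is vendor-agnostic, which is what keeps the matrix build to a single code path.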

Lineage extraction from CTAS, views, and dbt

Column lineage is built by parsing every SQL statement that produces a target table or view. For a CREATE TABLE AS SELECT a, b, c+d AS e FROM source, the lineage edges are target.a ← source.a, target.b ← source.b, target.e ← {source.c, source.d}. Tools that do this well: dbt-column-lineage (free, dbt-aware), sqllineage (general SQL parser, works on Postgres / Snowflake / Trino dialects), OpenLineage (event-based, integrates with Airflow/Spark/Flink), DataHub's lineage extractor (production-grade, used by LinkedIn). The hard cases are window functions (column carries information from the partition columns), aggregations (output column derives from all input columns referenced in the aggregate), and case-when expressions (output derives from every column in every branch). Production parsers handle these explicitly; toy parsers do not.
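A deliberately toy extractor for the single-source CTAS shape above shows where the edges come from; real tools parse full ASTs and handle the hard cases just listed, whereas this regex sketch does not:

```python
import re

def ctas_lineage(sql: str) -> dict:
    """Toy lineage for `CREATE TABLE t AS SELECT ... FROM source`:
    returns {(target, out_col): {(source, in_col), ...}}."""
    m = re.match(r"CREATE TABLE (\w+) AS SELECT (.+) FROM (\w+)",
                 sql.strip(), re.I)
    target, select_list, source = m.groups()
    edges = {}
    for item in select_list.split(","):
        item = item.strip()
        if " AS " in item.upper():
            expr, alias = re.split(r"\s+AS\s+", item, flags=re.I)
        else:
            expr, alias = item, item
        # every identifier in the expression is treated as a source column
        cols = set(re.findall(r"[a-zA-Z_]\w*", expr))
        edges[(target, alias)] = {(source, c) for c in cols}
    return edges

edges = ctas_lineage(
    "CREATE TABLE target AS SELECT a, b, c+d AS e FROM source")
# edges[("target", "e")] == {("source", "c"), ("source", "d")}
# edges[("target", "a")] == {("source", "a")}
```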

Justification-based access — the anti-pattern that does not scale

Some compliance programmes require every access request to carry a free-text justification: "Riya needs this to build a churn model for the Q3 OKR." This signal is human-readable, manager-reviewable, and almost worthless for automation. A six-month-old justification cannot tell the platform whether the access is still needed; the only signal that can is whether the user actually touched the columns recently. Justification stays useful as audit narrative — when the regulator asks why Asha read aadhaar_hash, the justification answers — but as the primary signal for revocation decisions, usage data dominates.

What DPDP 2023 and ISO 27001 actually demand

India's Digital Personal Data Protection Act 2023 (DPDP) requires a data fiduciary (the company) to take "reasonable security safeguards" to prevent personal-data breach, with explicit operational expectations around access control. The Act does not name "least privilege" but the IT Rules under it and the SPDI Rules 2011 (predecessor) both reference the principle. ISO 27001 Annex A.9 makes it explicit: "users shall be allocated only the minimum privileges required for performing their tasks". Auditors interpret this operationally as: "show evidence that you reviewed each user's access in the last 12 months and removed unused privileges". A usage+lineage matrix, queried during the audit window, is the cleanest evidence; without it, the auditor relies on access-list snapshots and asks for justifications, which is slower and qualitatively worse.

The five-year arc: from "grant on request" to "grant on evidence"

Indian fintechs that have published their data-platform evolution — PhonePe, Razorpay, Cred — describe a recurring arc. Year 1: SQL Server with no logging, grants are forever. Year 2: Snowflake or BigQuery, audit logs available but not queried. Year 3: a security incident or DPDP enforcement creates the budget for a usage-tracking project. Year 4: the matrix exists but humans still make grant decisions; dormant revocation is monthly and manual. Year 5: revocation is automated, the grant UI shows evidence by default, the access-review process closes 80% of audit findings without engineering time. Most companies are between year 2 and year 4 in 2026 — the matrix is more often missing than present, and that is the highest-leverage place to invest platform time.
