Column-level access and row-level security

A new junior analyst joins the risk team at a Bengaluru fintech on a Monday. Her dbt project queries a 200-million-row merchants table that holds GSTIN, PAN, contact email, monthly GMV, and a fraud-score column. By the rules she has been told, she should be able to read GMV and fraud-score but not PAN, and she should only see merchants in the four southern states her team handles. By Tuesday afternoon her notebook is happily printing PANs for merchants in Maharashtra, because the only thing standing between her and the full table was a GRANT SELECT at the table level and a comment in the dbt repo that said "please filter by region". Column-level access and row-level security exist because that comment is not a control. The warehouse has to enforce the filter before it returns the rows.

Column-level access (CLS) hides specific columns from specific principals; row-level security (RLS) hides specific rows. Both are enforced by the query planner, not by application code, so a curious analyst cannot bypass them by writing a different query. The hard part is not the syntax — it is keeping the policies consistent across views, materialized views, joins, and copy-on-write table snapshots so that nothing leaks through a back door.

What CLS and RLS actually do, and why they live in the planner

CLS and RLS are predicates the query planner inserts into your query before it runs. When the analyst writes SELECT * FROM merchants, the warehouse rewrites it — invisibly to her — into something like SELECT gstin, gmv_monthly, fraud_score FROM merchants WHERE region IN ('KA','TN','KL','AP'). The PAN column is gone because her role lacks the column-level grant; the rows from Maharashtra are gone because her role's RLS policy says region IN (your-team's-states). The rewrite happens before the optimizer picks an execution plan, so the unauthorized column never gets read from storage and the unauthorized rows never get scanned.

This matters because the alternative — filter the result after the query runs, in the application — has been broken every way it can break. An analyst who connects with the BI tool's service account inherits the application's filter; the analyst connecting directly with psql does not. A SaaS vendor who uses your warehouse credentials to run their own query does not see your application filter at all. CLS and RLS attach the predicate to the role, not the application, so the analyst sees the same filtered view whether she queries through Tableau, dbt, a Jupyter notebook, or psql.

The planner is the only place the rewrite catches every client. A WHERE clause inside the BI tool's SQL template does not survive an analyst opening DBeaver with the same credentials.

Why the planner and not the storage layer: the storage layer (Parquet files in S3, micro-partitions in Snowflake) does not know which role is reading. It only knows what bytes were requested. By the time the read reaches storage, the policy has either been applied (the planner narrowed the scan) or it has been bypassed. There is no "CLS at the storage layer" mechanism in any major warehouse — the entire mechanism lives in the planner.

CLS in practice — masking, projection, and the column-grant ladder

Column-level access shows up in three flavours, ordered by how strict they are.

Column-level GRANTs. The simplest form: GRANT SELECT(gstin, gmv_monthly) ON merchants TO ROLE risk_jr;. The role can SELECT gstin, gmv_monthly FROM merchants. Try SELECT * FROM merchants and the planner errors with "permission denied for column pan". Postgres, Snowflake, BigQuery, and Redshift all support this directly. The cost: every analyst who joins or leaves a team triggers a maintenance task on every grant, every column. Forty teams × twenty tables × fifteen sensitive columns is 12,000 grant rows that have to stay in sync with HR.

Dynamic data masking. Instead of denying access, the planner returns a redacted value. CREATE MASKING POLICY pan_mask AS (val STRING) RETURNS STRING -> CASE WHEN CURRENT_ROLE() IN ('compliance') THEN val ELSE 'XXXXX' || RIGHT(val, 4) END; then ALTER TABLE merchants MODIFY COLUMN pan SET MASKING POLICY pan_mask;. Now SELECT pan FROM merchants returns XXXXX1234 for everyone except the compliance role. Why masking instead of denial: queries that join, group, or filter on the column still work — COUNT(DISTINCT pan) still returns the right cardinality because the mask is applied after aggregation in some warehouses (Snowflake) and before in others (BigQuery). Masking is the form CLS takes when the analyst legitimately needs to use the column for analysis but should never see the raw value.

View-based projection. The legacy form: create a view merchants_safe that selects only the allowed columns and grant on the view, not the table. CREATE VIEW merchants_safe AS SELECT merchant_id, region, gmv_monthly, fraud_score FROM merchants;. The view inherits the underlying table's RLS but applies its own column projection. The catch is that the analyst must remember to query merchants_safe and not merchants — and any junior who is allowed to see any column of merchants can write a join through the table directly. View-based projection is enforceable only if you also revoke base-table access.

In practice the production pattern most Indian fintechs settle on is: column-level GRANTs for the deny-by-default sensitive columns (PAN, Aadhaar, mobile, email), masking policies for columns where partial visibility is useful (last-4-digits of cards, hashed mobile), and views for legacy BI tools that cannot deal with masked columns. The three layer together — a view that selects a subset of columns from a table whose sensitive columns are individually masked.

Building a tiny RLS engine in 80 lines

Row-level security feels harder than CLS because the predicate depends on the current user's attributes — their team, their region, their security clearance — and those attributes change. Build a tiny engine to see how it composes.

# rls_engine.py — minimal row-level security predicate injector
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Principal:
    name: str
    role: str
    attrs: Dict[str, str]   # e.g. {"region": "south", "states": "KA,TN,KL,AP"}

# Policies are functions: principal -> SQL predicate fragment
PolicyFn = Callable[[Principal], str]

class RLSPlanner:
    def __init__(self):
        self.policies: Dict[str, PolicyFn] = {}

    def add_policy(self, table: str, fn: PolicyFn):
        self.policies[table] = fn

    def rewrite(self, table: str, base_sql: str, principal: Principal) -> str:
        if table not in self.policies:
            # default-deny: no policy means no access (fail closed)
            raise PermissionError(f"no RLS policy for {table}")
        predicate = self.policies[table](principal)
        # naive injection — production parsers use AST rewriting, not string concat
        if " WHERE " in base_sql.upper():
            return base_sql + f" AND ({predicate})"
        return base_sql + f" WHERE ({predicate})"

# Define policies
def merchants_policy(p: Principal) -> str:
    if p.role == "compliance":
        return "1=1"                                 # full access
    if p.role == "risk_jr":
        states = p.attrs.get("states", "").split(",")
        in_clause = ",".join(f"'{s.strip()}'" for s in states)
        return f"region IN ({in_clause})"
    if p.role == "vendor_external":
        return "is_public = 1"
    return "1=0"                                     # fail-closed for unknown roles

# Setup
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE merchants(merchant_id INT, region TEXT, gmv_monthly INT, is_public INT);
INSERT INTO merchants VALUES (1,'KA',45000,1),(2,'MH',82000,0),(3,'TN',15000,1),
                             (4,'DL',91000,0),(5,'KL',8000,1);
""")

planner = RLSPlanner()
planner.add_policy("merchants", merchants_policy)

riya = Principal("riya", "risk_jr", {"states": "KA,TN,KL,AP"})
asha = Principal("asha", "compliance", {})
karan = Principal("karan", "vendor_external", {})

for who in (riya, asha, karan):
    sql = planner.rewrite("merchants", "SELECT merchant_id, region FROM merchants", who)
    rows = db.execute(sql).fetchall()
    print(f"{who.name:8s} ({who.role:16s})  rows={rows}  sql={sql}")

# Output:
riya     (risk_jr         )  rows=[(1, 'KA'), (3, 'TN'), (5, 'KL')]  sql=SELECT merchant_id, region FROM merchants WHERE (region IN ('KA','TN','KL','AP'))
asha     (compliance      )  rows=[(1, 'KA'), (2, 'MH'), (3, 'TN'), (4, 'DL'), (5, 'KL')]  sql=SELECT merchant_id, region FROM merchants WHERE (1=1)
karan    (vendor_external )  rows=[(1, 'KA'), (3, 'TN'), (5, 'KL')]  sql=SELECT merchant_id, region FROM merchants WHERE (is_public = 1)

Walk through what makes this an engine rather than a hand-applied filter. PolicyFn = Callable[[Principal], str] is the policy abstraction — a policy is a pure function from principal to predicate. Why a function and not a static string: the predicate depends on attributes the principal carries (team, region, clearance), which are runtime values. A static WHERE region = 'KA' would only protect one specific user; a policy function generates the right predicate for any principal in the role. The default-deny clause raise PermissionError(...) if no policy is registered is the fail-closed default — the absence of a policy means no access, not full access. Why fail-closed instead of fail-open: a forgotten policy on a sensitive table would otherwise leak silently. Fail-closed surfaces the missing policy at first query, loud, with a stack trace; fail-open surfaces it months later in a compliance audit. The merchants_policy function with a fall-through 1=0 is the same pattern at the per-policy level — unknown roles get an always-false predicate so their queries return zero rows rather than error or full table. The string-concat injection is the part you replace in production: real warehouse planners rewrite the AST, not the SQL text, because string concat breaks on subqueries, CTEs, and complex predicates. Why AST rewriting matters: a query like SELECT * FROM (SELECT * FROM merchants) m WHERE m.gmv > 1000 has the outer WHERE in a different scope from the inner FROM. String concat would inject the policy in the wrong place. AST rewriting walks the tree and inserts the predicate at every reference to the protected table — including inside CTEs and subqueries. The Principal dataclass with attrs is the attribute-based part of attribute-based access control (ABAC) — the policy can branch on any attribute the principal carries, not just role. This is what lets you write "south-region team in Bengaluru office on weekdays" as one policy.

The toy is single-table, single-predicate, no caching. Production engines (Snowflake's RAP, BigQuery's row access policies, Postgres RLS via CREATE POLICY) add: predicate caching by role for query-plan reuse, hierarchical roles (risk_sr inherits risk_jr's predicate plus extra access), policy versioning so changes can be audited, and security barrier views that prevent the optimizer from leaking data through side-channel inference. The shape — fail-closed default, principal-to-predicate function, planner-level injection — does not change.

Snowflake, BigQuery, Postgres — same idea, different syntax

Every warehouse implements CLS and RLS, with mostly-overlapping but slightly-different vocabulary.

Snowflake uses masking policies for CLS and row access policies for RLS. A masking policy is attached to a column with ALTER TABLE ... MODIFY COLUMN ... SET MASKING POLICY; a row access policy is attached to a table with ALTER TABLE ... ADD ROW ACCESS POLICY. Both policies are SQL functions that branch on CURRENT_ROLE(). Snowflake's distinguishing feature is that the policies follow the data through secure views and secure UDFs — a secure view does not expose its query plan to the user, so an analyst cannot deduce the policy by reading EXPLAIN output.

BigQuery uses policy tags (Data Catalog) for CLS and row access policies for RLS. Policy tags are taxonomy-style: you create a taxonomy "PII" with child tags "PAN", "Aadhaar", "Email", attach the tag to columns, then grant roles/bigquerydatapolicy.maskedReader on the tag. A user with the maskedReader role on the PAN tag sees masked PANs everywhere — across every table, every dataset. Row access policies are SQL WHERE predicates attached to a table for a given user/group: CREATE ROW ACCESS POLICY south_team ON merchants GRANT TO ('group:south-team@company.com') FILTER USING (region IN ('KA','TN','KL','AP')).

Postgres uses GRANTs for column-level access and CREATE POLICY for row-level security with ALTER TABLE ... ENABLE ROW LEVEL SECURITY. The policy syntax: CREATE POLICY south_only ON merchants FOR SELECT USING (region IN (SELECT state FROM user_states WHERE username = current_user));. A security barrier view (CREATE VIEW ... WITH (security_barrier = true)) prevents the planner from pushing predicates through the view in ways that would let the user infer hidden rows. Without it, a query like SELECT * FROM merchants_view WHERE leaky_function(pan) could call leaky_function on rows the user is not supposed to see, leaking the existence (and content) of those rows through side effects.

Databricks Unity Catalog uses column masks and row filters — both are SQL functions registered on the catalog that the planner injects when the protected table is queried. Unity Catalog's distinguishing feature is the centralised catalog — one policy applies across every workspace and every cluster, not per-table.

The lesson for a platform team is not "which warehouse has the best RLS" — they are all roughly equivalent in expressiveness — but "which model fits how your teams actually look up sensitive data?" If your sensitivity classification lives in a central data catalog and applies across many tables (PAN is PAN whether it appears in merchants, users, or kyc), BigQuery's tag-based model is the cleanest fit. If your classification is table-specific and varies by business context, Snowflake's per-column policies are easier. If you are running open-source Postgres at the edge of your platform, native RLS via CREATE POLICY is what you have, and the rest of the world's syntactical comforts are not available.

The differences are about *where the policy is attached* — column, table, tag, or catalog. The deeper the attachment point, the broader the blast radius when the policy changes.

The leaks — where a poorly-applied policy still leaks data

CLS and RLS prevent the obvious "select all" leak. They do not prevent inference, side-channels, or the back-door joins that production teams discover only in incident reviews.

Joins through unprotected tables. merchants has RLS on region. The payments table does not, but it has a merchant_id. An analyst writes SELECT p.amount, m.gstin FROM payments p LEFT JOIN merchants m USING (merchant_id) WHERE p.amount > 1e7. The RLS on merchants filters out non-south merchants — m.gstin is NULL for them — but the analyst now has a complete list of which merchant_ids are non-south because their gstins came back NULL. The fix is RLS on payments too, or a security-barrier view that joins both tables and applies the filter at the view level.

Aggregate inference. A user has access to COUNT(*) over merchants but not to individual rows. They write SELECT COUNT(*) FROM merchants WHERE pan = 'AAAPP1234L'. If the count is 1, that PAN exists; if 0, it does not. Differential privacy and k-anonymity floors are the formal defence; the practical defence is to disallow predicates on sensitive columns in non-privileged roles.

Policy mutation through DDL. An analyst with ALTER TABLE on a downstream table can CREATE TABLE leak AS SELECT * FROM merchants_view. If the materialisation runs as the analyst's role, RLS applies and only their rows land in leak. If it runs as a service account with broader access (a common dbt setup), the materialisation contains rows the analyst never had access to — and now they have SELECT on leak. The fix is forcing CTAS / materialised views to inherit the invoking role's policies, which Snowflake does by default, BigQuery does not, and Postgres needs an explicit SECURITY INVOKER setting.

Column-mask bypass via CASE statements. A masking policy returns XXXXX1234 for non-privileged roles. The analyst writes SELECT pan, CASE WHEN LENGTH(pan) > 5 THEN 'long' ELSE 'short' END FROM merchants. If the mask is applied after the projection (the wrong order), the CASE sees the unmasked PAN and the analyst can bit-by-bit reconstruct it. Snowflake applies masks at projection time so this fails; BigQuery applies them via Data Catalog and the same trick has been demonstrated and patched twice in 2023–24.

The pattern is: a leak-prevention mechanism only prevents what it explicitly checks. Anything the policy author did not anticipate — joins, aggregates, DDL, masked-column-as-input — is a back door until someone notices.

Common confusions

"Column GRANTs are the same as masking policies." GRANTs deny access — the column does not appear in the result. Masking policies redact — the column appears with a fake value, and the row count is unchanged. Aggregate queries over a masked column return the right cardinality; aggregate queries over a denied column error.
"RLS replaces application-level authorization." It replaces the data filter, not the entire authorization model. You still need application-level checks for write paths (who can update what), for tenant routing, and for UI-level visibility. RLS is the last line; it is not the only line.
"A view with a WHERE clause is the same as RLS." A view filters when the analyst queries the view. RLS filters when the analyst queries the table — including through joins, CTEs, and DDL. An analyst who can SELECT directly from the base table bypasses the view's WHERE entirely.
"Granting on a column means the data is never read from disk." The planner can push the column-level filter into storage so the unallowed bytes are never materialised, but the file format (Parquet's column chunks, Snowflake's micro-partitions) determines whether a denied column requires reading the file at all. A row-group-aligned scan of a wide table may still touch the file even for a single allowed column.
"RLS predicates are free." They are evaluated on every row, every query. A predicate like WHERE region IN (SELECT state FROM user_state_table WHERE username = current_user()) adds a join to every query against the protected table. Cache the policy result by role, or push the predicate into a static IN (...) list at policy-registration time.
"Masking policies prevent exfiltration." They prevent the masked column appearing in the result. An analyst who can join the masked column to a side table, group by merchant_id, and read the unmasked merchant_id has exfiltrated the row identity even if the masked value was useless.

Going deeper

Attribute-based access control (ABAC) vs role-based access control (RBAC)

RBAC says "role X can read column Y". ABAC says "principal with attributes (team, region, clearance) can read column Y if (their team owns this data, their region matches, their clearance is at least Y's classification)". RBAC is simpler — fewer rules, fewer attributes — but explodes combinatorially: 40 teams × 4 regions × 3 clearance levels is 480 roles. ABAC scales to that combinatorial space because the policy branches on attributes at evaluation time. The hard part of ABAC is keeping the attribute store fresh — when an engineer transfers from the south team to the north team, every query they run after that transfer has to see the new attribute. Snowflake supports both via tag-based RBAC and CURRENT_ROLE-branched policies; BigQuery leans into ABAC via policy tags and IAM conditions; Postgres natively supports only RBAC, with ABAC layered on via custom policy SQL.

Why CDC pipelines need separate policy thinking

Change data capture (CDC) sends every changed row downstream. If the source table has RLS but the CDC target does not, the analyst on the target sees rows their RLS would have hidden in the source. Razorpay's data team learned this the hard way in 2023 when their merchant CDC stream was consumed by a sandbox warehouse that had no RLS — every region's data was visible to every team's BI dashboard for six weeks before someone noticed. The right pattern is to apply the same RLS at the CDC sink, or to filter out the policy-protected rows in the connector itself. Debezium can be configured with row-level filters; Fivetran's column-block lists handle CLS but not RLS, which is why teams running Fivetran on PII-heavy tables typically build a downstream view layer to re-apply the protection.

Differential privacy as a CLS extension

When the analyst legitimately needs aggregate statistics over a sensitive column but should not be able to infer individual rows, plain CLS is not enough — COUNT(*) WHERE pan = ? leaks individual existence. Differential privacy adds calibrated noise to query results so individual rows cannot be inferred from aggregates. Aadhaar's research-data sharing programme uses a (ε, δ)-differential privacy mechanism on demographic queries, and Apple's iOS keyboard suggestions use it at the device level. Differential privacy is the right model when "release this aggregate to the analyst, but no exact-row inference" is the requirement; for plain CLS/RLS where the requirement is "they cannot read this column or these rows at all", it is overkill.

Policy testing — yes, you can unit-test RLS

The mistake teams make is treating policies as configuration. Policies are code: they have edge cases, they regress, and a missing edge case is a security incident. The best teams treat the policy as a function and write unit tests: assert query_as("riya", "SELECT count FROM merchants").row_count() == expected_for_south_team; assert query_as("karan-vendor", "SELECT * FROM merchants").row_count() == 2. The dbt community has been pushing this pattern with the dbt-row-level-security package; Snowflake users with terraform-managed policies can test by spinning up a sandbox account with synthetic test data and running the policies against principal stand-ins. It is more work than treating the policy as a one-time SQL file, but the alternative is finding a regression six months later when the audit catches it.

Where this leads next

/wiki/data-classification-pii-confidential-public — the upstream taxonomy that decides which columns and rows need a policy in the first place.
/wiki/audit-logs-and-data-access-trails — once policies enforce access, you still need to prove who looked at what.
/wiki/data-masking-and-tokenization — the deeper technique behind dynamic data masking, including format-preserving encryption and reversible tokenisation.
/wiki/multi-tenant-warehouses-isolation-and-noisy-neighbours — the resource-isolation side of the same multi-tenant problem.

References

Snowflake, "Column-level Security and Masking Policies" — the canonical exposition of mask + role policy + secure views.
Google Cloud, "BigQuery Column-level Access Control with Policy Tags" — taxonomy-style CLS at scale.
PostgreSQL, "Row Security Policies" — the open-source reference, with security_barrier view discussion.
Databricks, "Unity Catalog: Row Filters and Column Masks" — catalog-level policy attachment across workspaces.
Dwork, "Differential Privacy: A Survey of Results" — the theoretical foundation for aggregate-only access.
RBI, "Master Direction on Digital Payment Security Controls" — Indian regulatory context for who can read what payment data.
/wiki/multi-tenant-warehouses-isolation-and-noisy-neighbours — the sibling chapter on resource-level multi-tenancy.
/wiki/query-cost-attribution — once policies decide who can query what, attribution decides who pays for it.