Data catalogs and the "what does this column mean" problem

On a Tuesday afternoon in Gurgaon, Riya joins the analytics team at Meesho and is asked to "pull last week's reseller GMV by tier". She opens the warehouse, finds three plausible tables — mart.reseller_gmv_daily, analytics.reseller_revenue_v2, and silver.reseller_orders_enriched — each with a column called gmv_inr. The numbers are different. The dbt repo has stale comments. The Confluence page describing reseller GMV was last edited 14 months ago by an engineer who has since left. She pings the team Slack: "which gmv_inr is the right one?" and gets three opinions in eleven minutes. Two hours later she has a number to share, but no confidence in it. Multiply this across every new joiner, every analyst, every product manager pulling data for a board deck, and you have the daily cost of not having a data catalog: thousands of person-hours per quarter spent rediscovering meaning that someone, somewhere, already knew.

A data catalog is the searchable inventory of every dataset, column, dashboard, and metric in your platform — with descriptions, ownership, lineage, freshness, and usage stitched in. It is the answer to "what does this column mean, who owns it, where did it come from, and is it safe to use?" without requiring three Slack threads. Without one, every question that begins "is this number right?" costs hours; with a working one, it costs seconds.

What a data catalog actually is

A data catalog is a metadata index — a database whose rows are the descriptions of other databases. The unit of a catalog is the metadata entity: a table, a column, a dashboard, a metric, a dbt model, a Kafka topic, a feature, an ML model, a notebook. The core asset of a catalog is the graph of relationships between these entities: this column belongs to this table, this dashboard reads this column, this metric is defined by this dbt model, this Kafka topic feeds this Iceberg table.

[Figure: Anatomy of a data catalog entry. A central card for the column mart.reseller_gmv_daily.gmv_inr (DECIMAL(18,4), gross merchandise value in rupees, summed per reseller per day; owner: data-platform@meesho) with attached panels for description, owner, tags, upstream and downstream lineage, freshness, quality status, and usage. Arrows show metadata flowing in from many sources: the dbt manifest (column descriptions, tests), OpenLineage (column edges, run history), query logs (who reads this, how often), Soda / GE checks (freshness, null %, range), BI tools (dashboards using this), Slack ownership (team, on-call rotation), a PII classifier (DPDP tag, retention class), and access logs (who queried, RBAC trail). The catalog stitches all of these into one row keyed by column URN: searchable, queryable, governed, deduplicated.]
A catalog entry for one column is the join of metadata from at least eight upstream sources: dbt manifest, OpenLineage events, warehouse query logs, BI tools, ownership records, Soda/GE quality checks, PII classifiers, access logs. The catalog's job is to keep this join fresh, queryable, and authoritative.

The catalog is not the data itself — it never holds rows from mart.reseller_gmv_daily. It holds the answer to every question about that table that is not "give me the rows": what columns it has, what they mean, who owns them, what feeds them, who reads them, when they were last updated, what dashboards depend on them, what tests they pass, what regulatory class they belong to. Why this distinction is sharper than it looks: a data warehouse is optimised for queries that scan billions of rows; a catalog is optimised for queries that join hundreds of metadata facets per entity across millions of entities. The two have completely different storage and indexing requirements — which is why catalog tools (DataHub, OpenMetadata, Atlan, Unity Catalog) are separate systems from warehouses, not warehouse features.

A useful three-layer model for what lives inside a catalog: the technical layer (schema, types, partition keys, file paths, retention) ingested automatically from the warehouse / lake / catalog stores; the business layer (descriptions, glossary terms, owners, tiers, regulatory tags) curated by humans and stored as long-lived metadata; the operational layer (lineage edges, freshness, quality status, usage counts, recent changes) recomputed continuously from runtime events. A working catalog keeps all three layers fresh and joinable. A failing catalog has the technical layer (because it's automated) but stale business and operational layers — which is why most catalogs in 2026 either work because someone keeps the human metadata current, or fail because nobody does.

The single hardest engineering problem inside a catalog is identity: deciding when two metadata entries describe the same thing. A column called gmv_inr in dbt-prod, the same column appearing in Snowflake's information schema as MART.RESELLER_GMV_DAILY.GMV_INR, the OpenLineage facet that calls it urn:li:column:(snowflake,prod.mart.reseller_gmv_daily,gmv_inr), and the Looker view referencing it as ${gmv_inr} — all four describe the same column, but the strings are different and the case sensitivity rules are different. The catalog has to resolve these to one canonical entity URN. Get this wrong and the catalog shows four separate cards for the same column, each with partial information; get it right and one card shows the full picture. DataHub's URN scheme, OpenLineage's dataset naming spec, and Unity Catalog's three-level namespace all exist primarily to solve this identity problem.
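
To make the identity problem concrete, here is a minimal sketch that resolves two of the spellings above to one canonical key, assuming simple normalisation rules (lowercasing, stripping an environment qualifier, unpacking the OpenLineage spelling). The rules and the canonical_key function are illustrative assumptions, not any tool's actual algorithm — note in particular that the Looker ${gmv_inr} spelling cannot be resolved at all without its view context, which is exactly why this problem is hard.

```python
import re

def canonical_key(raw):
    """Resolve one spelling of a column reference to a canonical key.
    The rules here (lowercasing, env stripping) are illustrative assumptions."""
    s = raw.strip()
    # OpenLineage-style URN: pull out the dataset and column parts.
    m = re.match(r"urn:li:column:\((\w+),([\w.]+),(\w+)\)", s)
    if m:
        s = f"{m.group(2)}.{m.group(3)}"
    # Looker-style ${field} cannot be fully resolved without its view
    # context; we only strip the wrapper here.
    s = s.replace("${", "").replace("}", "")
    # Case-fold, then drop a leading environment/database qualifier.
    parts = s.lower().split(".")
    if parts and parts[0] in ("prod", "dev", "staging"):
        parts = parts[1:]
    return ".".join(parts)

print(canonical_key("MART.RESELLER_GMV_DAILY.GMV_INR"))
print(canonical_key("urn:li:column:(snowflake,prod.mart.reseller_gmv_daily,gmv_inr)"))
# Both print: mart.reseller_gmv_daily.gmv_inr
```

Both spellings collapse to one key, which is what lets the catalog show one card instead of four.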

Building a tiny catalog from scratch

The fastest way to internalise what a catalog does is to build one. Here is a small Python catalog, well under a hundred lines, that ingests metadata from a dbt manifest, stores it in SQLite, and serves search queries: the same shape as DataHub and Unity Catalog, just smaller.

# A minimal data catalog: ingest, store, search.
# Run: python catalog.py  (stdlib only; expects dbt's target/manifest.json)
import json, sqlite3, hashlib
from pathlib import Path

DB = sqlite3.connect("catalog.db")
DB.executescript("""
CREATE TABLE IF NOT EXISTS entity (
    urn TEXT PRIMARY KEY,
    type TEXT NOT NULL,             -- 'table' | 'column' | 'dashboard'
    name TEXT NOT NULL,
    qualified_name TEXT NOT NULL,
    description TEXT,
    owner TEXT,
    tags TEXT,                      -- JSON array
    technical_props TEXT,           -- JSON blob: type, nullability, etc.
    updated_at INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_entity_qname ON entity(qualified_name);
CREATE INDEX IF NOT EXISTS idx_entity_type ON entity(type);
CREATE TABLE IF NOT EXISTS edge (
    src_urn TEXT NOT NULL,
    dst_urn TEXT NOT NULL,
    kind TEXT NOT NULL,             -- 'contains' | 'lineage' | 'reads'
    PRIMARY KEY (src_urn, dst_urn, kind)
);
""")

def urn(kind, qualified):
    return f"urn:meesho:{kind}:{hashlib.md5(qualified.encode()).hexdigest()[:12]}"

def ingest_dbt_manifest(path):
    """dbt's manifest.json holds schemas, descriptions, tests, lineage."""
    manifest = json.loads(Path(path).read_text())
    import time
    now = int(time.time())  # timestamp this ingestion run
    for node_id, node in manifest.get("nodes", {}).items():
        if node["resource_type"] != "model": continue
        qname = f"{node['schema']}.{node['name']}"
        tbl_urn = urn("table", qname)
        DB.execute("INSERT OR REPLACE INTO entity VALUES (?,?,?,?,?,?,?,?,?)",
            (tbl_urn, "table", node["name"], qname,
             node.get("description") or "",
             node.get("config", {}).get("meta", {}).get("owner", ""),
             json.dumps(node.get("tags", [])),
             json.dumps({"materialization": node.get("config", {}).get("materialized", "")}),
             now))
        for col_name, col in node.get("columns", {}).items():
            col_qname = f"{qname}.{col_name}"
            col_urn = urn("column", col_qname)
            DB.execute("INSERT OR REPLACE INTO entity VALUES (?,?,?,?,?,?,?,?,?)",
                (col_urn, "column", col_name, col_qname,
                 col.get("description") or "",
                 "", json.dumps(col.get("tags", [])),
                 json.dumps({"data_type": col.get("data_type", "")}),
                 now))
            DB.execute("INSERT OR REPLACE INTO edge VALUES (?,?,?)",
                       (tbl_urn, col_urn, "contains"))
    DB.commit()

def search(query):
    rows = DB.execute("""SELECT urn, type, qualified_name, description
        FROM entity WHERE qualified_name LIKE ? OR description LIKE ?
        ORDER BY type DESC LIMIT 10""",
        (f"%{query}%", f"%{query}%")).fetchall()
    return rows

ingest_dbt_manifest("target/manifest.json")
for r in search("gmv_inr"):
    print(f"  {r[1]:7s} {r[2]:50s} — {r[3][:50]}")
# Output:
#   table   mart.reseller_gmv_daily                            — daily reseller gross merchandise value in INR
#   column  mart.reseller_gmv_daily.gmv_inr                    — sum of order_value_paise / 100 for SUCCESS-stat
#   column  mart.reseller_orders_enriched.gmv_inr_estimate     — placeholder used in pre-2024 dashboards (deprec
#   column  silver.reseller_orders.gmv_inr                     — staging-level GMV before tier corrections

Walking through the key lines: the urn() function generates a stable URN per qualified name — the catalog's identity resolver. Every entity has exactly one URN, and that URN is what edges, ownership, and lineage reference. Why URN stability matters: if you regenerate URNs each ingestion run, every edge and every Slack-pasted link breaks. URNs are append-only; once issued, they are forever. DataHub's URN scheme uses (platform, qualified_name, env) as the seed; Unity Catalog uses (catalog, schema, table) as the three-level namespace. The hashing here is a shortcut for a tiny example — production systems use a deterministic, human-readable URN.
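
A minimal sketch of what a deterministic, human-readable URN could look like, seeded by (platform, qualified name, env) as described above. The exact format here (urn:catalog:...) is an assumption for illustration, not DataHub's literal scheme.

```python
def readable_urn(kind, platform, qualified, env="PROD"):
    # Derived, not hashed: re-ingesting the same entity reproduces the same
    # URN byte-for-byte, and a human can read the identity straight off a
    # Slack-pasted link. Lowercasing makes Snowflake's UPPER_CASE spelling
    # and dbt's lower_case spelling collide onto one URN.
    return f"urn:catalog:{kind}:({platform},{qualified.lower()},{env})"

print(readable_urn("column", "snowflake", "MART.RESELLER_GMV_DAILY.GMV_INR"))
# urn:catalog:column:(snowflake,mart.reseller_gmv_daily.gmv_inr,PROD)
```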

The entity and edge tables form a graph. entity is the node store with technical and business metadata; edge is the relationship store. This is the same shape DataHub uses (with a Kafka topic for change events), OpenMetadata uses (with an Elasticsearch search layer on top), and Unity Catalog uses (with Delta as the storage layer). The technology choices differ, but the schema is the same.

The ingest_dbt_manifest() function turns dbt's manifest.json into catalog rows. This is the most-leveraged ingestion path in 2026: every dbt run produces a fresh manifest, and the catalog ingests it within minutes of dbt build. The manifest already has descriptions, tests, tags, and column-level lineage — the catalog's job is to translate dbt's vocabulary into the catalog's URN scheme and store it.

Search via LIKE is the toy version. Real catalogs use Elasticsearch or OpenSearch for substring + token + semantic search across millions of entities, with relevance ranking by usage (the column queried 10,000 times this week ranks above the column queried twice last year). DataHub's search uses Elasticsearch with custom analyzers for SQL-identifier tokens; Atlan layers a knowledge-graph embedding on top for "find the column that means revenue" queries.

The thing this toy catalog already gets right: a single answer for "what is gmv_inr?" with description, qualified name, and the table that contains it. The thing it does not yet do: lineage, freshness, ownership-from-runtime, usage scoring, PII classification, or governance. Each of those is its own ingestion path, its own consumer of metadata events, and its own UI surface. A production catalog is this skeleton plus 30+ additional ingestion paths.

The thing the toy version also lacks but production needs urgently: a versioned change log. Every metadata write (description edit, owner change, tag application) needs to be recorded with who-did-what-when, because the catalog itself becomes part of the audit trail for regulated industries. RBI inspectors in 2026 ask not only "what does this column mean?" but "who set that meaning, when, and based on what evidence?" An immutable change log keyed by entity URN — typically a Kafka topic with infinite retention plus a periodic snapshot — is what makes the catalog defensible to an auditor. DataHub's ingestion model is built around this (every metadata event is an MCP — Metadata Change Proposal — written to Kafka first, applied to the graph second); OpenMetadata's model is similar. Skipping the change log means losing the ability to answer "when did we know this column was PII?" — which is the kind of question that loses an audit.
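
A minimal sketch of that log-first write path, with an illustrative change_log table and a hypothetical set_description helper; the log-before-apply ordering mirrors the Kafka-first model described above.

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,   -- total order of all writes
    entity_urn TEXT NOT NULL,
    field TEXT NOT NULL,                     -- e.g. 'description', 'owner'
    old_value TEXT,
    new_value TEXT,
    actor TEXT NOT NULL,
    at INTEGER NOT NULL
);
CREATE TABLE entity (urn TEXT PRIMARY KEY, description TEXT);
""")

def set_description(urn, new_desc, actor):
    row = db.execute("SELECT description FROM entity WHERE urn = ?", (urn,)).fetchone()
    old = row[0] if row else None
    # Log first, apply second: the same ordering as the Kafka-first model.
    db.execute("INSERT INTO change_log (entity_urn, field, old_value, new_value, actor, at)"
               " VALUES (?, 'description', ?, ?, ?, ?)",
               (urn, old, new_desc, actor, int(time.time())))
    db.execute("INSERT OR REPLACE INTO entity VALUES (?, ?)", (urn, new_desc))
    db.commit()

set_description("urn:c:gmv_inr", "GMV in rupees", "riya@example.com")
set_description("urn:c:gmv_inr", "GMV in rupees, SUCCESS orders only", "arjun@example.com")

# "Who set that meaning, and when?" is now a query, not an archaeology dig.
history = db.execute("SELECT actor, old_value, new_value FROM change_log ORDER BY seq").fetchall()
```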

What goes inside the catalog (the metadata categories)

A catalog's value is proportional to how complete and how fresh its metadata is. The categories that production catalogs in 2026 ingest, in roughly the order teams add them:

[Figure: The metadata layers a catalog ingests. A pyramid with three tiers. Base: the technical layer (schema, types, partitions, paths, retention), ingested automatically from the warehouse / lake / catalog store (Snowflake; Iceberg, Hive); cheap to ingest and cheap to keep fresh. Middle: the business layer (descriptions, glossary, owners, tiers, PII tags), curated by humans and workflows (dbt yml, Confluence, glossary CMS); expensive to curate and prone to erosion if neglected. Apex: the operational layer (lineage, freshness, quality, usage), recomputed from runtime events (OpenLineage, Soda / GE / dbt tests); expensive to compute and stale within hours.]
The technical layer is automatic and cheap. The business layer requires human curation and silently erodes. The operational layer is computed continuously from runtime events. A failing catalog has the base but not the apex; a working catalog keeps all three fresh.

Technical metadata is the easy part. Every warehouse exposes an INFORMATION_SCHEMA or equivalent; every lake table format (Iceberg, Delta, Hudi) exposes manifest files; every cluster manager exposes job histories. A catalog ingests these on a schedule (every 15 minutes is typical) and updates the entity rows. This layer is rarely stale because it is automated.

Business metadata is the hard part. Descriptions, glossary terms, ownership, regulatory classification, tier (gold / silver / bronze), retention policy, business owner. These come from humans — analysts writing dbt model documentation, data stewards filling in glossary terms, platform engineers tagging PII columns. The half-life of business metadata in a fast-moving fintech is about 6 months: a description written today is wrong by mid-year because the meaning of the column drifted with product changes. Catalogs that do not have a workflow for keeping business metadata fresh — pull-request reviews on description changes, expiration dates on stale descriptions, ownership rotation tracking — accumulate stale entries that are worse than missing entries because they look authoritative while being wrong.

The pattern that keeps business metadata fresh in 2026 is a quarterly "metadata review" cycle, run by the data platform team, where every owner of a tier-1 entity gets a Slack message saying "your column gmv_inr was last described 7 months ago — confirm or update". Replies are recorded in the change log; non-replies trigger a tag-as-stale that demotes the description's authority in the UI ("this description has not been confirmed in 9 months — verify before relying on it"). The cost to the data platform team is real but bounded: at Razorpay's scale, the quarterly review involves 400–500 ping-and-confirm cycles, taking roughly two engineer-weeks per quarter. The cost of not running the review is open-ended — descriptions go stale, analysts make decisions on wrong information, the catalog gradually loses credibility, and the team eventually has to do a much more expensive full re-curation. Quarterly hygiene is cheap; deferred hygiene is not.
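
The sweep behind that quarterly review can be a single query over the entity table. A sketch against a simplified schema (the column names are assumptions echoing the toy catalog earlier in the chapter), finding tier-1 entities whose metadata was last touched more than roughly six months ago:

```python
import sqlite3, time

SIX_MONTHS = 182 * 24 * 3600
now = int(time.time())

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entity "
           "(urn TEXT, qualified_name TEXT, owner TEXT, tags TEXT, updated_at INTEGER)")
db.executemany("INSERT INTO entity VALUES (?,?,?,?,?)", [
    ("urn:1", "mart.gmv_daily.gmv_inr", "data-platform", '["tier-1"]', now - 210 * 86400),
    ("urn:2", "mart.orders.order_id",   "payments",      '["tier-1"]', now - 10 * 86400),
])

# Tier-1 entities whose metadata has not been touched in ~6 months.
stale = db.execute(
    "SELECT qualified_name, owner FROM entity "
    "WHERE tags LIKE '%tier-1%' AND updated_at < ? ORDER BY updated_at",
    (now - SIX_MONTHS,)).fetchall()

for qname, owner in stale:
    # In production this would post to Slack; here we just print the ping.
    print(f"ping {owner}: confirm or update the description of {qname}")
```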

Operational metadata is the apex of value. Lineage (chapter 28–29), freshness (chapter 33), quality status (Soda, GE, dbt tests), usage counts (from query logs), recent changes (which dbt models were modified this week). This layer requires runtime event ingestion — OpenLineage events from every executor, query log scrapes from every warehouse, quality check results from every test framework. The freshness of this layer is the tightest: a lineage graph that is a day old is a museum piece; a freshness metric that is 6 hours stale defeats its own purpose. Production catalogs ingest operational metadata on minute-level latency.

The operational layer's value is also where most usage actually lands. Analysts open a catalog page primarily to see the green/red status dot — is this column safe to use right now? — and only secondarily to read the description. A catalog without operational metadata is a static documentation site that happens to also have a search bar; a catalog with operational metadata is a control plane the on-call uses during incidents. The Razorpay catalog page-view logs from 2025 showed that the freshness, lineage, and quality status panels accounted for 70% of all in-page interactions — descriptions, ownership, and tags accounted for the remaining 30%. Build the operational layer first if you can; the business layer is what makes it precise, but the operational layer is what makes it used.

The mistake every team makes once is investing heavily in the technical layer (because it is easy and looks impressive in dashboards), then declaring victory before they have built the business and operational layers. The catalog that knows the schema of every table but cannot tell you who owns the table or whether the dashboard reading it is currently broken is a liability — it gives users a false sense of completeness while still leaving them to ask the on-call. Why the order matters: users come to a catalog to answer questions like "is this column safe to use?" The answer requires business context (owner, tier, regulatory class) and operational state (fresh, healthy, low-error-rate), not just schema. A catalog that returns only the technical answer pushes the hard question back to humans — and the human asks Slack, and the Slack thread is exactly what the catalog was supposed to replace.

Glossary and business terms sit between business and operational metadata. A glossary is the controlled vocabulary of business concepts: "Reseller GMV", "Active User", "Tier-2 Merchant". Each glossary term is linked to one or more catalog entities (columns, tables, dashboards) that implement it. The glossary is what lets an analyst search "Reseller GMV" and find every column, dashboard, and dbt model that participates in the metric — without knowing whether the column is named gmv_inr, gross_merchandise_value, or reseller_revenue. DataHub, OpenMetadata, and Atlan all support glossaries; the teams that use them well treat the glossary as a versioned, governed artifact (changes go through review) rather than a free-edit wiki.
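
A sketch of how a glossary term attaches to its implementers, in the toy catalog's vocabulary: the term is itself an entity, and 'implements' edges link columns and dashboards to it, so a search by business concept fans out to every implementation regardless of column naming. The URNs and table names here are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE term (urn TEXT PRIMARY KEY, name TEXT, definition TEXT);
CREATE TABLE edge (src_urn TEXT, dst_urn TEXT, kind TEXT);
INSERT INTO term VALUES ('urn:g:reseller_gmv', 'Reseller GMV',
    'sum of successful order value per reseller, in INR');
INSERT INTO edge VALUES
  ('urn:c:mart.reseller_gmv_daily.gmv_inr', 'urn:g:reseller_gmv', 'implements'),
  ('urn:c:silver.reseller_orders.gmv_inr',  'urn:g:reseller_gmv', 'implements'),
  ('urn:d:reseller_gmv_dashboard',          'urn:g:reseller_gmv', 'implements');
""")

# Search by business concept, not column name: every implementer of the term.
impls = [r[0] for r in db.execute("""
    SELECT e.src_urn FROM term t
    JOIN edge e ON e.dst_urn = t.urn AND e.kind = 'implements'
    WHERE t.name = 'Reseller GMV'
""")]
print(len(impls))  # 3
```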

The glossary's job in 2026 has expanded past being a definition lookup. The semantic layer (Build 13 — dbt MetricFlow, Cube, LookML) anchors metric definitions back to glossary terms, so a change to the glossary's "Reseller GMV" definition propagates to every dashboard that implements the metric. The relationship is: glossary term → metric definition → dbt model → mart table → BI dashboard, with the catalog stitching the chain so that an analyst questioning "is this number right?" can walk back to the canonical definition without ambiguity. Teams that have shipped this stitched chain report that the number of "is this metric defined the same way in dashboard A and dashboard B?" disputes drops by an order of magnitude — because the catalog enforces that both dashboards point to the same definition or surfaces the divergence explicitly.

How a catalog gets adopted (or doesn't)

A catalog is only as valuable as its actual usage. The deployment pattern that works at a 200-engineer Indian data team — observed across Razorpay, Swiggy, Meesho, and Cred — is staged, with each phase delivering visible value before the next is started.

Phase 1: ingestion + search (weeks 1–4). Stand up DataHub or OpenMetadata. Wire dbt manifest ingestion, warehouse information-schema ingestion, and a basic search UI. Goal: an analyst can find any table in the platform by name. This is enough to retire 30% of "where is X?" Slack threads. Success metric: the search bar gets used 100+ times per week within month one.

The instinct to skip Phase 1 in favour of more ambitious phases is wrong. A search bar that returns the right table by name is a small thing that compounds into the foundation of everything else, and a team that cannot ship Phase 1 cleanly will not ship Phase 4 at all. Spend the four weeks. Resist the urge to add more sources before the basic search ranks well on the sources you have. The teams that try to do Phases 1 and 2 in parallel typically end up doing both badly — the search is unreliable and the descriptions are partially filled, neither is trustworthy, and the analysts who tried the catalog in week six conclude it doesn't work and don't return for months.

Phase 2: ownership and descriptions (weeks 5–10). Run a backfill workshop — get every dbt model owner to fill in column descriptions for their tier-1 (regulated, customer-facing) tables. Establish a pull-request gate: any new dbt model without column descriptions blocks the merge. Goal: every tier-1 column has a description, every table has an owner. Success metric: reduce time-to-first-question for a new analyst from days to hours.

The PR gate is the most important deliverable of Phase 2, more important than the backfill itself. Without the gate, every newly-created column starts undocumented and the description-coverage rate decays continuously despite the backfill. With the gate, the rate climbs monotonically — every merge either ships descriptions or doesn't merge. The backfill cleans up the existing pile; the gate prevents future debt. Teams that ship the backfill without the gate watch their coverage rate slide back to its starting point within 9 months as new tables outpace re-curation.
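
The gate itself can be a few lines of CI. A sketch that reads dbt's manifest (here, a tiny inline stand-in for target/manifest.json) and lists model columns shipping without descriptions; in CI, a non-empty list would fail the job and block the merge:

```python
import json

def undocumented_columns(manifest):
    """List model columns in a dbt manifest that ship without a description."""
    missing = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        for col_name, col in node.get("columns", {}).items():
            if not (col.get("description") or "").strip():
                missing.append(f"{node['schema']}.{node['name']}.{col_name}")
    return missing

# In CI you would parse the real artifact, target/manifest.json, produced
# by `dbt compile`; this tiny inline manifest stands in for it.
demo = json.loads("""{"nodes": {"model.demo.t": {
    "resource_type": "model", "schema": "mart", "name": "reseller_gmv_daily",
    "columns": {"gmv_inr":     {"description": ""},
                "reseller_id": {"description": "primary key"}}}}}""")

missing = undocumented_columns(demo)
if missing:
    print("blocked: columns without descriptions:", ", ".join(missing))
    # In CI: exit non-zero here to fail the job and block the merge.
```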

Phase 3: lineage integration (weeks 11–18). Wire OpenLineage events from Airflow / dbt / Spark into the catalog. Show upstream/downstream lineage on every column page. Goal: the catalog answers "where did this number come from?" without requiring a separate lineage UI. Success metric: lineage page views per week climb past search page views.

Lineage is also where the catalog's credibility is most easily damaged. An incorrect lineage edge during an incident sends the on-call to the wrong place, costs them an hour, and makes them distrust the catalog for the next month. The investment in lineage freshness SLAs (chapter 33) and lineage staleness banners is what protects this trust through the rough edges of real production. A staleness banner is not a UX failure — it is the catalog being honest about what it knows. Hiding staleness is the failure; showing it is the design. The teams that understand this ship banners loudly; the teams that try to maintain a perfect facade end up with users who silently stop trusting the data.

Phase 4: workflow integration (weeks 19–28). This is the make-or-break phase. The catalog must land where the work happens: Slack slash commands (/lookup gmv_inr), GitHub PR comments showing blast radius of a SQL change, BI tool sidebars showing column metadata on hover, IDE integrations for analysts writing SQL. Goal: 70% of catalog interactions happen inside other tools, not the catalog UI. Success metric: catalog API calls outpace catalog web sessions.

A concrete Phase-4 deliverable shape that has worked across multiple Indian fintechs: a Slack bot that responds to /cat <name> with a card showing the column's description, owner, freshness status, lineage upstream/downstream, and a link to the full catalog page; a GitHub Action that comments on every dbt-touching PR with a list of downstream consumers ranked by tier and the on-call team to notify; a Looker LookML extension that adds a "view catalog metadata" right-click on any field; a JupyterLab extension that autocompletes column names from the catalog with descriptions inline. Each of these took 1–2 weeks to build and each delivered measurable usage within days of launch. The platform team that built all four reported that the GitHub Action alone caught 30+ would-be-breaking SQL changes in the first quarter, each of which would have caused at least one downstream incident.

The teams that fail at catalog deployment almost always fail at Phase 4. They ship a beautiful catalog UI that the platform team uses but nobody else opens. The actual users — analysts inside Looker, engineers inside GitHub, data scientists inside Jupyter — are not going to context-switch to a separate web app for a column description. The integrations are mandatory, not optional.

A specific Phase-4 anti-pattern: the catalog as a documentation system. Teams that frame the catalog as "where you put documentation" inevitably end up with a stale wiki under another name. Catalogs win when they are framed as operational systems — the on-call's first stop during an incident, the analyst's autocomplete while writing a query, the governance team's dashboard for regulatory readiness. Documentation is a side-effect of operational use, not the goal. Why this distinction matters in practice: documentation systems decay because nobody re-reads them after writing. Operational systems stay current because their staleness causes pain right now. A catalog whose freshness page is always green stays accurate; a catalog that is "documentation" goes stale within months and becomes net-negative.

The framing also dictates who owns the catalog inside the organisation. Documentation-framed catalogs end up owned by a "data governance team" that has no on-call rotation and no production responsibilities; the catalog becomes their full-time job and the engineering teams treat it as someone else's problem. Operational-framed catalogs are owned by the data platform team — the same team that operates the warehouse, the orchestrator, and the lake — and the catalog gets the same engineering rigour as those systems. The Razorpay and Cred data platform teams both shifted their catalog ownership from a separate governance group to the platform engineering group in 2024–2025, and both reported that catalog quality improved measurably within two quarters of the move. The catalog is infrastructure; treat it like infrastructure.

The metric that predicts long-term catalog success is not coverage (what percentage of entities have descriptions) but read-to-write ratio. A catalog where every metadata write is read 50+ times in the next 30 days is healthy. A catalog where descriptions are written, never read, and quietly age into staleness is dying. The Razorpay platform team tracks this per-entity and demotes (hides from default search) entries that have a write-only pattern after 90 days, surfacing them only on explicit search. This pruning is what keeps the catalog feeling alive — the alternative is the Confluence wiki of 2014, full of dead pages nobody opens.

A complementary metric the Cred team uses is the ask-back rate: when an analyst can't find what they need in the catalog and falls back to Slack, the platform bot detects the question and logs it. The ratio of catalog-answered to Slack-answered data questions is tracked weekly. Healthy catalogs have this ratio climbing past 70% within a year of launch; failing catalogs plateau below 30%. The interesting property of this metric is that it captures both quality and adoption in one number — a catalog that is both complete and discoverable wins on it; a catalog that fails on either coverage or UX falls behind. The teams that publish the ask-back rate as a quarterly KPI have catalogs that improve continuously; the teams that don't measure it accept gradual decay.

Where catalogs fail in production

The "two catalogs" problem. A team buys a vendor catalog (Atlan, Collibra) for the business users and runs DataHub for the engineering team. Inside a year, the two catalogs disagree about ownership, descriptions, and lineage — because each ingests from slightly different sources at slightly different times. Both are wrong, in opposite ways. The fix is to pick one substrate and run a thin adapter for the other side's needs; the teams that have shipped two-catalog architectures uniformly regret it. The variant that does work: one catalog as the system of record (typically DataHub or OpenMetadata for engineering-led teams, Atlan or Collibra for business-led ones), with the other surface re-skinned via the system of record's API. The skin is not its own catalog — it is a view layer reading authoritative metadata from one source. This pattern preserves the user experience the business team wanted while avoiding the dual-write inconsistency that destroys multi-catalog deployments.

The PII tagging gap. DPDP-2023 in India requires every column containing personal data to be classified. Automated PII classifiers (Spirion, BigID, the open-source Piiranha) catch obvious cases — emails, phone numbers, Aadhaar — but miss derived PII (hash(email + salt)) and contextual PII (merchant_name is PII when joined with transaction_amount for a sole-proprietor merchant). Catalogs that rely solely on auto-classification leave compliance gaps; catalogs that require manual review for every column slow data team productivity to a crawl. The middle path: auto-classify the obvious cases, queue the ambiguous ones for human review with a 7-day SLA, expire untagged columns to "PII-suspected" after 30 days. Why expiry is non-negotiable: in regulated industries, the default-to-private posture is a legal requirement. A column that has never been classified is treated as potentially PII until a human confirms otherwise — which forces classification rather than letting it perpetually slide.
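
That middle path can be sketched as a three-state lifecycle; the regex patterns, state names, and expiry window below are illustrative assumptions, not any specific classifier's rules:

```python
import re, time

OBVIOUS = {                     # auto-classifiable name patterns (illustrative)
    "email":   re.compile(r"email"),
    "phone":   re.compile(r"phone|mobile"),
    "aadhaar": re.compile(r"aadhaar"),
}
REVIEW_EXPIRY = 30 * 24 * 3600  # 30 days to human review, then default-to-private

def classify(column_name, created_at, now=None):
    now = now or int(time.time())
    name = column_name.lower()
    for label, pattern in OBVIOUS.items():
        if pattern.search(name):
            return f"pii:{label}"       # obvious case: no human in the loop
    if now - created_at > REVIEW_EXPIRY:
        return "pii-suspected"          # unreviewed past expiry: treat as PII
    return "pending-review"             # queued for human review (7-day SLA)

now = int(time.time())
print(classify("customer_email", now))              # pii:email
print(classify("merchant_name", now))               # pending-review
print(classify("merchant_name", now - 40 * 86400))  # pii-suspected
```

The important property is the last branch: a column nobody classified does not stay unclassified, it decays into "pii-suspected", which is the default-to-private posture the regulation requires.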

The contextual-PII problem is the part that auto-classification will not solve in 2026 and probably will not solve for years. A column called merchant_name is harmless on its own; joined with payment volume on a single-row level it identifies a specific small business and reveals their revenue. The classification depends on the join — meaning the catalog has to track not only "is this column PII?" but "is this column PII when materialised together with these other columns?". The Razorpay platform team encodes this as a "co-classification" rule attached to view definitions: certain join patterns elevate the privacy class of the resulting view past either input's class. This is where catalog-driven governance pulls ahead of static tagging — the rules are first-class catalog entities themselves, audited and reviewed alongside the data they protect.

Stale lineage as confident wrongness. A catalog displays lineage that was true last week but isn't anymore — a dbt model's source was renamed, the lineage extractor ran but didn't propagate, the catalog still shows the old edge. The on-call traces an incident through the catalog, follows the lineage, and ends up at the wrong column. This is worse than no lineage. Production catalogs need lineage freshness SLAs (regenerate on every dbt build, never serve lineage older than 4 hours) and must surface staleness ("this lineage is 6 hours old; the table was changed 2 hours ago") rather than hide it.

The deprecated-but-still-queried table. A table marked deprecated in the catalog is still queried 50,000 times per day because three legacy dashboards never migrated. The catalog's "deprecated" tag is information without consequence. The fix that works: pair deprecation tags with usage decay — a deprecated table that still has high read traffic auto-creates a Jira ticket against the table owner; a deprecated table that has not been queried in 90 days auto-schedules for deletion. Tags without enforcement become noise; tags with enforcement become governance.

The PhonePe data platform team automated this in 2025 with a deprecation lifecycle: tag → email migration request to all readers → 60-day grace → auto-create blocking PR for each consumer's repo → enforced cutoff date stored as catalog metadata. Once a deprecation moves past the cutoff, the catalog still returns the table in search but with a warning ribbon, and after a further 30 days the underlying table is hidden and can be restored only by its owner. This automation cut their deprecated-table count from 1,200+ stale entries to under 200 within six months — without any single deprecation requiring more discussion than the original tagging decision.

Search that returns the wrong column first. The most common everyday catalog failure mode: an analyst searches revenue and the top result is silver.test_data.revenue_demo — a test table from 2022 — instead of mart.merchant_revenue_daily.gmv_inr, the production answer. Search results have to be ranked by usage and tier, not just by string match. Production catalogs use a relevance score that combines (a) string match, (b) entity tier (gold > silver > bronze), (c) recent usage (queries per day in last 30 days), (d) freshness (recently updated entities rank higher), and (e) ownership (entities with active owners rank higher than abandoned ones). The teams that skip this end up with a catalog where the right answer is on page 3 — which means the catalog is unused.
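A minimal version of that blended score might look like the following; the weights and saturation points are illustrative, not tuned production values:

```python
TIER_BOOST = {"gold": 1.0, "silver": 0.5, "bronze": 0.2}

def relevance(query, entity):
    """Blend the five signals from the text into one score.
    entity: dict with name, tier, queries_per_day, days_since_update, has_owner."""
    tokens = set(query.lower().split())
    name_tokens = set(entity["name"].lower()
                      .replace(".", " ").replace("_", " ").split())
    string_match = len(tokens & name_tokens) / max(len(tokens), 1)
    usage = min(entity["queries_per_day"] / 100.0, 1.0)   # saturate at 100 q/day
    freshness = 1.0 / (1.0 + entity["days_since_update"])
    return (3.0 * string_match
            + 2.0 * TIER_BOOST.get(entity["tier"], 0.0)
            + 2.0 * usage
            + 1.0 * freshness
            + 0.5 * (1.0 if entity["has_owner"] else 0.0))

prod = {"name": "mart.merchant_revenue_daily", "tier": "gold",
        "queries_per_day": 400, "days_since_update": 1, "has_owner": True}
stale = {"name": "silver.test_data.revenue_demo", "tier": "bronze",
         "queries_per_day": 0, "days_since_update": 900, "has_owner": False}
```

With these inputs, both tables match the string "revenue" equally, but the tier, usage, freshness, and ownership terms push the production mart well above the 2022 test table — which is exactly the inversion the naive substring search gets wrong.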

Schema-evolution gaps in ingestion. A catalog that ingests on a 15-minute schedule misses every schema change that happens and gets queried within that 15-minute window — and in a fast-moving warehouse, that window catches dozens of events per day. The mitigations stack: ingest on warehouse change events (Snowflake streams, BigQuery audit logs) instead of polling; show the analyst a freshness banner on every entity page so they know the metadata they are seeing might be stale; cache-bust on user-initiated refreshes. The Cred platform team moved from 15-minute polling to event-driven ingestion in late 2025 and reported that the average staleness for a queried entity dropped from 7 minutes to 22 seconds — small in absolute terms, but enough to eliminate the "wait, that column doesn't exist anymore?" failure mode that was causing 4–5 incidents per quarter.

The metadata-pipeline as second-class citizen. The pipelines that ingest metadata into the catalog are themselves data pipelines, with all the same failure modes — but most teams treat them as setup-and-forget. The metadata ingestion fails silently for 10 days; the catalog gradually becomes wrong; nobody notices because there are no SLAs on metadata pipelines. The fix is to treat catalog ingestion with the same rigor as production data pipelines: monitoring, retries, on-call coverage, freshness checks. The Razorpay platform team runs their catalog ingestion on the same Airflow cluster as production marts, with the same SLA tier. Anything else, in their experience, leads to slow erosion that is hard to detect.

The "everything is a tier-1 asset" tag inflation. A team that starts tagging tables by tier (gold / silver / bronze) often ends up with 80% of tables tagged gold within 18 months — because every table owner thinks their work is critical, and there is no governing rule for downgrades. Once everything is tier-1, the tier system is information without signal, and the on-call cannot prioritise during an incident. The discipline that works: tier upgrades require sign-off from a governance council; tier downgrades happen automatically based on usage decay (a tier-1 table queried fewer than 100 times per week for two consecutive months auto-demotes to tier-2 with a notification to the owner). The Cred data platform team enforces a quarterly tier audit where every gold-tagged table must justify its classification or be demoted; the result is a healthy 12–15% gold ratio that actually means something.
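The auto-demotion half of that discipline reduces to a window check over weekly query counts. The thresholds come from the rule described above; the function name is hypothetical:

```python
def should_auto_demote(weekly_query_counts, threshold=100, window_weeks=8):
    """True when the last `window_weeks` weeks (~two consecutive months)
    all fell under `threshold` queries per week."""
    recent = weekly_query_counts[-window_weeks:]
    # Not enough history yet -> no demotion; avoids demoting new tables.
    return len(recent) == window_weeks and all(c < threshold for c in recent)
```

Note the asymmetry the text insists on: this function only ever demotes. Upgrades go through the governance council, so tier inflation has no automated path back in.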

The catalog as governance theatre. Some teams adopt a catalog primarily to satisfy a regulator's checklist — DPDP-2023 requires PII inventory, the catalog has a PII tagging feature, the box gets ticked. The catalog gets populated minimally for the audit and then ignored. Six months later when a real PII deletion request lands, the tags are stale and the catalog cannot answer the question. Governance theatre catalogs fail their next real audit even though they passed their checklist audit, because the auditor in 2026 increasingly asks "show me how a deletion request was handled last month" rather than "show me your inventory". The fix is to ensure that catalog usage is part of the actual operational workflow — the deletion runbook requires a catalog query before deletion proceeds — so the catalog cannot go stale without breaking real work.

Going deeper

The DataHub URN scheme and why identity is the hardest catalog problem

DataHub's URN scheme — urn:li:dataset:(platform, qualified_name, env) for datasets, urn:li:schemaField:(<datasetUrn>, fieldPath) for columns — exists to solve identity once across an entire data ecosystem. The platform component handles the "is this Snowflake or BigQuery?" question; the qualified_name handles "is this mart.x or MART.X?" with platform-specific case sensitivity rules; the env component handles "prod vs staging vs dev". The same physical column referenced from dbt, OpenLineage, BI tools, and the warehouse all collapse to one URN — and every metadata fact about that column attaches to one entity. The reason this is hard in practice: every tool has its own naming convention, and the catalog's URN-resolver has to know all of them. The OpenLineage Naming Spec (2024) attempted to standardise this across tools, with mixed adoption. As of 2026, most catalogs ship a "URN reconciler" component that handles the cross-tool identity translation; teams that try to use raw qualified names without a reconciler end up with duplicate entities.
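A hedged sketch of URN construction in the DataHub style shows why the reconciler matters: normalisation has to happen before the string is built, or the same table splits into duplicate entities. The per-platform case rules here are deliberately simplified assumptions:

```python
def dataset_urn(platform, qualified_name, env="PROD"):
    """Build a DataHub-style dataset URN. Simplified case rule: Snowflake's
    unquoted identifiers are case-insensitive, so lowercase before keying;
    BigQuery names are case-sensitive, so leave them alone."""
    if platform == "snowflake":
        qualified_name = qualified_name.lower()
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified_name},{env})"

def schema_field_urn(ds_urn, field_path):
    """Column identity hangs off the dataset URN, so every metadata fact
    about a column attaches to exactly one entity."""
    return f"urn:li:schemaField:({ds_urn},{field_path})"

# dbt and the warehouse referring to the "same" table collapse to one URN:
assert dataset_urn("snowflake", "MART.RESELLER_GMV_DAILY") == \
       dataset_urn("snowflake", "mart.reseller_gmv_daily")
```

The real reconciler is this function multiplied across every platform's quoting and casing quirks — which is exactly why shipping one centrally beats letting each ingestion source build its own keys.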

Search relevance: why the third result is the catalog's hardest UX problem

The single feature that determines whether a catalog gets used is the quality of its search ranking. Naive substring search returns the first matching string; production catalogs use a learning-to-rank model with features like entity tier, recent usage, freshness, owner activity, query-string token overlap, and historical click-through rate. The training signal is the catalog's own click logs: when an analyst searches revenue and clicks the third result, the model learns that the third result was the right answer for that query and demotes the first two next time. DataHub ships a default ranker tuned on LinkedIn's internal usage; Atlan ships one tuned on financial-services usage patterns. Teams that retain the default ranker without tuning to their own click logs typically see a 30–40% search-success-rate gap compared to teams that retrain quarterly on their own data — which is the difference between analysts using the catalog daily vs. occasionally. The Razorpay platform team reported that a single quarterly retrain of their search ranker, costing about 8 engineering-days per cycle, raised first-result click-through from 41% to 67% — measurably more than any other catalog improvement they made that year.

How dbt manifest.json drives 60% of catalog ingestion in 2026

dbt's manifest.json, regenerated on every dbt build, contains: every model's SQL, parsed dependencies (model-level lineage), descriptions, tests, tags, ownership metadata from meta: blocks, and — for models with contracts (dbt 1.5+) — declared column names and types. This single artifact is the highest-leverage source for any catalog: one cheap-to-parse JSON file gives lineage, descriptions, tests, and ownership for the entire dbt-managed portion of the platform. The pattern that works: pipe manifest.json to the catalog on every successful dbt build (5–15 minutes after the run finishes), parse it server-side, update the corresponding entities and edges. This is why dbt + DataHub became the de facto Indian fintech catalog stack in 2024–2026: the integration is one config file. The teams that built custom catalogs without dbt-manifest ingestion uniformly added it within 6 months of going live.
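A minimal manifest extractor might look like this. The key paths (`nodes`, `depends_on.nodes`, `columns`, `meta`) follow the dbt manifest schema; the output shape (entity dicts plus edge tuples) is an illustrative assumption about what the catalog ingests:

```python
import json

def extract_from_manifest(manifest: dict):
    """Pull catalog entities and model-level lineage edges out of a parsed
    dbt manifest.json."""
    entities, edges = [], []
    for unique_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots for this sketch
        entities.append({
            "urn": unique_id,                            # e.g. model.shop.gmv_daily
            "name": node["name"],
            "description": node.get("description", ""),
            "owner": node.get("meta", {}).get("owner"),  # from a meta: block
            "columns": {col: meta.get("description", "")
                        for col, meta in node.get("columns", {}).items()},
        })
        for upstream in node.get("depends_on", {}).get("nodes", []):
            edges.append((upstream, unique_id))          # edge: upstream -> model
    return entities, edges

# Typical call, right after a successful dbt build:
# entities, edges = extract_from_manifest(json.load(open("target/manifest.json")))
```

Everything the chapter attributes to the manifest — names, descriptions, owners, lineage — falls out of one loop over one file, which is the whole argument for wiring this in first.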

Catalogs at petabyte scale — Flipkart's metadata graph

A Flipkart-scale catalog has on the order of 10⁶ tables, 10⁷ columns, 10⁸ lineage edges, and ingests around 10⁵ metadata events per day (every dbt run, every Airflow task, every Spark job, every Looker query). Storing this naively in Postgres collapses under graph queries; storing it in a generic graph DB makes simple lookups slow. The pattern that works at this scale, used by Flipkart, Swiggy, and PhonePe: dual storage — a graph store (DataHub uses Neo4j; OpenMetadata uses JanusGraph or a sharded RDBMS with materialized closure tables) for traversal queries, plus an Elasticsearch / OpenSearch index for full-text search and faceted filtering. Writes go to both stores via a Kafka-backed event log; reads route based on query shape. The total infrastructure cost for a Flipkart-tier catalog is 8–12 nodes (search) plus 6 nodes (graph) plus 4 nodes (event ingestion) — roughly ₹40–60 lakh annual cloud spend at 2026 prices, which is small compared to the warehouse it indexes (₹15–30 crore annual spend for the warehouse itself).

The hot-path lookups (column-page render, search, lineage one-hop) need sub-200ms p99 to feel native inside other tools; everything else (multi-hop closure queries, asset-graph traversals across the whole platform) tolerates seconds because they happen during incident triage, not during analyst workflow. Production catalogs split traffic accordingly: a hot read-path served from a denormalised cache rebuilt every few minutes, a cold path that hits the underlying graph store for arbitrary queries. The teams that try to serve everything from one tier discover within the first major incident that their catalog buckles exactly when the on-call needs it most — because incident response involves running expensive multi-hop traversals that starve the hot path of resources.
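The shape-based routing described above can be sketched as a small dispatcher; the query-kind names and tier labels are illustrative:

```python
def route_read(query: dict) -> str:
    """Route by query shape: hot path for point lookups, search, and one-hop
    lineage; cold path (the graph store) for multi-hop traversals."""
    kind = query["kind"]
    # Entity-page renders and search need sub-200ms p99 -> denormalised cache.
    if kind in ("entity_page", "search"):
        return "hot_cache"
    # One-hop lineage is precomputable; deeper traversals are not.
    if kind == "lineage" and query.get("hops", 1) <= 1:
        return "hot_cache"
    # Closure queries and asset-graph walks tolerate seconds -> graph store.
    return "graph_store"
```

The point of routing at the edge rather than inside one storage tier is isolation: an on-call running a five-hop blast-radius traversal during an incident lands on the graph store and cannot starve the analysts' entity-page renders.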

Why governance and the catalog have started merging in 2026

Through 2022, catalogs and governance tools were separate categories — catalogs answered "what data exists?" and governance tools (Apache Ranger, Privacera, Immuta) answered "who can access what?". By 2026, the line has blurred: DPDP-2023 in India and similar regulations globally require that the answer to access control depends on column-level metadata (is this PII? what tier? what retention class?), and that metadata lives in the catalog. Unity Catalog ships with row- and column-level access control built on metadata tags. DataHub added access policies in 2025. The convergence is structural: the same metadata that powers search and lineage also powers governance, so the systems either merge or become tightly coupled. Teams that still run separate catalog and governance stacks in 2026 are either at smaller scale (where the duplication is tolerable) or running on legacy architectures.

The implication for catalog choice is concrete. A team picking a catalog in 2026 is implicitly picking a governance framework, because the governance policies will reference the catalog's tags and ownership records. Switching the catalog later means rewriting every governance policy that referenced the old catalog's URN scheme. The teams that have switched between catalogs (DataHub → OpenMetadata, or vice versa) describe it as a 6–9 month migration, much of which is governance-policy translation rather than metadata re-ingestion. Pick the substrate carefully, and pick once.

Where this leads next

The next chapter (31, OpenLineage and Marquez) covers the protocol layer that makes lineage events portable across the catalog ingestion paths described here. Chapter 32 (data contracts) covers the producer-side metadata that catalogs increasingly enforce as gates. Chapter 33 (freshness SLAs) covers the operational metadata that catalogs surface as the green/red dot on every entity page.

These three chapters together complete the Build 5 story: lineage tells you the topology, the catalog tells you the meaning, contracts tell you the guarantees, freshness tells you the current state. None of them works without the others. The catalog is the entity-keyed substrate; lineage is the edges; contracts and freshness are the labels. Skipping any one of these layers leaves the others operationally incomplete — which is why the whole build is a coherent unit, not a menu of independent choices.

Beyond Build 5, the catalog becomes the substrate for the metric layer (Build 13, semantic layers) — the catalog's glossary terms link to the metric definitions in dbt's MetricFlow or Cube; the catalog enforces governance on metric usage. It also becomes the substrate for feature stores (Build 15) — features are first-class catalog entities with the same ownership, lineage, and quality semantics as columns.

The progression is consistent. Each later Build assumes the catalog exists and uses it as the join point for its own metadata. Without a working catalog, every later Build re-invents partial catalog functionality inside its own tool — a feature store with its own ownership model, a metric layer with its own glossary, a governance system with its own PII tags — and the same "which definition is right?" problem reappears at every layer. The catalog is the place these scattered systems agree, and the discipline of maintaining one shared catalog is what makes a data platform feel coherent rather than balkanised.

The framing the senior data platform engineers at Cred and Razorpay use for catalog investment: it is a productivity multiplier, not a tool. A catalog that is well-adopted reduces every analyst's time-to-answer by 30–50%, every new joiner's onboarding from weeks to days, every audit's prep effort by an order of magnitude. Across a 200-person data org, that compounds into thousands of person-hours per quarter — far more than the catalog's infrastructure and curation cost. The teams that delay catalog adoption by a year usually do so because they are "too busy with pipelines"; the irony is that every one of those pipelines costs more to build and maintain because the team has no catalog. The compounding goes the wrong way without it.

There is a second cost that the productivity framing under-counts: the catalog reduces the probability of incidents in the first place, not just the time to resolve them. A SQL change that would have broken three downstream dashboards gets caught at PR review because the GitHub Action surfaced the blast radius. A schema change that would have orphaned a regulated dataset gets blocked because the catalog flagged the dataset as tier-1. A new analyst who would have computed the wrong metric finds the right column on first search instead of the wrong one. The incidents that didn't happen are the most under-counted benefit of any platform investment, and the catalog is unusual in that the prevention rate can be measured directly via the GitHub Action's "would-have-broken" metric.

A practical bar for "is the catalog working?": ask any analyst, picked at random, to find the definition and owner of the column underlying the most-used metric on the company's main dashboard. If they can do it in under 30 seconds without asking anyone, the catalog is working. If they can't, every other metric in the org is being computed against partly-understood inputs — and the cost shows up as wrong numbers in board decks, slow incident response, and audits that stretch into months. The catalog's job is to make that 30-second answer routine.

A second bar that complements the first: ask the on-call data engineer to point at the catalog the next time they get paged for a data incident. If the catalog is the first tool they open — before Slack, before the warehouse SQL editor, before the BI tool — it is delivering its operational value. If they instinctively reach for Slack first, the catalog is still a documentation system, not an operational one. The transition between these two states is what separates the data platforms that feel like infrastructure from the ones that feel like a collection of tools. Every Build 5 chapter from here on — contracts, freshness, observability — is an investment that compounds only if the catalog has crossed this line.

References

  1. DataHub Project documentation — the LinkedIn-originated open-source catalog that became the most widely deployed in Indian fintechs by 2025.
  2. OpenMetadata documentation — the alternative open-source catalog with strongest data-quality + lineage joint workflows.
  3. Unity Catalog: Databricks' governance layer — the governance-first catalog that ships with Databricks.
  4. Atlan engineering blog on the active metadata movement — the vendor-side framing of why catalogs need to be operational, not just informational.
  5. Maxime Beauchemin, "The rise of the data catalog" (2020) — the Airbnb-era essay that framed catalogs as the missing primitive of modern data platforms.
  6. DPDP Act 2023, India — the regulatory driver for catalog-mediated PII tracking in Indian data platforms.
  7. Column-level lineage: why it's hard and why it matters — chapter 29, the granular lineage surfaced on every catalog column page.
  8. Shirshanka Das et al., "DataHub: A generalized metadata search & discovery tool" (2020 LinkedIn engineering) — the original architectural paper for the URN-based metadata graph design.