Scott Wueschinski

The retail data lake retrofit nobody is publishing

Three years ago every Tier 1 retailer built a data lake. The AI conversation moved on and the lake went quiet. Here's the retrofit playbook winning retailers are quietly running.

Three years ago, every Tier 1 retailer’s transformation budget had a data lake line item. Twenty to fifty million dollars each. Same vendor parade. Same architectural diagrams. Same promise: unify the data, unlock the insights, modernize the stack.

Now the conversations have moved on. Agentic. GenAI. AI-native. Foundation models. The data lake has gone quiet.

It didn’t go away. It went underutilized.

This article is the long-form retrofit playbook — the one nobody is publishing because the consultants who sold the original lake aren’t motivated to surface what got missed, and the vendors selling agentic AI on top of a 2023 lake aren’t motivated to talk about why the substrate is the problem.

What got built versus what got needed

The 2022-2024 data lake builds were optimized for dashboards. The conversation at every steering committee was about consolidation, single-source-of-truth, and getting Tableau / Looker / Power BI to render the same number for two different VPs.

That conversation produced a substrate that is, on its own terms, working. Dashboards render. Queries run. The CFO can pull the weekly P&L cut and the CMO can pull the campaign attribution view, both pointing at the lake.

What it didn’t produce is a substrate AI agents can reason against. The two consumers — humans reading dashboards, agents executing decisions — have very different requirements:

| Requirement | Dashboard consumer | Agent consumer |
| --- | --- | --- |
| Data freshness | Daily or weekly is fine | Often needs minutes-to-hours |
| Schema understanding | Implicit, with humans applying judgment | Explicit, machine-readable |
| Field semantics | Tribal knowledge captured in the dashboard query | Must be encoded as data |
| Lineage | Rarely consulted | Required for agent confidence and recovery |
| Confidence/quality flags | Optional | Mandatory |
| Forgiveness for messy inputs | High (humans notice and adjust) | Low (agents act on the mess at scale) |

The lake built for the dashboard consumer does not serve the agent consumer. AI agents need clean, queryable, contextually tagged data more than your dashboards ever did, because agents act and dashboards merely report.

That’s the retrofit gap.

What breaks when agents read the unretrofitted lake

Three things break, in increasing order of cost.

1. Semantic context is missing

Your fields have names that made sense to the dashboard team. They don’t make sense to an agent that needs to reason about whether a SKU is promoted, returnable, age-restricted, or seasonally indexed.

sku_status_flag = 'A' means “active” to the dashboard team because they wrote the query that filters on it. To the agent, it means “literal string A,” and the agent has to guess at the encoding — usually badly.

Multiply this across thousands of fields. Every field with an opaque encoding is a failure mode the agent runs into at production scale.
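One rough way to size the problem is to profile distinct values per field and flag the ones that look like opaque codes. The sketch below is a heuristic under stated assumptions, not a method from any specific retrofit: the two-character threshold, the function names, and the sample profile are all hypothetical.

```python
def looks_opaque(distinct_values: set[str], max_code_length: int = 2) -> bool:
    """Heuristic (assumption): every distinct value is a very short code."""
    return bool(distinct_values) and all(
        len(value) <= max_code_length for value in distinct_values
    )


def flag_opaque_fields(profile: dict[str, set[str]]) -> list[str]:
    """Return the names of fields whose values all look like opaque codes."""
    return sorted(name for name, values in profile.items() if looks_opaque(values))


# Hypothetical profile: field name -> distinct values observed in the lake.
sample_profile = {
    "sku_status_flag": {"A", "D", "P"},         # opaque single-letter codes
    "promo_ind": {"Y", "N"},                    # opaque yes/no indicator
    "sku_status": {"active", "discontinued"},   # already explicit
}

opaque_fields = flag_opaque_fields(sample_profile)
print(opaque_fields)
```

A list like this is only a starting backlog. Each flagged field still needs someone who knows the business meaning to write down what the codes stand for.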

2. Latency assumptions are wrong

Batch refresh windows that were fine for weekly reports are not fine for closed-loop scoring agents that need to react inside the conversion window.

The dashboard team accepted “yesterday’s data” because nobody is making operational decisions inside a 5-minute window from a Tableau dashboard. An agentic pricing system, an agentic merchandising system, an agentic customer-recovery system — those all need fresh-enough data to decide now.

The lake’s refresh cadence becomes a hard ceiling on agentic capability. Most retrofits surface this constraint within the first month.

3. Lineage is non-existent

Agents that act in a loop need to know which data is fresh, which is derived, which is canonical. Most lakes can’t answer that for any given field without an engineer to spelunk through the dbt or stored-procedure history.

Without lineage, the agent has no way to judge how far to trust a given input. Should it act anyway? Should it escalate? Should it route to a human? The right answer depends on data freshness and derivation chain, and the unretrofitted lake exposes neither.

The retrofit playbook

Three moves, in roughly this order. Each is a quarter of focused work for a Tier 1 retailer, less for smaller retailers.

Move 1: A semantic abstraction layer on top of the lake

Not a new lake. Not a new vendor. A layer — usually built on dbt, sometimes on a metric store, sometimes hand-rolled — that gives every important entity and field a stable, agent-readable identity and definition.

What “agent-readable” means in practice:

  • Every field has a human-language description. (“returnable_flag indicates whether the SKU is eligible for return per policy R-2024-08.”)
  • Every field has explicit allowed values, not opaque encodings. (sku_status_flag becomes sku_status: 'active' | 'discontinued' | 'pending_launch' | 'recalled'.)
  • Every entity has a stable ID that survives schema migration. (Critical: agents that reference a SKU by sku_id shouldn’t break when the underlying table gets renamed.)
  • Every metric has one canonical definition. (“Weekly comparable sales” means exactly one thing across the agent layer, even if the dashboard team has three variants.)

The semantic layer is the most labor-intensive of the three moves and the highest-ROI. Build it well and the next two retrofits are dramatically easier.
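As a minimal sketch of what one entry in such a layer could look like, assume each field is described by a small record that carries its description, its allowed values, and a mapping from the legacy encoding. The `FieldDefinition` class is hypothetical, and only the `'A'` to `'active'` mapping comes from the example above; the `D`, `P`, and `R` codes are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """Agent-readable definition of one field in the semantic layer."""
    name: str
    description: str
    allowed_values: tuple[str, ...]
    legacy_codes: dict[str, str]  # raw lake encoding -> explicit value

    def decode(self, raw: str) -> str:
        """Translate a raw lake value, failing loudly on anything unmapped."""
        if raw not in self.legacy_codes:
            raise ValueError(f"{self.name}: unmapped legacy code {raw!r}")
        value = self.legacy_codes[raw]
        if value not in self.allowed_values:
            raise ValueError(f"{self.name}: {value!r} is not an allowed value")
        return value


SKU_STATUS = FieldDefinition(
    name="sku_status",
    description="Lifecycle status of the SKU.",
    allowed_values=("active", "discontinued", "pending_launch", "recalled"),
    # Only 'A' -> 'active' is from the article; D, P, R are hypothetical codes.
    legacy_codes={
        "A": "active",
        "D": "discontinued",
        "P": "pending_launch",
        "R": "recalled",
    },
)

print(SKU_STATUS.decode("A"))  # active
```

Raising on an unmapped code is a deliberate choice in this sketch: an agent that cannot interpret a value should stop rather than guess.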

Move 2: Latency tiering, not latency optimization

Most data doesn’t need to be real-time. The retrofit identifies which 5-10% does, isolates it, and routes agents to the right tier. The other 90-95% stays off the streaming path, and most of it keeps its existing batch cadence.

Three tiers in most retrofits:

  • Tier 1 — sub-minute. Inventory state, current price, current promotion status. Streamed from the source-of-record system, not from the lake. The lake’s role here is lookup of slowly-changing metadata, not the live transactional data.
  • Tier 2 — sub-hour. Customer activity recency, campaign exposure, competitor pricing scrapes. Refreshed via change-data-capture or short-batch ETL. Acceptable lag for most agentic-merchandising and agentic-pricing decisions.
  • Tier 3 — overnight. Aggregations, forecasts, segmentation features. The traditional lake refresh cadence, untouched.

The mistake retailers make is trying to make everything Tier 1. That’s expensive and unnecessary. The right move is identifying the 5-10% that needs Tier 1 and leaving the rest in Tier 2 or Tier 3.
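The tiering rule can be sketched as "give each field the slowest tier that still meets its freshness requirement." The code below is an illustration under assumptions: the tier names and the exact bounds (1 minute, 1 hour, 24 hours) are stand-ins for "sub-minute", "sub-hour", and "overnight", and `assign_tier` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    max_staleness: timedelta  # assumed worst-case age of data in this tier


# Ordered slowest (cheapest) first so assignment prefers the cheapest tier.
TIERS = (
    Tier("tier_3_overnight", timedelta(hours=24)),
    Tier("tier_2_sub_hour", timedelta(hours=1)),
    Tier("tier_1_sub_minute", timedelta(minutes=1)),
)


def assign_tier(required_freshness: timedelta) -> Tier:
    """Return the slowest tier whose worst-case staleness meets the need."""
    for tier in TIERS:
        if tier.max_staleness <= required_freshness:
            return tier
    raise ValueError(f"no tier serves data fresher than {required_freshness}")


# Hypothetical requirements: a weekly forecast feature, a campaign-exposure
# signal that can lag two hours, and current price for a pricing agent.
print(assign_tier(timedelta(days=7)).name)     # tier_3_overnight
print(assign_tier(timedelta(hours=2)).name)    # tier_2_sub_hour
print(assign_tier(timedelta(minutes=5)).name)  # tier_1_sub_minute
```

Starting from the cheapest tier is what keeps the Tier 1 population small: a field lands on the streaming path only when nothing slower can meet its requirement.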

Move 3: Lineage as a first-class artifact

Every field exposed to an agent has a freshness tag, a derivation history, and a confidence band. Agents that see this information make better decisions and recover from bad data more gracefully.

What “first-class artifact” means:

  • The lineage isn’t documentation; it’s data the agent can query at decision time.
  • A scoring agent reading a customer’s “predicted churn probability” also reads “this prediction is from yesterday’s batch, derived from these five inputs, with confidence band 0.72-0.81.”
  • The agent’s decision logic uses the lineage. Stale data triggers a different code path than fresh data. Low-confidence triggers escalation.

Most lake builds skipped lineage entirely or buried it in dbt logs. Surfacing it as queryable data is a meaningful but tractable engineering lift — and it’s what turns the agent from a vibes-decision-maker into a system that knows when to act and when to wait.
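To make "lineage the agent can query at decision time" concrete, here is a minimal sketch of decision logic that branches on freshness and confidence. The `Lineage` record, the `decide` function, the input names, and both thresholds (24 hours, 0.7) are assumptions for illustration; a real deployment would set them per field and per decision.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Lineage:
    computed_at: datetime          # when the value was produced
    derived_from: tuple[str, ...]  # names of the upstream inputs
    confidence_low: float          # lower edge of the confidence band
    confidence_high: float         # upper edge of the confidence band


def decide(
    lineage: Lineage,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
    min_confidence: float = 0.7,
) -> str:
    """Pick a code path from lineage alone: 'refresh', 'escalate', or 'act'."""
    if now - lineage.computed_at > max_age:
        return "refresh"   # stale data takes a different path than fresh data
    if lineage.confidence_low < min_confidence:
        return "escalate"  # low confidence is routed to a human
    return "act"


now = datetime(2025, 6, 2, 12, 0, tzinfo=timezone.utc)
churn_lineage = Lineage(
    computed_at=now - timedelta(hours=20),  # yesterday's batch
    # Hypothetical input names standing in for "these five inputs".
    derived_from=("recency", "frequency", "spend", "returns", "support_contacts"),
    confidence_low=0.72,
    confidence_high=0.81,
)

print(decide(churn_lineage, now))  # act
```

The agent never inspects the churn score itself here. The point of the sketch is that the branch is chosen from lineage before the value is used.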

What this looks like in production

A retailer I worked with in 2025 had a $35M data lake from a 2023 build. The lake was working as a dashboard substrate; AI initiatives layered on top weren’t moving the needle.

The retrofit:

  • Quarter 1: semantic abstraction layer on dbt. ~600 fields normalized across SKU, customer, campaign, promotion, inventory entities. Agent-readable descriptions for every field. Stable IDs across schema migrations.
  • Quarter 2: latency tiering. ~8% of fields moved to Tier 1 (CDC + Kafka), ~22% to Tier 2 (short-batch). The remaining ~70% stayed on the existing nightly refresh.
  • Quarter 3: lineage surfaced as data. Every Tier-1 and Tier-2 field has freshness, derivation, and confidence available to agent queries. Agent decision logic updated to consume lineage.

Three quarters. ~$3M of engineering investment on top of the existing lake. The same agentic merchandising deployment that had been stuck in pilot for nine months hit production in the fourth quarter and started compounding feedback data immediately.

That’s the difference. The retrofit didn’t rebuild the lake. It made the lake serviceable for the consumer — agents — that the original build didn’t anticipate.

The CODN angle

The cost of doing nothing (CODN), which here means not retrofitting, is a multiplier on every AI initiative downstream.

A retailer with an unretrofitted lake spending $5M on agentic deployments in 2026 will get roughly half the value a peer with a retrofitted lake gets for the same spend. Not because the agents are different — they’re the same agents. Because the substrate is different.

Every quarter the retrofit gets deferred:

  • Adds another quarter of agentic deployments running on bad substrate
  • Adds another round of vendor pitches that won’t survive contact with the actual data
  • Adds another quarter of senior-engineer time spent debugging schema-and-semantics issues that the retrofit would have solved
  • Compounds against the cohort that retrofitted in 2024-2025 and is now shipping at full agentic capacity

By the time the gap is visible in margin or share, it’s already two cycles too late to close cheaply.

The CODN of staying on an unretrofitted lake through 2026 is roughly half the AI-investment ROI that the retrofitted cohort earns, plus a senior-engineering capacity tax that compounds.

The bottom line

Three years ago, every Tier 1 retailer built a data lake. Most of those lakes now sit underutilized while the AI conversation has moved on.

The retrofit (semantic abstraction layer, latency tiering, lineage as first-class data) is the unsexy work that makes every AI dollar downstream worth roughly twice what it would otherwise be worth. It is not a new lake. It is not a new vendor. It is the connective tissue between the lake you already built and the agentic systems you want to run on top of it.

If your lake doesn’t have a semantic layer, latency tiering, and first-class lineage today, you are underwater on every AI dollar that comes after it. That’s not a vendor problem. It’s a CDO problem. And the CDOs treating it like one are the ones whose names will be on the case studies in 2027.