
2026-04-23 / 11 MIN READ

The warehouse-first analytics rebuild for DTC operators

Pillar hub for DTC analytics infrastructure. Why GA4 alone is not the answer, what a warehouse-first stack looks like, and the sequence I use to rebuild it.

DTC operators keep asking the same question in different shapes. Why does our CAC report disagree with our ad spend? Why do three dashboards show three different revenue numbers for the same week? Why can we not answer "what is the LTV of someone who bought product X in Q3" without a week of spreadsheet work?

The answer is almost always the same. The stack is platform-first instead of warehouse-first. GA4 is doing the job of a warehouse, Shopify is doing the job of a BI tool, and Klaviyo is doing the job of a customer data layer. Each platform answers its own narrow question well and none of them answer the operator's question at all.

This is the hub article for the warehouse-first analytics rebuild I walk through with every DTC operator who has outgrown the defaults. It is opinionated and ordered. The subordinate articles each deep-dive one stage of the sequence.

[Diagram: the warehouse-first sequence, stage 1 of 5. Input: ~20 event names with typed payloads. Output: a YAML spec that acts as the contract. Stage 1 happens before any tooling and survives a replatform.]

Each stage is a contract with the next. Skipping one compounds into weeks of rework later.

Why warehouse-first, and why now

Five years ago, a mid-market DTC brand could run on GA4, a Klaviyo dashboard, and a weekly screenshot from Shopify. The ad platforms reported their own conversions, the numbers roughly lined up, and nobody was asking for cohort LTV by acquisition channel.

Three things broke that stack. iOS 14.5 and 17 took visibility out of the browser pixel. Consent mode v2 introduced legal gating that GA4 respects and your warehouse does not have to. And the operator questions got sharper because competition got sharper. "Which cohort is worth running another flight at?" is now a weekly question, not a quarterly one, and GA4 cannot answer it because GA4 does not know what a cohort is in your language.

A warehouse-first stack puts the source-of-truth raw data in a place you own and query. GA4, Meta, Klaviyo, and Shopify become upstream sources that feed the warehouse instead of tools you try to stitch together at read time. The operator question lives in SQL, not in a data-studio joining nightmare.

I shipped a warehouse-first rebuild for a DTC client in Q1 2026 as part of the analytics engine case study. That engine unified commerce, web, ads, and email under one roof, with a handful of production dashboards reading from a single warehouse. The sequence below is what I repeat, roughly, on every rebuild since.

The five-stage sequence

Before any tooling, there is a sequence. Skip a stage and the next one fails in expensive ways.

  1. Event schema and naming that survives replatform
  2. Raw Shopify and marketing data into a warehouse (BigQuery is my default)
  3. Server-side measurement protocol so GA4 is a consumer, not the source
  4. Modeling and joins in SQL, with cohort LTV as the first real query
  5. Dashboards that read from models, not from raw tables

The reason most DTC warehouses fail is that operators start at stage 5, reverse into stage 2 after a year of dashboard drift, and never do stage 1 at all. Event names collide with Shopify schema changes, field naming drifts between Meta and Klaviyo, and a replatform from one email provider to another forces a rename of the whole model.

The field notes on event schema design for DTC are where I start every new engagement. They are deliberately tool-agnostic. Once the schema exists on paper, every other stage becomes mechanical.

Stage 1: schema before tooling

Event schemas for DTC are deceptively small. Fewer than 20 events cover most of the operator surface. The trap is that each one has a dozen properties, and each property has three or four semantic variants across Shopify, Klaviyo, Meta, and GA4.

The minimum viable event list I use: page_view, product_view, collection_view, add_to_cart, remove_from_cart, view_cart, begin_checkout, add_shipping_info, add_payment_info, purchase, refund, subscription_renewal, subscription_cancelled, identified, newsletter_subscribed, email_opened, email_clicked, email_unsubscribed. Plus three or four brand-specific events.

Every event carries an event_id (UUID v4 minted at the edge), an occurred_at (ISO 8601 with offset), a customer_id (your canonical id, not Shopify's), an anonymous_id (cookie-based), and a per-event payload. The payload shape is what most teams get wrong.
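The envelope above can be checked mechanically before anything is written to the warehouse. A minimal sketch in Python: the field names match the envelope described here, but the specific validation rules (and the sample values) are my assumptions, not a published spec.

```python
# Validate the event envelope: event_id (UUID v4), occurred_at (ISO 8601
# with offset), customer_id, anonymous_id, and a dict payload.
import uuid
from datetime import datetime

REQUIRED = ("event_id", "occurred_at", "customer_id", "anonymous_id", "payload")

def validate_envelope(event: dict) -> list:
    """Return a list of problems; an empty list means the envelope is valid."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in event]
    if problems:
        return problems
    try:
        if uuid.UUID(event["event_id"]).version != 4:
            problems.append("event_id is not a UUID v4")
    except ValueError:
        problems.append("event_id is not a UUID")
    try:
        # fromisoformat parses ISO 8601; tzinfo is None when the offset is missing
        if datetime.fromisoformat(event["occurred_at"]).tzinfo is None:
            problems.append("occurred_at has no UTC offset")
    except ValueError:
        problems.append("occurred_at is not ISO 8601")
    if not isinstance(event["payload"], dict):
        problems.append("payload must be an object")
    return problems

ok = validate_envelope({
    "event_id": str(uuid.uuid4()),
    "occurred_at": "2026-04-01T09:30:00+00:00",
    "customer_id": "cus_123",
    "anonymous_id": "anon_abc",
    "payload": {"order_id": "1001"},
})
```

Run this at the edge, before the event leaves the client or the ingestion service, and a schema drift shows up as a rejected event instead of a silent warehouse corruption.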

The full schema lives in the event schema design for DTC pattern library, including the JSON shape for each event, the naming rules, and the migration path from Shopify's default naming to the one your warehouse is going to live with.

Stage 2: raw Shopify and marketing data into BigQuery

BigQuery is my default warehouse for DTC. On-demand pricing is $6.25 per TB scanned as of April 2026, which means a brand doing 10K orders a month can run their warehouse for under $50 of query cost. Storage is $0.02 per GB per month for active tables. Most of the early cost is Shopify backfill.
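The arithmetic behind the "under $50" claim is worth making explicit. A back-of-envelope check using the on-demand rates quoted above; the scan and storage volumes below are illustrative assumptions for a brand around 10K orders a month, not measurements.

```python
# BigQuery on-demand cost sketch: $6.25 per TB scanned, $0.02 per GB-month
# of active storage (rates as quoted in the article, April 2026).
PRICE_PER_TB_SCANNED = 6.25   # USD
PRICE_PER_GB_STORED = 0.02    # USD per month

def monthly_cost(tb_scanned: float, gb_stored: float) -> float:
    return round(tb_scanned * PRICE_PER_TB_SCANNED + gb_stored * PRICE_PER_GB_STORED, 2)

# ~5 TB of dashboard and model queries, ~200 GB of raw history per month
cost = monthly_cost(tb_scanned=5, gb_stored=200)  # 31.25 + 4.00 = 35.25
```

Even a sloppy model layer that scans 5 TB a month stays well under $50; the cost risk is unpartitioned raw tables, not the rate card.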

The Shopify to BigQuery path has three real options. Fivetran or Airbyte with the Shopify connector. A self-hosted dbt-shopify pipeline from Rittman's open-source project. Or direct streaming from Shopify's webhook and Admin API into a BigQuery ingestion service you write yourself. I have used all three.

Fivetran is the fastest to stand up. Airbyte is the most flexible. A self-hosted dbt-shopify plus a lightweight Node or Python ingestion service is the cheapest at scale and the most controllable, but costs a week of engineering up front.
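If you take the write-it-yourself path, the first gate in the ingestion service is verifying Shopify's webhook signature before anything touches the warehouse. Shopify sends base64(HMAC-SHA256(app secret, raw request body)) in the X-Shopify-Hmac-Sha256 header; a stdlib-only sketch, with a placeholder secret:

```python
# Verify a Shopify webhook signature. Shopify signs the raw request body
# with the app's shared secret and sends the base64 digest in the
# X-Shopify-Hmac-Sha256 header.
import base64
import hashlib
import hmac

def verify_shopify_webhook(raw_body: bytes, header_hmac: str, secret: str) -> bool:
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, header_hmac)

# Simulate what Shopify would send for this body and secret
body = b'{"id": 1001, "total_price": "49.00"}'
secret = "shpss_placeholder"  # placeholder, not a real secret format guarantee
signed = base64.b64encode(hmac.new(secret.encode(), body, hashlib.sha256).digest()).decode()
```

Everything downstream of this check, the BigQuery insert, the GA4 forward, the dedup, trusts the body, so the check has to run against the raw bytes, before any JSON parsing.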

The BigQuery for Shopify data pattern library covers the exact schema you land on, the partitioning and clustering that keeps query cost down, and the dbt model layer that sits on top.

If you are budget-constrained, the DTC data warehouse on $1,000 a month field notes has the actual line items I run into on a mid-market brand. Ingestion is the single biggest cost, not storage or compute.

Stage 3: server-side GA4 so the warehouse is upstream of the dashboard

GA4 as a source is broken for operator use cases. Consent mode gating, bot filtering, cookie loss, and Google's own sampling mean your GA4 numbers are a model of your traffic, not a record of it.

The fix is to send GA4 events server-side via the Measurement Protocol, from the same ingestion service that is writing to the warehouse. GA4 becomes a consumer that gets a lossy copy of your warehouse truth. The warehouse has the full record.
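The shape of that server-side forward is small. A sketch of mapping one warehouse purchase row to a GA4 Measurement Protocol request body; the endpoint and JSON shape follow Google's Measurement Protocol for GA4, while the record fields and IDs are illustrative. Actually sending it is a POST from whatever HTTP client the ingestion service already uses.

```python
# Build a GA4 Measurement Protocol payload from a warehouse purchase record.
import json

# POST target, plus ?measurement_id=...&api_secret=... query params
MP_ENDPOINT = "https://www.google-analytics.com/mp/collect"

def purchase_event(record: dict) -> dict:
    """Map one warehouse purchase row to a Measurement Protocol request body."""
    return {
        "client_id": record["ga_client_id"],  # ties the hit to the GA4 session
        "events": [{
            "name": "purchase",
            "params": {
                "transaction_id": record["order_id"],  # GA4's dedup key
                "value": record["total"],
                "currency": record["currency"],
            },
        }],
    }

body = purchase_event({"ga_client_id": "555.666", "order_id": "1001",
                       "total": 49.0, "currency": "USD"})
payload = json.dumps(body)
```

The important design choice is that this function reads the same record the warehouse writer reads. GA4 can drop or gate the event; the warehouse copy is unaffected.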

This sounds theoretical until the first time a VP of marketing asks why last Tuesday's purchase count dropped 12 percent in GA4 but Shopify shows normal. With warehouse-first, the answer is in one query, not a three-hour investigation.

The server-side GA4 via Measurement Protocol tutorial walks the real setup, including the API key rotation cadence and the consent-aware event filter that keeps you out of legal trouble without also blinding your warehouse.

The same pattern applies to Meta CAPI, covered separately in the field guide to Meta CAPI for DTC operators, which is the sibling hub in the attribution cluster.

Stage 4: modeling, joins, and the first real operator query

Raw data in a warehouse is not useful. It has to be modeled. dbt is the default and it earns its keep here. A minimum modeling layer has three strata.

Staging models rename and standardize raw source tables. stg_shopify__orders maps every field from Shopify's raw schema to a canonical name, handles type casting, and applies the filters Shopify does not do for you (no test orders, no draft orders, currency normalized).

Intermediate models join across sources. int_order_with_sessions joins Shopify orders to GA4 sessions via ga_client_id, with Meta click-ids and Klaviyo email events layered on top. This is where attribution modeling lives.

Mart models answer operator questions. mart_cohort_ltv groups by acquisition month and cohort, computes LTV at 30/60/90/180/365 days, and exposes it at the dashboard layer. One query, runnable from any BI tool.
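To make the shape of mart_cohort_ltv concrete without dbt, here is the same logic sketched in plain Python: first order per customer defines the cohort month, and LTV at each horizon is revenue within that many days of the first order, averaged across the cohort. The column names mirror the article; the tiny dataset is made up, and in production this is one SQL model, not application code.

```python
# Cohort LTV: group customers by acquisition month, compute average
# cumulative revenue at each horizon (days since first order).
from collections import defaultdict
from datetime import date

def cohort_ltv(orders: list, horizons=(30, 90)) -> dict:
    # acquisition date = each customer's first order date
    first_order = {}
    for o in sorted(orders, key=lambda o: o["ordered_at"]):
        first_order.setdefault(o["customer_id"], o["ordered_at"])
    # revenue per customer within each horizon of their first order
    revenue = defaultdict(lambda: defaultdict(float))
    for o in orders:
        age_days = (o["ordered_at"] - first_order[o["customer_id"]]).days
        for h in horizons:
            if age_days <= h:
                revenue[o["customer_id"]][h] += o["total"]
    # average per customer, grouped by acquisition month
    cohorts = defaultdict(lambda: {h: [] for h in horizons})
    for cust, first in first_order.items():
        for h in horizons:
            cohorts[first.strftime("%Y-%m")][h].append(revenue[cust][h])
    return {month: {h: round(sum(v) / len(v), 2) for h, v in hs.items()}
            for month, hs in cohorts.items()}

ltv = cohort_ltv([
    {"customer_id": "a", "ordered_at": date(2026, 1, 5), "total": 40.0},
    {"customer_id": "a", "ordered_at": date(2026, 3, 1), "total": 60.0},  # day 55
    {"customer_id": "b", "ordered_at": date(2026, 1, 20), "total": 100.0},
])
```

Customer a's second order lands past the 30-day window but inside the 90-day one, so the January cohort's 30-day and 90-day numbers diverge, which is exactly the signal the mart exists to surface.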

The attribution modeling in BigQuery SQL tutorial has the actual SQL for a data-driven-lite attribution model that runs against a warehouse shape you already own. It is deliberately simpler than a full Markov chain and it earns its keep on day one.

The cohort LTV from Shopify raw data pattern library is the SQL I keep copying for LTV by cohort, by channel, by product, by promo. If you only get one mart model right, make it this one.

Stage 5: dashboards that read from models, not raw tables

The final stage is dashboards. And this is where most operator efforts stall, because the temptation is to bypass the model layer and read raw Shopify into Looker Studio.

Do not do that. Every dashboard should read from a mart model. If a dashboard needs a number that does not have a model, you add the model, you do not add SQL to the dashboard.

The Looker Studio DTC templates decision log covers the templates worth building yourself, the ones worth buying from the open community, and the ones worth skipping. Looker Studio is the default because it is free and BigQuery-native; Metabase and Hex are live alternatives that pay for themselves at a certain brand scale.

The dashboard design for operators contrarian essay is about what founders actually read versus what dashboards usually show. The answer is uncomfortable and it matters, because a dashboard that nobody reads is worse than no dashboard at all.

Where Klaviyo and reconciliation fit in

Email data is a first-class warehouse source, not an afterthought. The Klaviyo to warehouse ETL pattern library covers three ways to land Klaviyo events in BigQuery. The Klaviyo native BigQuery connector (new in 2025, thin but fast), Fivetran (reliable but pricier), and the raw API with your own ingestion service (cheapest at scale).

Once email lives in the warehouse, flow engagement, click attribution, and the revenue-per-subscriber tables that Klaviyo calls "Dashboards" all become dbt models. The Klaviyo lifecycle playbook for DTC is the cross-cluster sibling hub on the email side.

Reconciliation is the last stop. The reconciling Shopify, GA4, and Meta forensic workflow is the diff-the-numbers field notes I run on every new engagement. Healthy variance bands exist; zero variance almost always means something is broken.

For GA4 migration specifically, which is still the single most common entry point for operators who never did Universal Analytics cleanly, the GA4 migration playbook for DTC covers what breaks in the move and what breaks after it ships.

The actual cost of doing this right

A mid-market DTC brand, warehouse-first, as of April 2026:

  • BigQuery on-demand: $40 to $120 per month in query cost, $20 to $60 in storage
  • Fivetran for Shopify and Klaviyo: $500 to $1,200 per month depending on monthly active rows
  • Looker Studio: free
  • dbt Cloud: $100 per developer per month, or dbt-core self-hosted for free
  • Cloud Run for the ingestion service: $20 to $80 per month

Total: roughly $700 to $1,500 per month for the infrastructure. Plus the one-time build, which is where most of the expense lives. A full rebuild runs 4 to 8 weeks of engineer time for a brand with one Shopify store, one Klaviyo account, one GA4 property, and one or two ad platforms. Multiple stores, headless Hydrogen, or a Recharge subscription layer push that to 8 to 16 weeks.

The warehouse is not a tool, it is a commitment. The operators who win are the ones who stopped asking dashboards to agree and started asking their warehouse the question directly.

What to try this week

Run the diff once. Pull last month's Shopify order count for web orders only, the GA4 purchase event count for the same window, and the Meta purchases count from Ads Manager for the same window. Put them in three columns of a spreadsheet. If they disagree by more than 30 percent on any row, start the warehouse-first conversation. If they agree suspiciously, run the CAPI field guide through your event_id setup first, because zero variance usually means dedup is broken.
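The exercise above fits in a few lines of code. A sketch with Shopify as the reference column; the 30 percent threshold and the zero-variance warning come straight from this article, and the counts are placeholders for your own numbers.

```python
# Three counts in, a verdict out: the "run the diff once" exercise.
def diff_verdict(shopify: int, ga4: int, meta: int) -> str:
    base = shopify  # Shopify web orders as the reference column
    worst = max(abs(base - ga4), abs(base - meta)) / base
    if worst == 0:
        return "suspicious: zero variance, check event_id dedup"
    if worst > 0.30:
        return "start the warehouse-first conversation"
    return f"within healthy variance ({worst:.0%})"

verdict = diff_verdict(shopify=1000, ga4=610, meta=930)
```

A 39 percent gap between Shopify and GA4, as in the placeholder numbers, trips the threshold; identical counts across all three trip the dedup warning instead.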

If the diff looks rough and you do not know where to start, a DTC Stack Audit runs this exact exercise against your live stack and tells you which stage of the sequence is the lowest-hanging rebuild. It is the fastest way to find out whether the next six months of analytics work should be a migration, a rebuild, or a tune-up.


Let us talk

If something in here connected, feel free to reach out. No pitch deck, no intake form. Just a direct conversation.
