Warehouse first attribution for DTC: the reporting layer

A DTC operator I worked with had three dashboards open during our first call. Shopify said the brand had done 1,483 orders the previous week. GA4 said 1,127. The Meta Ads Manager number, after the platform's attribution windows had closed, said 1,841. None of them matched. The team had quietly chosen "1,500-ish" as the number they reported up the chain because nobody could defend any of the three to the CEO.

That gap, the one between three platforms each insisting it has the right answer, is the entire reason the warehouse first attribution DTC pattern exists. The brand was not going to fix the discrepancy by configuring GA4 harder. They were going to fix it by moving the question out of any single platform and into a place where all three platforms became inputs.

This is the tutorial walkthrough for that move. I shipped it most recently inside an analytics suite for a regulated platform in Q1 2026, a multi-service pipeline with separate ingestion, warehouse, API, and web UI layers, feeding six production dashboards. The version before that, the dashboard-v3 build a quarter earlier, used the same pattern at smaller scale. The shape of the rebuild is identical for any Shopify store doing $2-10M and asking the same Monday morning question.

Why DTC reporting that lives in GA4 lies

GA4 was built to count sessions. Orders, payments, customers, refunds, subscription renewals, those are not what GA4 is good at. They are what Shopify is good at. When you let GA4 be the place the operator looks for the weekly revenue number, you have asked a session counter to be a financial system. It will lie to you, and it will lie consistently in the same direction.

The first lie is sampling. GA4 silently samples reports above a complexity threshold, and the UI does not always tell you when a result is sampled. The unsampled API export exists, but it requires a daily pipeline to land into a place you control, which is the very thing operators avoid because GA4 was supposed to be the easy option.

The second lie is consent mode v2. Consent mode is doing exactly what it should: respecting user choices. The downstream effect is that GA4 sees a smaller share of conversions every quarter as more browsers default to consent gates, and the gap between "what GA4 reports" and "what actually happened" widens silently. Your warehouse, fed from Shopify webhooks, does not have the same gating problem because Shopify records the order regardless of whether the customer accepted analytics cookies.

The third lie is attribution windows. GA4's session-scoped model is a different shape than Meta's view-through window, which is again a different shape than Shopify's last-click landing_site_ref. Each platform has its own definition of "where the order came from", each gives a different number, and the drift compounds. Twelve months in, the team is reporting numbers nobody can audit and the CEO has stopped trusting the deck.

Close-up fragment of a translucent slab on a dark studio backdrop, single chipped corner catching cold blue rim light and faint pink dispersion. — // the fragment · chipped corner under cold rim light

Shopify orders as the source of truth

Shopify records what actually happened. A payment was processed. A refund was issued. Those events have an irrefutable financial trail, and the order objects in the Shopify Admin API are the only DTC dataset where the count is the count.

That is the whole reason Shopify becomes the anchor in this pattern. Every other platform reports what it claims credit for, what it observed, what it modeled. Shopify reports what was settled by the payment processor.

“
Every other platform reports what it claims credit for. Shopify reports what was settled.
”

The Shopify ingestion has two halves. Real-time webhooks for low-latency dashboards (orders/create, orders/updated, orders/paid, refunds/create, plus the customer events) and a nightly bulk export for whatever fell through the webhook cracks. I have caught webhook gaps as wide as 0.4% of orders on a single quiet Tuesday because of platform restarts on Shopify's side. The nightly reconcile catches them.

The fields that survive a Shopify replatform are the ones to anchor on. order_id, order_number, created_at, processed_at, total_price, currency, financial_status, customer_id, email_hashed, source_name, landing_site_ref, plus the line items and discounts arrays. Skip the per-app metafields and connector-specific blobs unless you are sure they will outlive the connector. They almost never do.

Shopify does not know which Meta ad the customer saw, or which GA4 channel grouping they came from, or how much you spent on the campaign that drove the order. It knows the click-through landing_site_ref and the UTM string at first session, and those are useful, but they are not attribution. Attribution gets built in the warehouse, on top of Shopify orders joined to ad platform spend.

Ultra-wide distant view of a solitary translucent monolith on a dark plain at dusk, vanishing horizon, electric-blue atmospheric haze, single hot-pink rim on the far edge. — // the monolith · solitary form against vanishing horizon

GA4 and Meta as diagnostics, not sources

Once Shopify is the anchor, GA4 and Meta stop being the source of truth and become diagnostic instruments. That reframe is most of the work. The marketing team will resist it for about a week, because GA4 and the Ads Manager are where they live, and you are asking them to live somewhere else.

GA4 is genuinely good at two things. Site behavior, like which pages, which scroll depths, which session paths convert. And channel groupings at the session level, which give a useful directional read on where traffic is coming from before consent gating distorts it. Use GA4 for those and stop asking it for revenue.

Meta is genuinely good at two things too. The match quality of your CAPI integration, which is a real technical signal you cannot get anywhere else, and the ad-level performance metrics, which Meta needs to optimize delivery against. Use Meta for those.

The questions neither platform can answer are the ones the operator asks every week. What is the LTV of the customer cohort acquired from the September Meta flight, twelve months in? What is the blended CAC after refunds and chargebacks net out? Which discount code drove the highest margin orders, not just the highest revenue orders? How does subscription churn vary by acquisition source? Which paid channel actually contributed incrementally last quarter, versus claimed credit for organic conversions? What is the gross margin per order, after Shopify fees, payment processing, and shipping?

None of those live inside any single platform. They live in joins, and joins happen in a warehouse.

BigQuery (or Snowflake or Postgres) as the reconciliation layer

The warehouse choice matters less than people think. BigQuery is my default because the on-demand pricing model is friendly to mid-market DTC. As of April 2026, BigQuery on-demand is $6.25 per TB scanned, which means the quarterly warehouse bill for a $2-10M Shopify brand stays under $100 if the schema is partitioned and clustered correctly. Snowflake is the right pick if the team already has Snowflake, or if multi-cluster compute is in scope. Postgres is fine if the team has a Postgres operator and the data volume stays under low billions of rows. The pattern is identical across the three.

The schema has three layers, and skipping a layer is the most common mistake.

Raw. The ingestion connector lands data exactly as the source returns it. Raw Shopify, raw GA4 export, raw Meta API. No transformation. This layer is for replay and for trust. If something downstream looks wrong, you walk back to raw and find out which step lied.

Staging. Raw becomes typed, named, filtered. shopify_raw.orders becomes stg_shopify__orders with NUMERIC price columns instead of strings, canonical column names, no test orders, no draft orders, no POS orders unless explicitly included. dbt is the standard here for a reason: the staging layer is where most of the modeling work lives, and dbt's tests catch schema drift before it lands in a dashboard.

Mart. Joined, modeled, business-aware. mart_orders_with_attribution, mart_customer_cohorts, mart_subscription_lifecycle. This is the layer dashboards read from. Dashboards never read staging directly, and they absolutely never read raw. The reason is simple: when business logic changes, you change it in one mart model, not in fourteen dashboards.

The five tables I always start with: stg_shopify__orders, stg_shopify__customers, stg_shopify__line_items, stg_ga4__events_purchase, stg_meta__ads_insights_daily. Everything else gets added when an operator asks a question those five cannot answer. The schema details for landing Shopify into BigQuery specifically are in the BigQuery Shopify schema pattern, which is the partitioning and cost math on top of this layer.

Macro shot of crystalline edge dispersion on a single glass slab, refractive bands of cold blue splitting into hot-pink at the chipped corner, fine surface detail. — // the edge · refractive bands and dispersion close up

The warehouse first attribution DTC sequence

The rebuild order matters. Get it wrong and the next stage fails in expensive ways. This is the sequence I run, in order, on every engagement that starts from "we have GA4 and we cannot get a number we trust out of it."

Stage 1: land Shopify orders raw, daily. Webhook + nightly reconcile. The first deliverable is a query that counts Shopify orders for any given window and matches the count in the Shopify admin to the row. If those two numbers do not match, nothing downstream is allowed to proceed. This usually takes one to three days, end to end, including connector setup.

Stage 2: land GA4 export and Meta API daily. GA4 BigQuery export is configured at the property level and lands raw events into BigQuery automatically. Meta requires either a connector (Fivetran, Stitch, or a custom Python job) or the Meta Marketing API direct, depending on how custom the spend dimensions need to be. The deliverable here is a daily batch that is idempotent: you can re-run it for any historical date and get identical results. One week.

Stage 3: build the reconciliation view. This is the one query the team will run every Monday morning. It joins Shopify orders, GA4 purchase events, and Meta reported conversions, side by side, with the discrepancy column computed inline. The forensic workflow for this query is in the field notes on reconciling Shopify, GA4, and Meta, which is the order-of-operations I run before the rebuild even starts. Two days.

Stage 4: build the operator question views. This is where mart models start to earn their keep. The first three are usually cohort LTV by acquisition channel, subscription retention curves, and margin per order, which between them answer most of the weekly operator questions. The SQL pattern for cohort LTV from Shopify raw data is in the cohort LTV walkthrough, which is the version I copy on every new build. One week.

Stage 5: dashboards read from mart, never from raw. Looker Studio, Metabase, Hex, internal Next.js, whichever the team will actually open. The dashboard layer is dumb on purpose: it formats and visualizes, it does not compute. When a number is wrong, you fix it in the mart model and every dashboard updates at the same time. Two days.

The full hub article on the DTC warehouse first analytics rebuild covers the same sequence at the strategic layer; this piece is the operator-rebuild order.

-- The reconciliation view, simplified to its core shape
WITH shopify_orders AS (
  SELECT
    DATE(processed_at, 'America/Los_Angeles') AS order_date,
    COUNT(DISTINCT order_id) AS shopify_orders,
    SUM(total_price) AS shopify_revenue
  FROM `project.staging.stg_shopify__orders`
  WHERE financial_status = 'paid'
    AND source_name = 'web'
    AND processed_at BETWEEN @start AND @end
  GROUP BY order_date
),
ga4_purchases AS (
  SELECT
    DATE(event_timestamp_pst) AS order_date,
    COUNT(DISTINCT transaction_id) AS ga4_orders,
    SUM(purchase_revenue) AS ga4_revenue
  FROM `project.staging.stg_ga4__events_purchase`
  WHERE event_timestamp_pst BETWEEN @start AND @end
  GROUP BY order_date
),
meta_reported AS (
  SELECT
    date AS order_date,
    SUM(actions_purchase) AS meta_orders,
    SUM(action_values_purchase) AS meta_revenue
  FROM `project.staging.stg_meta__ads_insights_daily`
  WHERE date BETWEEN @start AND @end
  GROUP BY order_date
)
SELECT
  s.order_date,
  s.shopify_orders,
  g.ga4_orders,
  m.meta_orders,
  s.shopify_orders - g.ga4_orders AS ga4_gap,
  m.meta_orders - s.shopify_orders AS meta_overcount,
  s.shopify_revenue,
  g.ga4_revenue,
  m.meta_revenue
FROM shopify_orders s
LEFT JOIN ga4_purchases g USING (order_date)
LEFT JOIN meta_reported m USING (order_date)
ORDER BY s.order_date;

That single view answers the Monday morning question. Shopify is the anchor column. GA4 and Meta sit next to it as diagnostics. The gap columns make the drift visible instead of hidden.

What Monday morning looks like when this works

The Q1 2026 build I shipped at the regulated DTC platform sends a discrepancy email at 06:00 Pacific every weekday. It contains one number, three diagnostics, and a flag column. The number is yesterday's Shopify revenue. The three diagnostics are GA4 reported, Meta reported, and the gap percentage to each. The flag column is "investigate" or "ignore" based on whether the gap exceeded the rolling four-week standard deviation.

In a normal week, the email is six lines and gets glanced at over coffee. The team knows the GA4 gap is around twelve percent and stable, knows the Meta overcount is around eight percent and stable, knows revenue is what Shopify says, and the meeting moves on.

The week the gap blows out is the week the email earns its keep. The cause is usually one of the same handful of things, a Meta CAPI version change, a GA4 gtag that broke after a theme update, or a consent mode misconfiguration in a new geo. The discrepancy view tells you which platform drifted, the rest of the warehouse tells you why, and the fix lands the same day instead of a quarter later.

The conversation that disappears is the one the operator hated. "Why does GA4 disagree with Shopify." The answer is a chart in the Monday email, and the chart is updated automatically. The CEO stops asking the question, the marketing team stops defending GA4, and the deck reports the Shopify number every week with a paper trail anyone can audit.

Ultra-wide low angle of a translucent slab reflected in a still dark pool, twin pink-and-blue light columns mirrored, magic-hour palette. — // the mirror · twin columns reflected in still water

Common ways the rebuild stalls

Trying to model attribution before the warehouse is loaded. The operator wants to see the cohort LTV chart on day one, before the Shopify webhook is even firing. Resist this. The order matters. Modeling on top of an unstable raw layer means rewriting the model every time the connector schema drifts. Land raw first, get the count to match Shopify admin, then model.

Picking a BI tool before the schema is stable. The team will want to evaluate Looker Studio versus Metabase versus Hex in week one. Defer the decision until the mart layer exists, because the right BI tool depends on what the mart looks like and which questions are most frequent. Building dashboards against staging or raw creates pressure to lock in business logic before it is validated. Pick the BI tool last.

Letting the marketing team write SQL against raw tables. They will if you let them. The result is a hundred slightly different definitions of "purchase" floating around in saved queries, none of which will match the official mart number, and the meeting where someone discovers this becomes a half-day fire. Lock raw and staging behind grants that only the analytics or engineering owner has. The marketing team writes against mart views, period.

Treating the warehouse as a backup of GA4. The temptation is to use the warehouse as a place to put GA4 data so it is "safe", and then to keep using GA4 for everything. That arrangement is just a GA4 stack with a backup hanging off it, not a warehouse first stack. The whole point of warehouse first is that the operator question moves out of GA4 and into SQL. If the team is still opening GA4 to answer revenue, the rebuild has not finished.

A close cousin is the Looker Studio dashboard built directly on top of the GA4 BigQuery export with no Shopify join. The gaps and the consent gating are still in the data, the dashboard is answering the wrong question, and the CEO is still asking why the number changed.

FAQ

How long does a warehouse first attribution DTC rebuild actually take

From engagement start to a working Monday morning email, two to three weeks for a $2-10M Shopify brand. Stage 1 is one to three days, stage 2 is a week, stages 3 and 4 are a week combined, stage 5 is two days. The Q1 2026 build I shipped at a regulated DTC platform took longer because it was a multi-service pipeline with six dashboards, but the core reconciliation view was running by end of week two.

Why BigQuery instead of Snowflake or Postgres

BigQuery is my default for mid-market DTC because the on-demand pricing model fits the workload. A $2-10M Shopify brand running this stack scans on the order of a few hundred GB per month, and at $6.25 per TB, that keeps the warehouse bill under $100 quarterly. Snowflake makes more sense at scale or when the team already has it. Postgres works fine for low data volume and a team that has a Postgres operator. The pattern in this article is identical across the three; only the cost math changes.

Do I still need GA4 if Shopify is the source of truth

Yes. GA4 is doing two things you cannot replace easily. It is reading site behavior at the session level, which Shopify does not see, and it is feeding the channel grouping that gives a directional read on where traffic is coming from. The reframe is that GA4 is a diagnostic input, not the answer. Keep it. Stop asking it for revenue.

Will the marketing team buy into this

They will resist for about a week and then they will love it. The pattern that flips them is the Monday morning email with one number and the gap columns. After the second week of "yesterday's revenue is X, here is what each platform reported, here is what to investigate", the conversation about which dashboard to trust ends. The team gets their time back.

What about Meta CAPI in this stack

Meta CAPI is upstream of the warehouse. It is the integration that makes Meta's reported numbers more accurate, which makes the diagnostic input more useful. CAPI does not replace the warehouse, and the warehouse does not replace CAPI. They sit at different layers. The warehouse first stack assumes CAPI is configured correctly; the diagnostic value of Meta's column in the reconciliation view depends on it.

Can I run this on Postgres instead of a cloud warehouse

Yes for low data volume. A $2-10M Shopify brand can fit Shopify orders, GA4 exports, and Meta daily insights into a Postgres instance comfortably for the first year. The pattern is the same: raw schema, staging schema, mart schema, dbt on top. BigQuery becomes the right pick at scale because columnar storage and on-demand pricing handle wide aggregate queries cheaply.

Sources and specifics

The Q1 2026 build referenced was a multi-service data pipeline with separate ingestion, warehouse, API, and web UI layers, feeding six production dashboards across commerce, web, ads, and email data sources, anonymized as a regulated DTC platform.
The dashboard-v3 predecessor proved the pattern at smaller scale: five dashboards, sixty-second auto-refresh, ninety percent reduction in API call volume through smart caching.
BigQuery on-demand pricing was $6.25 per TB scanned as of April 2026 (cloud.google.com/bigquery/pricing). Pricing changes; verify before quoting cost numbers to the CFO or the founder.
The reconciliation SQL pattern in this article runs against staging-layer dbt models. The full schema for landing Shopify raw data into BigQuery is in the BigQuery Shopify schema pattern.
The forensic order-of-operations for diagnosing platform drift, the step that runs before this rebuild even starts, is in the field notes on reconciling Shopify, GA4, and Meta.
The pre-rebuild diagnostic that catches most of the issues this stack solves is the DTC stack audit, which surfaces the platform discrepancies before the engagement starts.