The dbt reconciliation model that catches Meta drift

Two months into a Q1 2026 build for a mid-market DTC operator, the analytics team flagged a number that did not feel right. Meta said 1,847 purchases for the week. Shopify said 1,902. GA4 said 1,718. Three platforms, three answers, and the team had been quietly pasting whichever figure made the weekly read best, depending on who was reading it.

That is the moment a reconciliation model earns its keep. Not a one-off SQL diff that lives in someone's Notion, but a daily dbt job that joins all three sources on the only honest grain (order_id), surfaces every flavor of drift, and pages somebody when the gap crosses a threshold you can defend in a meeting.

This piece is the model itself. The schema, the staging layer, the full SQL of fct_purchase_reconciliation, the alert thresholds, and the four classes of bugs the alerts actually catch in production. It plugs into the warehouse-first reporting layer at the diagnostic stage, sits on top of the staging layer for Shopify orders in BigQuery, and pairs with the manual forensic workflow for Shopify, GA4, and Meta that this model automates.

Prerequisites: what you need wired before this model makes sense

The model assumes you already have:

A BigQuery dataset holding raw Shopify orders (via Fivetran, Airbyte, or Hightouch), the GA4 BigQuery export linked at the property level, and a Meta CAPI events log written by your tag manager or a small ingestion service.
A dbt project running on a daily schedule. 06:00 UTC works for most North American DTC operators (after Pacific Time midnight, before the team logs on).
A Slack incoming webhook or PagerDuty service key for the alert tail. The model produces a number; somebody still has to read it.
The staging layer in place for Shopify, at minimum. If you do not have stg_shopify__orders yet, build that first using the pattern in the schema piece linked above. Trying to reconcile against the raw layer will work for a week and then cost you $40 a day in scanned bytes.

If GA4 export is not enabled, stop. The Data API gives you sampled, aggregated numbers and the whole point of this model is unsampled, row-level diffs.

Tight macro of a single ice shard with sharp crystalline fractures, refractive close composition. — // the shard close in · fractures and refraction

Step 1: the join key problem and why order_id is the only honest grain

The first instinct is to join on transaction_id because GA4 has it and Meta has it and Shopify has order numbers. The instinct is wrong. GA4's transaction_id is whatever the developer typed into the dataLayer push. Meta's event_id is whatever the GTM template generated for dedup. Shopify's order_number is human-readable and prefixed.

In every implementation I have audited, at least one of those three is silently truncated, padded, or formatted differently than the other two. They look alike but refuse to join.

“
Join on order_id and accept that everything else lies a little.
”

The honest grain is the Shopify order.id (the numeric one, not order_number). Every other platform either already carries it or can be made to carry it. For GA4, push it as a custom event parameter on the purchase event (most themes already do, sometimes named shopify_order_id). For Meta CAPI, include it in the custom_data.order_id field on the Purchase event payload.

If your stack is already firing CAPI without order_id, fixing that ingest is the first thing to do. Without a stable join key, the rest of this article does nothing.

The dbt model that handles the mapping is small:

-- models/intermediate/int_orders__platform_keys.sql
{{ config(materialized='view') }}

with shopify as (
  select
    id::string as order_id,
    name as order_number,
    created_at as order_created_at_utc,
    total_price_cents,
    currency
  from {{ ref('stg_shopify__orders') }}
  where financial_status = 'paid'
    and source_name = 'web'
)

select * from shopify

This is the spine. Every reconciliation join hangs off it.

Wide atmospheric interior of an icy cave at dusk, soft pink and blue ambient haze across the space. — // the cave at dusk · pink and blue in the haze

Step 2: stage each platform's purchase signal

Each platform gets a staging model that publishes the same five columns: order_id, occurred_at_utc, value_cents, currency, platform_event_id. Same shape, three sources.

Shopify. Already covered above as int_orders__platform_keys. Treat it as the truth column.

GA4 purchase events. Pull from the BigQuery export's events_* tables. Filter to event_name = 'purchase'. The order_id lives in event_params as a custom parameter. Watch for daily partition naming (events_YYYYMMDD and events_intraday_YYYYMMDD for current-day data).

-- models/staging/stg_ga4__purchase.sql
{{ config(
    materialized='incremental',
    partition_by={'field': 'occurred_at_utc', 'data_type': 'timestamp'},
    incremental_strategy='insert_overwrite'
) }}

select
  (select value.string_value
     from unnest(event_params)
     where key = 'shopify_order_id') as order_id,
  timestamp_micros(event_timestamp) as occurred_at_utc,
  cast(ecommerce.purchase_revenue * 100 as int64) as value_cents,
  ecommerce.currency as currency,
  (select value.string_value
     from unnest(event_params)
     where key = 'transaction_id') as platform_event_id
from {{ source('ga4', 'events_*') }}
where event_name = 'purchase'
{% if is_incremental() %}
  and _table_suffix >= format_date('%Y%m%d', date_sub(current_date(), interval 8 day))
{% endif %}

Meta CAPI purchases. This depends on how you are logging CAPI. The cleanest pattern is a small Cloudflare Worker or Vercel function that fans out to Meta and writes a row to BigQuery in the same request. If you are running CAPI through Stape or a similar provider, export to a logs bucket and ingest hourly. Either way, the staging shape is the same.

-- models/staging/stg_meta__capi_purchases.sql
{{ config(
    materialized='incremental',
    partition_by={'field': 'occurred_at_utc', 'data_type': 'timestamp'},
    incremental_strategy='insert_overwrite'
) }}

select
  custom_data_order_id as order_id,
  event_time_utc as occurred_at_utc,
  cast(custom_data_value * 100 as int64) as value_cents,
  custom_data_currency as currency,
  event_id as platform_event_id
from {{ source('meta_capi', 'purchase_events_log') }}
where event_name = 'Purchase'
  and is_test_event = false
{% if is_incremental() %}
  and event_time_utc >= timestamp_sub(current_timestamp(), interval 8 day)
{% endif %}

Three staging models, same five columns. Now they can be reconciled.

Step 3: the reconciliation model, fct_purchase_reconciliation

The fact table full-outer-joins the three staging models on order_id. Anything missing on a side becomes a NULL, and a NULL is a finding.

-- models/marts/fct_purchase_reconciliation.sql
{{ config(
    materialized='incremental',
    partition_by={'field': 'window_date', 'data_type': 'date'},
    incremental_strategy='insert_overwrite'
) }}

with shopify as (
  select * from {{ ref('int_orders__platform_keys') }}
  where order_created_at_utc >= timestamp_sub(current_timestamp(), interval 8 day)
),
ga4 as (
  select * from {{ ref('stg_ga4__purchase') }}
  where occurred_at_utc >= timestamp_sub(current_timestamp(), interval 8 day)
),
meta as (
  select * from {{ ref('stg_meta__capi_purchases') }}
  where occurred_at_utc >= timestamp_sub(current_timestamp(), interval 8 day)
),

joined as (
  select
    coalesce(s.order_id, g.order_id, m.order_id) as order_id,
    s.order_created_at_utc as shopify_at,
    g.occurred_at_utc as ga4_at,
    m.occurred_at_utc as meta_at,
    s.total_price_cents as shopify_value_cents,
    g.value_cents as ga4_value_cents,
    m.value_cents as meta_value_cents,
    s.currency as shopify_currency,
    s.order_id is not null as shopify_present,
    g.order_id is not null as ga4_present,
    m.order_id is not null as meta_present
  from shopify s
  full outer join ga4 g on s.order_id = g.order_id
  full outer join meta m on coalesce(s.order_id, g.order_id) = m.order_id
)

select
  order_id,
  date(coalesce(shopify_at, ga4_at, meta_at)) as window_date,
  shopify_present,
  ga4_present,
  meta_present,
  shopify_value_cents,
  ga4_value_cents,
  meta_value_cents,
  abs(coalesce(shopify_value_cents, 0) - coalesce(meta_value_cents, 0)) as value_diff_cents,
  timestamp_diff(meta_at, shopify_at, second) as meta_time_diff_seconds,
  timestamp_diff(ga4_at, shopify_at, second) as ga4_time_diff_seconds,
  case
    when shopify_present and ga4_present and meta_present
         and abs(coalesce(shopify_value_cents, 0) - coalesce(meta_value_cents, 0)) <= 1
      then 'matched'
    when shopify_present and not meta_present then 'meta_missing'
    when shopify_present and not ga4_present then 'ga4_missing'
    when not shopify_present and meta_present then 'meta_only'
    when not shopify_present and ga4_present then 'ga4_only'
    when abs(coalesce(shopify_value_cents, 0) - coalesce(meta_value_cents, 0)) > 1
      then 'value_mismatch'
    else 'other'
  end as drift_category
from joined

The model materializes as a partitioned table on window_date. A 7-day rolling window keeps it cheap (partition pruning means a daily run scans about 8 days of data, not the whole history).

The seven drift_category values are the entire vocabulary of findings. A row is either matched or it is a specific kind of broken.

Close fragment of split ice with crystalline edges and a deep glow at the break interior. — // fragment of ice · glow inside the break

Step 4: the alert query and threshold logic

The alert is a separate model (or a dbt exposure pointing at a reverse-ETL job) that aggregates the fact table over the last 7 days and pushes a result if any threshold is breached.

-- models/marts/alerts/agg_reconciliation_drift_7d.sql
{{ config(materialized='view') }}

with rollup as (
  select
    drift_category,
    count(*) as row_count,
    sum(value_diff_cents) / 100.0 as value_diff_dollars
  from {{ ref('fct_purchase_reconciliation') }}
  where window_date >= date_sub(current_date(), interval 7 day)
  group by drift_category
),

totals as (
  select sum(row_count) as total_rows
  from rollup
)

select
  r.drift_category,
  r.row_count,
  r.value_diff_dollars,
  safe_divide(r.row_count, t.total_rows) as share_of_total,
  case
    when r.drift_category in ('meta_missing', 'meta_only')
         and safe_divide(r.row_count, t.total_rows) > 0.02
      then 'page'
    when r.drift_category = 'value_mismatch' and r.value_diff_dollars > 100
      then 'page'
    when r.drift_category in ('ga4_missing', 'ga4_only')
         and safe_divide(r.row_count, t.total_rows) > 0.05
      then 'warn'
    else 'ok'
  end as alert_level
from rollup r
cross join totals t
where r.drift_category != 'matched'

Two thresholds. Two percent drift on the Meta side triggers a page (the operator I worked with chose 2 percent because Meta's own reported variance against Shopify hovered around 0.8 percent in steady state; doubling that as the bound left enough headroom for normal noise but caught real regressions). A hundred dollars of cumulative value mismatch in a week triggers a page on its own, regardless of row count, because a single big-ticket order with the wrong value is a bigger deal than ten small ones.

GA4 drift gets a warn, not a page. GA4 will always lose 3 to 6 percent of purchases to consent and ad blockers; that is structural, not a regression. The page is reserved for the Meta side, which is the platform you are buying inventory against.

The dbt exposure that points at the alert:

# models/marts/alerts/exposures.yml
version: 2

exposures:
  - name: reconciliation_drift_alert
    type: notebook
    maturity: high
    description: Daily 7-day rolling drift check across Shopify, GA4, and Meta CAPI.
    depends_on:
      - ref('agg_reconciliation_drift_7d')
    owner:
      name: Data team
      email: data@example.com

A small Python script (run as a scheduled GitHub Action, a Cloud Function, or a Vercel cron) reads from the view and posts to Slack when any row has alert_level = 'page'. That part is plumbing, not modeling, so it lives outside dbt.

Ultra-wide distant shot of an ice cave mouth as a small bright opening in a vast cold field. — // the mouth from afar · one point of light in cold field

Step 5: what the alerts actually catch in practice

Over the first 60 days running this on an analytics engine rebuild I shipped, the alert tail caught four classes of bugs.

A pixel-only purchase the server never saw. The browser pixel fired, GA4 logged the purchase, Shopify wrote the order, but the CAPI request 504'd against the edge function and was never retried. Without this model, that purchase was just missing match quality on the Meta side. With this model, the row showed up in meta_missing and we shipped a retry queue the same week.

A Meta event with a different value than Shopify. A Shopify Plus account had multi-currency on, but the CAPI tag was sending the customer's display currency in the value field while sending USD in the currency code. The drift was 1.18x or 1.25x depending on the customer's locale. value_mismatch rows piled up; the fix was a one-line change in the GTM server-side template.

A GA4 purchase with no Shopify match. Showed up as ga4_only. Almost always a Shopify draft order that the ops team converted to a paid order in admin (which fired the GA4 purchase event via the thank-you page tracking pattern documented in the BigQuery attribution piece) but with source_name = 'shopify_draft_order', which the staging filter excluded. The fix was either to relax the staging filter or to teach the team to use a different conversion path. They chose the latter.

A Shopify order with no Meta or GA4 trace. This is the consent-rejection or ad-block bucket. Expected, structural, never alerts. But over time, watching the share of Shopify orders with zero traces is itself a signal. It crept from 4 percent to 9 percent over a quarter, which surfaced an unrelated issue: the consent banner had been A/B tested into a configuration where 18 percent of users were rejecting all cookies, up from 7 percent. The reconciliation model caught the marketing-platform symptom before the CRO team realized the banner was rejecting more.

Common mistakes

A few of these will save you a Slack thread.

Joining on session_id or click_id. GA4 sessions and Meta clicks do not align with Shopify orders. A single session can produce zero, one, or several orders, and a single order can span sessions. Order_id is the only stable grain.
Reading from GA4's standard reports view instead of the BigQuery export. GA4's UI samples above 10 million events. The export does not. If your alert is sometimes wrong, sampling is the first place to look.
Running the model on the raw layer. A SELECT * against shopify_raw.orders is roughly $0.40 per run on a busy DTC. The reconciliation model runs daily; you want it reading from staging, which is already partitioned and trimmed.
Ignoring meta_time_diff_seconds under 60 seconds. Meta CAPI events fire client-side then server-side, and the server timestamp is often a few hundred milliseconds off Shopify's. That is normal. The alert threshold should ignore time skew under a minute. Drift over 5 minutes is the real signal (it usually means a queued event got delayed).

What to try next

Once the three-source reconciliation is stable, the same pattern extends.

Add Klaviyo Placed Order events as a fourth source if email attribution is meaningful to your business. The staging pattern is identical: a stg_klaviyo__placed_order model that publishes the same five columns, and an additional klaviyo_present boolean in the fact table.

Build a dashboard that plots share_of_total per drift_category over time. The 7-day numbers in the alert are useful for paging; the 90-day trend is what catches slow regressions (a tag manager change six weeks ago that started dropping 0.3 percent more events per week).

Wire the alert to your incident response channel, not your analytics Slack. The whole point is that drift over threshold is an incident, not a number to discuss at the next standup. The first time it pages on a Saturday is when you know the model is earning its keep.

FAQ

Do I need dbt specifically, or can I do this with plain SQL?

Plain SQL works for the reconciliation query itself. dbt buys you incremental materialization, dependency graph, testing, and the exposure pattern that documents the alert. If you are already in dbt, use it. If you are not, a scheduled BigQuery query plus a small webhook script gets you 80 percent of the value with 20 percent of the setup. The model SQL is the same either way.

Why 7 days for the rolling window?

Long enough to smooth out single-day weirdness (a cron miss, a Saturday with low volume), short enough that Meta's attribution windows have closed for the bulk of conversions. Some operators run a parallel 30-day version for quarterly drift trend, but the alert lives on the 7-day window.

What if I do not have a CAPI events log to read from?

You will need one. The cleanest pattern is to write CAPI events to BigQuery in the same function that posts to Meta. If you are using a managed service like Stape, every plan above the free tier exports event logs to a bucket; ingest those hourly. The reconciliation model is only as good as the Meta-side ground truth you can join against.

How much does this cost to run on BigQuery?

On a DTC doing 2,000 orders a week, the daily run scans about 1.2 GB and costs roughly $0.008 per day in on-demand pricing. The expensive part is the initial backfill, which can scan 50 to 200 GB depending on history. Run that once with a tight partition filter, then let incremental do the work.

Can I extend this to subscription orders or recurring billing?

Yes, but the Shopify staging model needs to handle the source_name correctly (Recharge and Bold both fire purchases through different paths). The grain is still order_id; the staging filter is what changes. If subscription is meaningful for your business, the cohort retention pattern in the SQL piece covers the related cohort math.

Sources and specifics

The reconciliation model and alert thresholds described above were deployed during a Q1 2026 analytics rebuild for a mid-market DTC operator.
The 2 percent Meta-drift threshold was chosen against the operator's observed steady-state variance of approximately 0.8 percent. Your number may differ; calibrate against your own historical drift before paging on it.
BigQuery pricing referenced is on-demand at $6.25 per TB scanned as of April 2026. Reserved capacity changes the math substantially.
The four bug classes described in Step 5 are real findings from the first 60 days of the model running daily, not hypothetical examples.
This pattern assumes the GA4 BigQuery export is enabled at the property level. The Data API alone is too aggregated to support row-level diffs.

If you want a pre-engagement diagnostic that confirms whether your stack is wired well enough for this model to be useful, the DTC stack audit tests the join keys, the GA4 export, and the CAPI logging in a single sweep.