Attribution modeling in BigQuery: the SQL I keep copying

Most attribution debates in DTC are theological. Last-click undercounts top-of-funnel, first-click overcounts discovery, data-driven is a black box, and Markov chains are academically elegant and operationally useless.

The pragmatic answer is "run five models, compare, report the range." This article is the SQL I keep copying into every warehouse-first engagement to do that. Last-click, first-click, linear, time-decay, and position-based. All five run against a standard warehouse shape in under a minute on a mid-market DTC brand.

Fits into the warehouse-first analytics rebuild hub at stage 4, the modeling layer. Assumes you have already landed Shopify and marketing-source data in BigQuery via the three-layer schema pattern.

The shape the SQL expects

Every attribution query below assumes two tables:

int_touchpoints has one row per marketing touchpoint, with columns customer_id, occurred_at, channel (organic, paid_search, paid_social, email, direct, etc.), campaign_id, and touchpoint_id. This is built from a union of GA4 sessions, Meta click data, Klaviyo email clicks, and Google Ads click data.

int_orders has one row per completed Shopify order, with customer_id, order_id, occurred_at, and revenue.

The attribution models join these two tables on customer_id with the touchpoints that preceded the order in some lookback window (usually 30 days).

The canonical lookback window join:

WITH orders_with_touches AS (
  SELECT
    o.order_id,
    o.customer_id,
    o.occurred_at AS order_at,
    o.revenue,
    t.touchpoint_id,
    t.channel,
    t.campaign_id,
    t.occurred_at AS touch_at,
    TIMESTAMP_DIFF(o.occurred_at, t.occurred_at, DAY) AS days_before_order
  FROM {{ ref('int_orders') }} o
  LEFT JOIN {{ ref('int_touchpoints') }} t
    ON o.customer_id = t.customer_id
    AND t.occurred_at <= o.occurred_at
    AND t.occurred_at >= TIMESTAMP_SUB(o.occurred_at, INTERVAL 30 DAY)
)
SELECT * FROM orders_with_touches

This is the base CTE every model below uses. It is one line of real logic: touches within 30 days before an order, joined by customer.

Close-up macro of a single weathered stone column at dusk showing fine erosion patterns, mineral deposits, and ancient weathering. — // the weathering · single column close up

Model 1: last-click

The simplest. Credit goes to the last touch before the order.

WITH last_click AS (
  SELECT
    order_id,
    customer_id,
    channel,
    campaign_id,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY touch_at DESC
    ) AS rn
  FROM orders_with_touches
)
SELECT
  channel,
  COUNT(DISTINCT order_id) AS orders,
  SUM(revenue) AS attributed_revenue
FROM last_click
WHERE rn = 1
GROUP BY channel
ORDER BY attributed_revenue DESC;

Run this on a mid-market DTC brand and paid_social usually dominates. That is last-click's bias: it rewards the channel closest to checkout, which is usually where the retargeting lives.

Model 2: first-click

Credit to the first touch in the lookback window.

WITH first_click AS (
  SELECT
    order_id,
    channel,
    campaign_id,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY touch_at ASC
    ) AS rn
  FROM orders_with_touches
)
SELECT
  channel,
  COUNT(DISTINCT order_id) AS orders,
  SUM(revenue) AS attributed_revenue
FROM first_click
WHERE rn = 1
GROUP BY channel
ORDER BY attributed_revenue DESC;

First-click rewards discovery. Paid_social still often wins for DTC because it is the most common first-touch channel, but organic search moves up meaningfully compared to last-click.

Ultra-wide looking down the length of the colonnade at dusk, dark weathered columns receding to a vanishing point with dramatic leading lines. — // down the line · columns to a vanishing point

Model 3: linear

Credit split evenly across all touches in the window.

WITH touch_counts AS (
  SELECT
    order_id,
    COUNT(*) AS total_touches
  FROM orders_with_touches
  GROUP BY order_id
),
linear AS (
  SELECT
    o.order_id,
    o.channel,
    o.revenue / tc.total_touches AS attributed_revenue
  FROM orders_with_touches o
  JOIN touch_counts tc USING (order_id)
)
SELECT
  channel,
  SUM(attributed_revenue) AS attributed_revenue
FROM linear
GROUP BY channel
ORDER BY attributed_revenue DESC;

Linear is the model operators intuitively trust. It does not reward any one channel; it spreads credit. Good for comparing "which channels show up in the journey" rather than "which channel closed."

Model 4: time-decay

Recent touches get more credit. I use a 7-day half-life, which is Meta's default.

WITH weighted AS (
  SELECT
    order_id,
    channel,
    campaign_id,
    revenue,
    POW(0.5, days_before_order / 7.0) AS weight
  FROM orders_with_touches
),
normalized AS (
  SELECT
    order_id,
    channel,
    revenue,
    weight,
    SUM(weight) OVER (PARTITION BY order_id) AS total_weight
  FROM weighted
)
SELECT
  channel,
  SUM(revenue * (weight / total_weight)) AS attributed_revenue
FROM normalized
WHERE total_weight > 0
GROUP BY channel
ORDER BY attributed_revenue DESC;

Time-decay is a reasonable compromise between last-click and linear. The 7-day half-life means a touch 7 days before the order gets half the credit of a touch at the moment of purchase. Adjust the half-life (14, 30 days) based on your category's typical consideration window.

Close-up of the base of a single weathered stone column at dusk where the column meets the cracked stone floor. — // where column meets floor · the seam

Model 5: position-based (U-shaped)

40 percent to first touch, 40 percent to last touch, 20 percent split evenly across middle touches.

WITH ordered_touches AS (
  SELECT
    order_id,
    channel,
    touch_at,
    revenue,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY touch_at ASC) AS rn_asc,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY touch_at DESC) AS rn_desc,
    COUNT(*) OVER (PARTITION BY order_id) AS total_touches
  FROM orders_with_touches
),
positioned AS (
  SELECT
    order_id,
    channel,
    revenue,
    CASE
      WHEN total_touches = 1 THEN 1.0
      WHEN total_touches = 2 AND rn_asc = 1 THEN 0.5
      WHEN total_touches = 2 AND rn_desc = 1 THEN 0.5
      WHEN rn_asc = 1 THEN 0.4
      WHEN rn_desc = 1 THEN 0.4
      ELSE 0.2 / (total_touches - 2)
    END AS weight
  FROM ordered_touches
)
SELECT
  channel,
  SUM(revenue * weight) AS attributed_revenue
FROM positioned
GROUP BY channel
ORDER BY attributed_revenue DESC;

Position-based is what Google Ads used to call "U-shaped." It acknowledges that discovery and conversion are the two moments that matter and middle touches are noise. For DTC with moderate consideration windows, this is often my default.

Single tall weathered stone column at dusk filling the frame from base to capital, vertical composition. — // the column rising · vertical presence

The comparison view

The payoff is running all five side by side. The operator question is not "which model is right" but "how much does the ranking change across models?" If paid_social is top in every model, it is genuinely the top channel. If it is top in last-click and fifth in first-click, the ranking is a last-click artifact.

WITH last_click AS ( ... ),
first_click AS ( ... ),
linear AS ( ... ),
time_decay AS ( ... ),
position_based AS ( ... )

SELECT
  channel,
  ROUND(lc.revenue, 0) AS last_click,
  ROUND(fc.revenue, 0) AS first_click,
  ROUND(ln.revenue, 0) AS linear,
  ROUND(td.revenue, 0) AS time_decay,
  ROUND(pb.revenue, 0) AS position_based
FROM last_click lc
FULL OUTER JOIN first_click fc USING (channel)
FULL OUTER JOIN linear ln USING (channel)
FULL OUTER JOIN time_decay td USING (channel)
FULL OUTER JOIN position_based pb USING (channel)
ORDER BY linear DESC;

The output is a small table, typically 5 to 10 rows by 6 columns. This is the table that actually drives operator decisions, and it is the one I pin into the Looker Studio exec dashboard.

Cost check

For a mid-market DTC brand with 10K orders per month, ~40K touchpoints per month, on partitioned and clustered tables:

Each model above scans ~50 to 100 MB of data.
Running all five in a single query: ~500 MB.
At $6.25 per TB, each full comparison costs about $0.003.
Running daily: $0.10 per month. Free under the 1 TB monthly free tier.

Attribution math is not the expensive part of a warehouse. Ingestion is. See the BigQuery for Shopify data article for the partitioning pattern that makes this cheap.

Entire colonnade structure at dusk seen from a distant vantage, the diagonal line of columns visible against the gradient sky. — // the full structure · diagonal line at dusk

The thing the SQL does not solve

All five models above are click-based (or event-based, for email). None of them model view-through, which is where Meta CAPI, cross-device, and privacy-compliant modeling come in. For that, see the field guide to Meta CAPI for DTC operators, which is the sibling hub on the attribution side.

The pattern that works for DTC: run all five models in SQL against warehouse data for baseline decisions, then layer incrementality tests (holdout groups, geo-split) on the 2 or 3 channels that matter most for validation. Skip Markov chains unless you have a legitimate reason; they are academically elegant and rarely operationally useful at mid-market scale.

“
Attribution is a range, not a number. The operators who win are the ones who stopped looking for the "right" model and started reporting the spread.
”

FAQ

Which attribution model should I report to the CEO?

Report linear or position-based as the "default view," and show the range across all five as context. Last-click alone is misleading; first-click alone is overgenerous to discovery. Linear and position-based are the two most defensible for cross-channel comparison.

Why is my last-click attribution so skewed toward paid social?

Because paid social is your retargeting channel. Retargeting lives at the bottom of the funnel and gets the last-click by construction. First-click will show you what is actually driving discovery; the gap between first and last is the "retargeting tax" the model is paying.

Do I need GA4's data-driven attribution?

No. Data-driven attribution in GA4 is a black box Shapley-value model that is hard to audit. The five models above are transparent, run against your warehouse, and you can change the rules. Use GA4 DDA as a sanity check, not as the source of truth.

What lookback window should I use?

30 days is the default that fits most DTC consideration windows. Short-consideration categories (impulse beauty, commodity food) work fine at 14 days. Long-consideration (furniture, premium wellness) need 60 to 90. Whatever you pick, apply it consistently across all models.

How do I handle multi-touch from the same channel?

Treat each touch as a separate row in int_touchpoints. Models handle them naturally; last-click only credits the last paid_social touch, linear spreads across all, etc. Do not deduplicate touches before the model runs; deduplication belongs at the source-table level, not at the attribution-model level.

Pure-atmosphere of the gradient pink and blue dusk sky stretching across the entire frame at last light, no columns in frame. — // just sky · pink and blue gradient, no subject

What to try this week

Drop the base CTE into a BigQuery query tab. Substitute your real table names for int_orders and int_touchpoints. Run the last-click and linear queries side by side. If last-click shows one channel dominating and linear shows a much flatter distribution, the last-click number is a construct, not a truth.

If you do not have int_touchpoints yet because you have not unified your marketing-source data, that is the deeper blocker. A DTC Stack Audit scopes the warehouse rebuild needed to support attribution modeling against unified data.

Sources and specifics

The five-model comparison is the industry-standard set for multi-touch attribution; Google Ads and Adobe both document similar sets.
Time-decay half-life of 7 days matches Meta's default conversion window.
Position-based (U-shaped) 40/40/20 weighting is the Google Analytics legacy default from Universal Analytics.
Cost estimates assume partitioned and clustered tables per BigQuery for Shopify data.
Pattern drawn from a Q1 2026 analytics engine rebuild for a mid-market DTC operator.