
2026-04-23 / 9 MIN READ

Cohort LTV from Shopify raw data: SQL patterns that hold up

Pattern library for cohort LTV in BigQuery against Shopify raw data. By acquisition month, by channel, by product, with retention curves and a revenue mart.

Cohort LTV is the single most valuable mart model in a DTC warehouse. It is also the one most commonly built wrong. Teams ship "LTV by channel" dashboards that average across all customers, miss the compounding effect of time, and report a single number that hides a 4x spread between cohorts.

This is the pattern library I copy into every warehouse-first engagement. Cohort LTV by acquisition month, by channel, by first-product, by promo code. Plus the retention curves that make the numbers meaningful. All SQL, all runnable against a BigQuery schema for Shopify data.

This pattern library sits at stage 4 (modeling) of the warehouse-first analytics rebuild hub, and it is the first mart model I build on every new engagement.

[Chart: cohort LTV, USD per first-buyer, at days 30/60/90/180/365 — the 2025-04 cohort climbs $48 → $64 → $79 → $102 → $138, shown against the all-cohorts average; zero means the cohort is too new to measure.]

LTV curves compound. Reporting a single number hides a 20-50% spread across cohorts.

The base CTE every LTV model uses

Start with a customer-level table that has acquisition date, acquisition channel, and total revenue as of each time horizon.

WITH customer_first_order AS (
  SELECT
    customer_id,
    MIN(created_at) AS first_order_at,
    -- The channel of the first order is the cohort
    ARRAY_AGG(source_name ORDER BY created_at ASC LIMIT 1)[OFFSET(0)]
      AS acquisition_channel,
    -- The first line item of the first order, for product-cohort analysis.
    -- Assumes line_items is ARRAY<STRUCT>; take element 0 of the earliest order
    ARRAY_AGG(
      line_items[SAFE_OFFSET(0)].product_id IGNORE NULLS
      ORDER BY created_at ASC LIMIT 1
    )[OFFSET(0)] AS first_product_id
  FROM {{ ref('stg_shopify__orders') }}
  WHERE financial_status = 'paid'
  GROUP BY customer_id
),
customer_revenue AS (
  SELECT
    c.customer_id,
    c.first_order_at,
    c.acquisition_channel,
    c.first_product_id,
    DATE_TRUNC(DATE(c.first_order_at), MONTH) AS cohort_month,
    o.order_id,
    o.created_at AS order_at,
    TIMESTAMP_DIFF(o.created_at, c.first_order_at, DAY) AS days_since_first,
    o.total_price
  FROM customer_first_order c
  LEFT JOIN {{ ref('stg_shopify__orders') }} o
    ON c.customer_id = o.customer_id
    AND o.financial_status = 'paid'
)
SELECT * FROM customer_revenue

Every cohort LTV query below reads from this CTE. The shape is one row per (customer, order), with the customer's acquisition metadata attached. Cohort LTV models are aggregations on this CTE, nothing more.
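Before building on it, a quick grain check is worth running (a sketch; it reads straight from the CTE above):

-- Grain check for customer_revenue: one row per (customer, paid order).
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT customer_id) AS customers,
  COUNTIF(days_since_first < 0) AS rows_before_first  -- should always be 0
FROM customer_revenue;

If rows_before_first is anything but zero, the first-order logic is wrong and every downstream LTV number inherits the error.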

Cohort LTV by acquisition month, at multiple horizons

The canonical cohort LTV query. Customers grouped by the month they first purchased, with cumulative revenue at 30, 60, 90, 180, and 365 days.

SELECT
  cohort_month,
  COUNT(DISTINCT customer_id) AS cohort_size,
  ROUND(SUM(CASE WHEN days_since_first <= 30 THEN total_price END)
    / COUNT(DISTINCT customer_id), 2) AS ltv_day_30,
  ROUND(SUM(CASE WHEN days_since_first <= 60 THEN total_price END)
    / COUNT(DISTINCT customer_id), 2) AS ltv_day_60,
  ROUND(SUM(CASE WHEN days_since_first <= 90 THEN total_price END)
    / COUNT(DISTINCT customer_id), 2) AS ltv_day_90,
  ROUND(SUM(CASE WHEN days_since_first <= 180 THEN total_price END)
    / COUNT(DISTINCT customer_id), 2) AS ltv_day_180,
  ROUND(SUM(CASE WHEN days_since_first <= 365 THEN total_price END)
    / COUNT(DISTINCT customer_id), 2) AS ltv_day_365
FROM customer_revenue
WHERE cohort_month >= DATE_SUB(CURRENT_DATE(), INTERVAL 18 MONTH)
GROUP BY cohort_month
ORDER BY cohort_month;

The output is a monthly table with five LTV columns. Read it top to bottom and you see the cohort trajectory; read it column by column and you see how LTV at any horizon has shifted month over month.

Two operator questions this answers immediately. "Did our Q3 acquisition cohort monetize better than Q2?" Compare the 90-day LTV column across months. "Is our 365-day LTV still growing or has it plateaued?" Compare each mature cohort's 180-day LTV to its eventual 365-day LTV; spotting a trend requires at least 18 months of history.
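One way to make the plateau question concrete is a maturation ratio, computed only on cohorts old enough to have a closed 365-day window (a sketch against the customer_revenue CTE; the 13-month cutoff is the minimum that closes the window):

-- Maturation ratio: how much LTV each mature cohort added between
-- day 180 and day 365. A ratio drifting toward 1.0 means late
-- revenue is drying up.
SELECT
  cohort_month,
  ROUND(
    SUM(CASE WHEN days_since_first <= 365 THEN total_price END)
    / NULLIF(SUM(CASE WHEN days_since_first <= 180 THEN total_price END), 0),
    2) AS ltv_365_over_180
FROM customer_revenue
WHERE cohort_month <= DATE_SUB(CURRENT_DATE(), INTERVAL 13 MONTH)
GROUP BY cohort_month
ORDER BY cohort_month;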

Cohort LTV by acquisition channel

Same shape, split by channel.

SELECT
  cohort_month,
  acquisition_channel,
  COUNT(DISTINCT customer_id) AS cohort_size,
  ROUND(SUM(CASE WHEN days_since_first <= 90 THEN total_price END)
    / COUNT(DISTINCT customer_id), 2) AS ltv_90,
  ROUND(SUM(CASE WHEN days_since_first <= 365 THEN total_price END)
    / COUNT(DISTINCT customer_id), 2) AS ltv_365
FROM customer_revenue
WHERE cohort_month >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
  AND acquisition_channel IN ('web', 'google', 'facebook', 'klaviyo', 'email')
GROUP BY cohort_month, acquisition_channel
ORDER BY cohort_month, ltv_365 DESC;

This is the query that usually changes how the ad team thinks about channel quality. Paid social often wins on last-click revenue but loses on 90-day LTV, because retargeting-heavy channels tend to acquire buyers who rarely come back.

The acquisition_channel column here is a proxy; for real multi-touch attribution of the acquisition, join this cohort query against the attribution modeling SQL output.
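A sketch of that join, assuming the attribution model emits one row per customer and channel with a fractional credit (mart_attribution_credit and its column names are hypothetical; substitute whatever your attribution output exposes):

-- Re-weight cohort revenue by multi-touch credit instead of
-- the first order's source_name.
SELECT
  cr.cohort_month,
  attr.channel,
  ROUND(SUM(cr.total_price * attr.credit)
    / COUNT(DISTINCT cr.customer_id), 2) AS credit_weighted_ltv
FROM customer_revenue cr
JOIN mart_attribution_credit attr  -- hypothetical: customer_id, channel, credit
  ON cr.customer_id = attr.customer_id
GROUP BY cr.cohort_month, attr.channel;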

Cohort LTV by first product

The question every merch team wants answered. Do customers who first buy product X have higher or lower LTV than customers who first buy product Y?

WITH product_cohorts AS (
  SELECT
    first_product_id,
    COUNT(DISTINCT customer_id) AS cohort_size,
    -- Per first-buyer, not per-order: sum 90-day revenue, divide by buyers
    SUM(CASE WHEN days_since_first <= 90 THEN total_price END)
      / COUNT(DISTINCT customer_id) AS avg_revenue_90
  FROM customer_revenue
  WHERE first_order_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
  GROUP BY first_product_id
  HAVING cohort_size >= 100 -- filter out low-volume products
)
SELECT
  p.product_title,
  pc.cohort_size,
  ROUND(pc.avg_revenue_90, 2) AS avg_90_day_revenue_per_first_buyer
FROM product_cohorts pc
JOIN {{ ref('stg_shopify__products') }} p
  ON pc.first_product_id = p.product_id
ORDER BY avg_90_day_revenue_per_first_buyer DESC;

The output tells you which SKUs are actually customer-acquisition engines versus which ones are one-time-buyer dead ends. The range is usually 3x to 5x across products at mid-market brands, and it almost always surprises the team.

The retention curve

LTV numbers are meaningless without the retention curve underneath. The retention curve shows what percent of the cohort has made a second purchase by day X.

WITH second_orders AS (
  SELECT
    customer_id,
    first_order_at,
    DATE_TRUNC(DATE(first_order_at), MONTH) AS cohort_month,
    MIN(CASE WHEN days_since_first > 0 THEN days_since_first END)
      AS days_to_second
  FROM customer_revenue
  GROUP BY customer_id, first_order_at
)
SELECT
  cohort_month,
  COUNT(*) AS cohort_size,
  ROUND(100.0 * COUNTIF(days_to_second <= 30) / COUNT(*), 1)
    AS repeat_rate_30d,
  ROUND(100.0 * COUNTIF(days_to_second <= 90) / COUNT(*), 1)
    AS repeat_rate_90d,
  ROUND(100.0 * COUNTIF(days_to_second <= 365) / COUNT(*), 1)
    AS repeat_rate_365d
FROM second_orders
WHERE cohort_month >= DATE_SUB(CURRENT_DATE(), INTERVAL 18 MONTH)
GROUP BY cohort_month
ORDER BY cohort_month;

For most DTC brands, 30-day repeat sits between 15 and 30 percent, and 365-day repeat between 40 and 65 percent. Cohorts acquired through paid-social retargeting usually retain worse than cohorts from email or organic. Pair this with the LTV-by-channel query and you have a complete argument for reallocating acquisition spend.
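Pairing the two is a join on cohort_month, with each rollup dropped in as a CTE (elided here the same way the mart model below elides its base CTE):

-- Side by side: does low early retention predict low LTV?
WITH ltv AS ( ... ),        -- the monthly cohort LTV query above
     retention AS ( ... )   -- the retention query above
SELECT
  l.cohort_month,
  l.ltv_day_90,
  r.repeat_rate_90d
FROM ltv l
JOIN retention r USING (cohort_month)
ORDER BY l.cohort_month;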

The mart model that combines everything

mart_cohort_ltv is the dashboard-ready view.

-- models/marts/mart_cohort_ltv.sql
{{
  config(
    materialized='table',
    partition_by={'field': 'cohort_month', 'data_type': 'date'}
  )
}}

WITH customer_revenue AS ( ... ),  -- base CTE from above

cohort_rollup AS (
  SELECT
    cohort_month,
    acquisition_channel,
    COUNT(DISTINCT customer_id) AS cohort_size,
    SUM(CASE WHEN days_since_first <= 30 THEN total_price END) AS rev_30,
    SUM(CASE WHEN days_since_first <= 90 THEN total_price END) AS rev_90,
    SUM(CASE WHEN days_since_first <= 180 THEN total_price END) AS rev_180,
    SUM(CASE WHEN days_since_first <= 365 THEN total_price END) AS rev_365
  FROM customer_revenue
  GROUP BY cohort_month, acquisition_channel
)

SELECT
  cohort_month,
  acquisition_channel,
  cohort_size,
  ROUND(rev_30 / cohort_size, 2) AS ltv_30,
  ROUND(rev_90 / cohort_size, 2) AS ltv_90,
  ROUND(rev_180 / cohort_size, 2) AS ltv_180,
  ROUND(rev_365 / cohort_size, 2) AS ltv_365
FROM cohort_rollup
ORDER BY cohort_month, acquisition_channel;

This is the table the dashboard reads: roughly 18 months × 5 channels ≈ 90 rows. Lookups are instant and effectively free, and the dashboard refreshes on demand.

Averaging LTV across all customers hides the story. Cohorts are the story, and the SQL is not hard. The hard part is the discipline to never report a single number.

FAQ

What if my customer_id is different between Shopify and my other sources?

Reconcile identity first. customer_id should be your canonical id, not Shopify's. A staging-layer model should match Shopify customers to Klaviyo profiles (via email hash) and emit a single canonical customer_id. Until identity is clean, cohort LTV is unreliable.
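A minimal sketch of that staging model, assuming Klaviyo profiles land with a SHA-256 email hash (stg_klaviyo__profiles and its column names are assumptions; adjust to your loader's schema):

-- Canonical id: hashed email where available, Shopify id as fallback.
SELECT
  COALESCE(TO_HEX(SHA256(LOWER(TRIM(s.email)))),
           CONCAT('shopify:', s.customer_id)) AS canonical_customer_id,
  s.customer_id AS shopify_customer_id,
  k.klaviyo_profile_id
FROM {{ ref('stg_shopify__customers') }} s
LEFT JOIN {{ ref('stg_klaviyo__profiles') }} k
  ON TO_HEX(SHA256(LOWER(TRIM(s.email)))) = k.email_sha256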

How do I handle refunds in LTV?

Subtract refunds at the order level before aggregating. total_price - total_refunded is the net revenue. A customer who ordered $200, kept $150, and got $50 back should contribute $150 to LTV, not $200. Most Shopify raw schemas do not expose total_refunded directly; build it from the refunds[] sub-table.
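A sketch of that netting step, assuming the refunds sub-table is staged with one row per refund transaction (stg_shopify__refunds is an assumed staging model name):

WITH order_refunds AS (
  SELECT order_id, SUM(amount) AS total_refunded
  FROM {{ ref('stg_shopify__refunds') }}  -- assumed: order_id, amount
  GROUP BY order_id
)
SELECT
  o.order_id,
  o.total_price - COALESCE(r.total_refunded, 0) AS net_revenue
FROM {{ ref('stg_shopify__orders') }} o
LEFT JOIN order_refunds r USING (order_id)

Swap net_revenue in for total_price in the base CTE and every query downstream nets out refunds automatically.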

Should LTV include shipping and tax?

Pick one definition and be consistent. Most DTC brands report LTV as net merchandise revenue (excluding shipping and tax). The alternative is "gross transactional value." Either is fine; what is not fine is mixing them across dashboards.
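In Shopify's order schema the two definitions map to two existing columns, so the choice is a one-line swap (assuming the staging model passes both through):

SELECT
  order_id,
  subtotal_price AS net_merchandise_revenue,  -- after discounts, before shipping and tax
  total_price    AS gross_transaction_value   -- shipping and tax included
FROM {{ ref('stg_shopify__orders') }}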

How do I project LTV beyond my history?

Simplest approach: assume future LTV is equal to the oldest cohort's LTV at each horizon. If your oldest cohort has 400 days of history and their LTV at day 365 was $180, project all incomplete cohorts to $180 at day 365. More sophisticated: fit a Weibull or Gamma-Gamma model on retention rates. That is where notebooks like Hex earn their keep, not SQL.
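The simplest approach can stay in SQL as a single multiplier: scale each young cohort's 180-day LTV by the 365/180 ratio observed on mature cohorts (a sketch against the base CTE; a close cousin of "assume the oldest cohort's curve"):

WITH mature AS (
  SELECT
    SUM(CASE WHEN days_since_first <= 365 THEN total_price END)
      / SUM(CASE WHEN days_since_first <= 180 THEN total_price END) AS m
  FROM customer_revenue
  WHERE cohort_month <= DATE_SUB(CURRENT_DATE(), INTERVAL 13 MONTH)
)
SELECT
  cr.cohort_month,
  ROUND(SUM(CASE WHEN days_since_first <= 180 THEN total_price END)
    / COUNT(DISTINCT cr.customer_id) * mature.m, 2) AS projected_ltv_365
FROM customer_revenue cr
CROSS JOIN mature
WHERE cr.cohort_month > DATE_SUB(CURRENT_DATE(), INTERVAL 13 MONTH)
GROUP BY cr.cohort_month, mature.m
ORDER BY cr.cohort_month;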

Can I use this for subscription LTV?

Not directly. Subscription LTV needs a different model: churn rate, MRR decay curve, and contraction vs expansion revenue. The SQL above assumes discrete orders. For Recharge or Bold subscription stacks, build a separate mart_subscription_cohort_ltv model on subscription_contract data.

What to try this week

Run the base CTE against your warehouse. Then the 30/60/90/180/365-day cohort LTV query. Look at the month-over-month column for 90-day LTV. If the number is trending down, your acquisition is degrading even if top-line revenue looks fine.

If the query breaks on customer_id mismatches between Shopify and your other sources, that is the deeper blocker. A DTC Stack Audit scopes the identity-reconciliation work needed to make cohort LTV actually reliable.

Sources and specifics

  • The SQL patterns above run against a standard dbt-shopify staging model; see BigQuery for Shopify data for the schema.
  • Retention rate ranges are typical for mid-market DTC Shopify brands; premium subscription brands trend higher, impulse-commodity brands lower.
  • Mart model structure follows dbt conventions (staging → intermediate → mart).
  • Pattern drawn from the Q1 2026 analytics engine case study where a handful of cohort-centric dashboards replaced multiple CSV exports.


Let us talk

If something in here connected, feel free to reach out. No pitch deck, no intake form. Just a direct conversation.

Get in touch