A DTC brand's Shopify data is the single most important dataset in their warehouse and the one most commonly landed in the wrong shape. Teams copy the raw API response into BigQuery, run a SELECT * against it, and wonder why a single cohort-LTV query costs $12 to run.
This is the schema you will not regret. It covers how to land raw Shopify data in BigQuery so that the warehouse stays cheap, queries stay fast, and the dbt models on top do not break when Shopify renames a field next quarter.
It fits into the warehouse-first analytics rebuild hub at stage 2, the ingest layer. The cost math here is the same math I used when shipping the Q1 2026 analytics engine case study.
With a partitioned table, queries scan a single date partition: a one-month cohort query against 18 months of history reads roughly 1/18th of the table.
The three-layer Shopify schema in BigQuery
The schema has three layers, and every one has a specific job.
Layer 1: raw. Exactly what the source sends, with no transformation. Columns are _airbyte_raw_data or _fivetran_data depending on connector, plus an _ingested_at timestamp.
Layer 2: staging. Renamed, type-cast, filtered. This is where shopify_raw.orders becomes stg_shopify__orders with canonical column names, NUMERIC prices instead of strings, and no test orders.
Layer 3: intermediate and mart. Joins across sources, business logic, cohort math. Conventional dbt naming is int_* and mart_*.
The mistake most teams make is building dashboards against raw or against a single staging model. Dashboards should read mart models. See the hub on warehouse-first rebuilds for why.
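To make the layer contract concrete, here is a minimal sketch of a mart model that reads only from staging, never raw (the view and dataset names are illustrative, not from the schema below):

```sql
-- Illustrative mart model: reads staging, never raw.
CREATE OR REPLACE VIEW `project.marts.mart_orders_daily` AS
SELECT
  DATE(created_at) AS order_date,
  COUNT(*) AS orders,
  SUM(total_price) AS revenue
FROM `project.staging.stg_shopify__orders`
GROUP BY order_date;
```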
Partitioning and clustering: the cost lever
BigQuery charges per TB scanned at $6.25 on-demand pricing (as of April 2026, see cloud.google.com/bigquery/pricing). Partitioning and clustering are how you avoid scanning the whole table on every query.
For Shopify orders, the schema I use:
```sql
CREATE TABLE `project.staging.stg_shopify__orders` (
  order_id             STRING NOT NULL,
  shopify_order_number INT64,
  customer_id          STRING,
  email_hashed         STRING,
  financial_status     STRING,
  fulfillment_status   STRING,
  currency             STRING,
  subtotal_price       NUMERIC,
  total_tax            NUMERIC,
  total_shipping       NUMERIC,
  total_price          NUMERIC,
  total_discounts      NUMERIC,
  source_name          STRING,
  landing_site_ref     STRING,
  referring_site       STRING,
  tags                 ARRAY<STRING>,
  line_items ARRAY<STRUCT<
    line_item_id   STRING,
    product_id     STRING,
    variant_id     STRING,
    sku            STRING,
    title          STRING,
    quantity       INT64,
    price          NUMERIC,
    total_discount NUMERIC
  >>,
  created_at   TIMESTAMP NOT NULL,
  updated_at   TIMESTAMP,
  cancelled_at TIMESTAMP,
  _ingested_at TIMESTAMP
)
PARTITION BY DATE(created_at)
CLUSTER BY customer_id, financial_status;
```
Three things to notice. First, line_items is a nested repeated struct, not a separate table. BigQuery handles nested arrays natively and this saves a join on every order query. Second, partitioning is on DATE(created_at), not on _ingested_at, because operator queries almost always filter by order date. Third, clustering is on customer_id and financial_status because those are the two columns that appear in every mart model.
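As a sketch of what the nested shape buys you, a per-SKU query reads line_items in place with UNNEST instead of joining a separate line-items table:

```sql
-- Flatten the repeated line_items struct in place; no join needed.
SELECT
  o.order_id,
  li.sku,
  li.quantity,
  li.price
FROM `project.staging.stg_shopify__orders` AS o,
  UNNEST(o.line_items) AS li
WHERE DATE(o.created_at) = '2026-04-01';  -- also prunes to one partition
```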
With this shape, a cohort query that filters to a single month scans only that month's partitions, roughly 1/18th of the 18-month table. At 10K orders per month, that is about 30 MB per query, which is free (BigQuery's first 1 TB scanned per month is free).
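For example, a hypothetical one-month cohort query touches only the March partitions because of the created_at filter:

```sql
-- The DATE(created_at) predicate prunes to ~31 daily partitions.
SELECT
  customer_id,
  COUNT(*) AS orders,
  SUM(total_price) AS revenue
FROM `project.staging.stg_shopify__orders`
WHERE DATE(created_at) BETWEEN '2026-03-01' AND '2026-03-31'
  AND financial_status = 'paid'  -- clustered column narrows the scan further
GROUP BY customer_id;
```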
The Shopify raw fields that matter for DTC
Shopify exposes hundreds of fields via the Admin API. Most DTC mart models only need about 30. Here is the ranking by how often I actually use them.
| Tier | Fields | Why |
|---|---|---|
| Must have | order_id, customer_id, created_at, total_price, currency, financial_status | Every mart model |
| Very useful | line_items[], source_name, landing_site_ref, referring_site, tags | Attribution, cohort filters |
| Useful | discount_codes, cancelled_at, refunds[], customer_accepts_marketing | Specific dashboards |
| Rarely | note_attributes, risk_level, processing_method | Fraud or compliance |
| Almost never | browser_ip, checkout_token, gateway | Debug only |
Most teams land everything. That is fine for raw; you are paying for storage at $0.02 per GB per month. The discipline is in staging: only project the fields you actually use into the typed staging model. It makes the schema readable and it keeps query costs down.
The dbt staging model that works
Here is the dbt staging model I start every new project with:
```sql
-- models/staging/shopify/stg_shopify__orders.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    partition_by={'field': 'created_at', 'data_type': 'timestamp'},
    cluster_by=['customer_id', 'financial_status']
  )
}}

WITH source AS (
  SELECT * FROM {{ source('shopify_raw', 'orders') }}
  {% if is_incremental() %}
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
  {% endif %}
),

renamed AS (
  SELECT
    CAST(id AS STRING) AS order_id,
    order_number AS shopify_order_number,
    CAST(customer_id AS STRING) AS customer_id,
    TO_HEX(SHA256(LOWER(TRIM(email)))) AS email_hashed,
    financial_status,
    fulfillment_status,
    currency,
    CAST(subtotal_price AS NUMERIC) AS subtotal_price,
    CAST(total_tax AS NUMERIC) AS total_tax,
    CAST(total_shipping_price_set.shop_money.amount AS NUMERIC)
      AS total_shipping,
    CAST(total_price AS NUMERIC) AS total_price,
    CAST(total_discounts AS NUMERIC) AS total_discounts,
    source_name,
    landing_site_ref,
    referring_site,
    SPLIT(tags, ', ') AS tags,
    line_items,
    created_at,
    updated_at,
    cancelled_at,
    CURRENT_TIMESTAMP() AS _ingested_at
  FROM source
  WHERE test = FALSE
    AND financial_status != 'voided'
)

SELECT * FROM renamed
```
Three decisions to note. Incremental materialization means dbt only processes new and updated rows, which matters at 100K+ orders per month. The email hash happens at the staging layer so downstream models never see PII. The tags column is split from a comma-separated string into an array so operators can filter WHERE 'gift' IN UNNEST(tags) without parsing strings.
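The tags filter in practice, as a sketch:

```sql
-- Array membership test; no string parsing, no LIKE '%gift%' false hits.
SELECT order_id, total_price
FROM `project.staging.stg_shopify__orders`
WHERE 'gift' IN UNNEST(tags)
  AND DATE(created_at) >= '2026-01-01';
```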
Cost math for a real brand
A DTC brand doing 10K orders per month, with 18 months of history, running daily analyst queries and a live operator dashboard. Actual numbers from a brand I rebuilt in Q1 2026:
- Storage: 180K total orders in Shopify, plus sessions, events, and ads data. Total warehouse storage ~4.5 GB. Cost: $0.09 per month.
- Query cost: ~400 operator-queries per month at ~50 MB scanned each on a partitioned table. Total scanned: ~20 GB. Cost: free (under the 1 TB free tier).
- Dashboard cost: 6 dashboards auto-refreshing every 10 minutes on Looker Studio. Cached queries scan ~2 MB each. Cost: negligible.
Total BigQuery cost for the brand: under $5 per month. The real cost of the stack is almost entirely in the ingestion layer (Fivetran or equivalent), not in storage or queries. See the DTC data warehouse on $1,000 a month field notes for the full budget line items.
What breaks and how to catch it
Three specific Shopify drift patterns that break downstream mart models.
Variants renamed mid-year. Shopify lets merchants rename variants without changing variant_id. If your mart model joins on variant_title, your time-series report for a single product shows an unexplained discontinuity at the rename date. Always join on variant_id, never on title.
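A sketch of the safe shape: key the time series on variant_id and attach titles only at read time, never as the key:

```sql
-- Units per variant per month, keyed on the stable variant_id.
-- Titles drift; join them from a dimension table for display only.
SELECT
  li.variant_id,
  DATE_TRUNC(DATE(o.created_at), MONTH) AS order_month,
  SUM(li.quantity) AS units
FROM `project.staging.stg_shopify__orders` AS o,
  UNNEST(o.line_items) AS li
GROUP BY li.variant_id, order_month;
```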
Currency changes mid-order-lifecycle. If a brand converts from single-currency to multi-currency mid-year, historical orders have currency = NULL and new orders have currency = "USD". A revenue mart that does not handle NULL currency will drop old orders. Default to "USD" at the staging layer if NULL, or filter to post-cutover only.
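The staging-layer default might look like one line in the renamed CTE (this assumes all pre-cutover orders were USD — verify against your own history before applying):

```sql
-- In the staging model's SELECT list, replacing the bare currency column.
-- Assumption: every pre-multi-currency order was USD.
COALESCE(currency, 'USD') AS currency,
```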
Subscription rebills showing up as regular orders. Recharge and Bold fire Shopify orders for rebills. If you do not exclude them or tag them, your new-customer cohort report counts rebills as new customers. The source_name field is your friend here; subscription rebills usually have source_name = "subscription_contract".
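A hedged sketch of a new-customer cohort that excludes rebills (check the source_name values your subscription app actually emits before relying on this):

```sql
-- First order per customer, excluding subscription rebills.
-- The 'subscription_contract' value is an assumption; verify it.
SELECT
  customer_id,
  MIN(DATE(created_at)) AS first_order_date
FROM `project.staging.stg_shopify__orders`
WHERE COALESCE(source_name, '') != 'subscription_contract'
GROUP BY customer_id;
```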
"A schema is a commitment to future queries. The best time to design it is before the first dashboard. The second best time is now."
FAQ
Should I use Fivetran, Airbyte, or build my own Shopify ingestion?
Fivetran if you value time-to-value and have budget; $500 to $1,200 per month for a typical mid-market brand. Airbyte if you want flexibility and are willing to run your own ELT infrastructure. Build your own only if you have specific requirements (real-time streaming, headless multi-store, custom rate-limit handling).
Why partition by created_at instead of _ingested_at?
Operator queries filter by order date, not ingestion date. Partitioning by created_at means an "orders in Q2 2026" query scans only the Q2 partitions. Partitioning by _ingested_at would scan all rows that happened to be ingested during Q2, including backfills and late updates, which is rarely what you want.
What about Shopify Plus BigQuery connector?
Shopify's own BigQuery connector for Plus merchants is new and still maturing as of April 2026. It is thinner than Fivetran or Airbyte but $0 to run. For Plus merchants with straightforward schemas it is worth testing; for complex setups the third-party connectors are still more reliable.
Do I need dbt or can I do this with raw SQL?
You can do this with raw SQL, and for a single-dashboard setup raw is fine. Once you have 3+ mart models and multiple dashboards, dbt's model graph, testing, and incremental materialization earn their keep. dbt-core self-hosted is free; dbt Cloud is $100 per developer per month.
How do I handle deleted orders?
Shopify soft-deletes cancelled orders: cancelled_at is set and financial_status and fulfillment_status are updated, but the row remains. Your staging filter should exclude only voided orders (not cancelled ones; cancelled orders still count for cohort membership). Hard deletes via GDPR right-to-erasure are rarer; when they happen, Fivetran and Airbyte both support syncing deleted records.
What to try this week
Run this query against your existing Shopify BigQuery table:
```sql
SELECT
  table_name,
  partition_id,
  ROUND(total_logical_bytes / POW(10, 9), 2) AS gb
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE 'orders%'
ORDER BY partition_id;
```
If you see a single row with a NULL partition_id and tens of GB, your table is unpartitioned and every query is scanning the whole thing. Rebuild as partitioned. It takes 20 minutes and cuts most query costs by 20x or more.
If you have not even landed Shopify in BigQuery yet, a DTC Stack Audit will tell you whether the analytics problem you have is a data-source problem or a modeling problem before you commit to the rebuild path.
Sources and specifics
- BigQuery pricing: cloud.google.com/bigquery/pricing, $6.25 per TB on-demand, $0.02 per GB per month active storage.
- dbt-shopify package: github.com/fivetran/dbt_shopify (Fivetran-maintained), and the Rittman Analytics open-source equivalent.
- The schema shape above is derived from Q1 2026 production use on a multi-source analytics engine I shipped, documented as the analytics engine case study.
- Cost numbers are actuals from a 10K-orders-per-month engagement; your mileage varies with query complexity and dashboard refresh cadence.
