Agent-orchestrated Shopify theme builds: a working pipeline

I built a Shopify theme with an agent crew earlier this year. Production sections, metafield-driven layouts, live on a DTC storefront. Here's what the pipeline actually looks like from the inside - including the parts that still break.

What I was trying to solve with agent-orchestrated Shopify work

Every Shopify theme section is three parts minimum: the schema JSON (which ships inside the section file's {% schema %} tag), the Liquid markup, and usually a locales entry. A mid-size theme might have thirty sections. Writing them by hand isn't hard, just tedious, and the tedium is what makes it slow. The cognitive overhead isn't in the Liquid logic, it's in the schema boilerplate: remembering input types, getting the block nesting right, keeping naming conventions consistent across sections.

The other problem: generic AI output gets DTC specifics wrong. A prompt like "create a product hero section" will give you something technically valid but commercially naive - no metafield bindings, no conditional visibility for bundles, no sensible defaults for the layout settings a merchant will actually touch. The model doesn't know what your specific DTC storefront needs. It just knows Liquid syntax.

How the four-agent pipeline works

I ended up with four agents running in sequence:

Spec agent - Takes a plain-English description of what a section should do and outputs a typed spec object. Field names, input types, block configurations, responsive behavior notes. This is the most valuable step even if you run nothing else.
Schema agent - Takes the spec and writes schema.json. Gets the input type selection right because the spec has already resolved the ambiguity.
Codegen agent - Takes the spec and schema together and writes the section Liquid. Knows to reference settings by the names the schema just defined.
QA agent - Validates the output against Shopify Online Store 2.0 schema rules and flags style drift from the established theme conventions.
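
Here is a rough sketch of that sequence in Python. It only shows the shape of the hand-off: each stage reads what the earlier stages wrote. The names (SectionArtifacts, run_pipeline, the four stage functions) are placeholders I made up for illustration, and the stage bodies are stubs standing in for model calls, not the actual prompts or pipeline code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SectionArtifacts:
    """Everything the pipeline accumulates for one section."""
    description: str
    spec: dict = field(default_factory=dict)
    schema: dict = field(default_factory=dict)
    liquid: str = ""
    qa_flags: list[str] = field(default_factory=list)


Stage = Callable[[SectionArtifacts], SectionArtifacts]


def run_pipeline(description: str, stages: list[Stage]) -> SectionArtifacts:
    """Run the stages in order; each one reads what earlier stages wrote."""
    artifacts = SectionArtifacts(description=description)
    for stage in stages:
        artifacts = stage(artifacts)
    return artifacts


# Hypothetical stand-ins: a real stage would call a model here.
def spec_agent(a: SectionArtifacts) -> SectionArtifacts:
    a.spec = {
        "name": "Hero",
        "settings": [{"id": "heading", "type": "text", "label": "Heading"}],
    }
    return a


def schema_agent(a: SectionArtifacts) -> SectionArtifacts:
    # The schema is derived from the spec, never from the raw description.
    a.schema = {"name": a.spec["name"], "settings": a.spec["settings"]}
    return a


def codegen_agent(a: SectionArtifacts) -> SectionArtifacts:
    # Reference settings by the ids the schema just defined.
    ids = [s["id"] for s in a.schema["settings"]]
    a.liquid = "\n".join(f"{{{{ section.settings.{i} }}}}" for i in ids)
    return a


def qa_agent(a: SectionArtifacts) -> SectionArtifacts:
    if not a.schema.get("name"):
        a.qa_flags.append("schema is missing a name")
    if not a.liquid.strip():
        a.qa_flags.append("codegen produced no Liquid")
    return a


result = run_pipeline(
    "hero with a heading",
    [spec_agent, schema_agent, codegen_agent, qa_agent],
)
print(result.liquid)
print(result.qa_flags)
```

Keeping the stages as plain functions over one shared record makes it easy to run only the first stage, which is how week 1 below worked.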

Human judgment comes in at three points: spec review (does this spec actually describe what I want), block nesting decisions (complex structures need a human's eye on the spec before it goes downstream), and QA triage (the QA agent flags everything, but some flags are noise).
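
To make the QA stage less abstract, here is a sketch of one purely mechanical check it could run before any model-based review: compare the setting ids the Liquid references against the ids the schema defines. The function name and the report shape are my own assumptions for illustration. The regex only catches the plain section.settings.<id> form, so it is a first filter, not a full Liquid parser.

```python
import re

# Matches references such as section.settings.hero_image
SETTING_REF = re.compile(r"section\.settings\.([A-Za-z_][A-Za-z0-9_]*)")


def check_setting_refs(liquid: str, schema: dict) -> dict:
    """Report setting ids used but not defined, and defined but not used."""
    # Sidebar-only settings (header, paragraph) carry no id, so skip them.
    defined = {s["id"] for s in schema.get("settings", []) if "id" in s}
    referenced = set(SETTING_REF.findall(liquid))
    return {
        "undefined": sorted(referenced - defined),
        "unused": sorted(defined - referenced),
    }


schema = {
    "name": "Hero",
    "settings": [
        {"type": "header", "content": "Images"},
        {"type": "image_picker", "id": "hero_image", "label": "Image"},
    ],
}
liquid = (
    "{{ section.settings.image_desktop | image_url: width: 1200 }}\n"
    "{{ section.settings.hero_image | image_url: width: 600 }}"
)
print(check_setting_refs(liquid, schema))
```

An "undefined" hit is almost always a real bug. An "unused" hit is the kind of flag that can be noise, since a setting may be consumed by a snippet or a stylesheet rather than the section markup.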

Week 1 - The spec agent alone is worth it

I started by running just the spec agent in isolation for the first week. No codegen yet. Just: describe a section in plain English, get a structured spec back, review it.

What I learned immediately: the spec step forces clarity I was skipping before. When you're writing a section by hand, you often start with the Liquid before you've fully decided what the section actually does. Setting names get made up mid-file, and you end up with section.settings.image_desktop in some places and section.settings.hero_image in others because you changed your mind halfway through.

The spec agent makes you commit before you write a line of Liquid. That alone saved revision time on every section I ran through it, even before the codegen was wired up.
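
For a sense of what "committing" means in practice, here is a minimal sketch of a typed spec in Python. The class and field names are illustrative assumptions, not the format the pipeline actually uses, and KNOWN_INPUT_TYPES is a deliberately partial list of Shopify input setting types.

```python
from dataclasses import dataclass, field

# Partial list of Shopify input setting types, enough for the example.
KNOWN_INPUT_TYPES = {
    "text", "textarea", "richtext", "image_picker", "url",
    "checkbox", "select", "range", "color",
}


@dataclass(frozen=True)
class SettingSpec:
    id: str
    type: str
    label: str


@dataclass
class SectionSpec:
    name: str
    settings: list[SettingSpec] = field(default_factory=list)
    responsive_notes: str = ""

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the spec is usable."""
        issues = []
        seen = set()
        for s in self.settings:
            if s.id in seen:
                issues.append(f"duplicate setting id: {s.id}")
            seen.add(s.id)
            if s.type not in KNOWN_INPUT_TYPES:
                issues.append(f"unknown input type for {s.id}: {s.type}")
        return issues


spec = SectionSpec(
    name="Product hero",
    settings=[
        SettingSpec("hero_image", "image_picker", "Hero image"),
        SettingSpec("heading", "text", "Heading"),
    ],
    responsive_notes="Stack image above text below 750px.",
)
print(spec.problems())
```

Because every setting id is declared once, up front, the image_desktop versus hero_image drift has nowhere to start.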

By the end of week 1, I had specs for eight sections. All of them were cleaner than anything I'd written starting from scratch.

Week 2 - Wiring schema and codegen together

The first failure was nested blocks. A section with multiple block types - say, a features grid where each block can be either an icon card or a text card - broke the schema agent's output. It was producing valid JSON but with the wrong block structure: nesting blocks inside blocks in a way Shopify's schema parser rejects.

The fix was patching the spec format to explicitly flag block hierarchy. Once the spec included a blockTypes array with each type's allowed parent, the schema agent got it right. The codegen agent followed because it was reading the same spec.
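
A sketch of how that hierarchy flag can be enforced before the schema is written. The blockTypes array is the spec field described above; the allowedParent key name and the build_schema_blocks helper are my guesses at a reasonable shape, not the real format. The sketch assumes the only valid parent in a section schema is the section itself, so anything else is rejected rather than nested.

```python
def build_schema_blocks(spec: dict) -> list[dict]:
    """Turn spec blockTypes into a flat schema 'blocks' array.

    Raises ValueError if a block type claims another block as its parent,
    because the section schema has no way to express that nesting.
    """
    blocks = []
    for block_type in spec.get("blockTypes", []):
        parent = block_type.get("allowedParent")
        if parent != "section":
            raise ValueError(
                f"block type {block_type['type']!r} declares parent {parent!r}; "
                "only 'section' is allowed"
            )
        blocks.append({
            "type": block_type["type"],
            "name": block_type["name"],
            "settings": block_type.get("settings", []),
        })
    return blocks


features_grid_spec = {
    "name": "Features grid",
    "blockTypes": [
        {"type": "icon_card", "name": "Icon card", "allowedParent": "section"},
        {"type": "text_card", "name": "Text card", "allowedParent": "section"},
    ],
}
print([b["type"] for b in build_schema_blocks(features_grid_spec)])
```

Failing loudly at this point is the design choice that matters: a bad hierarchy gets caught in the spec, where it is cheap to fix, instead of surfacing as a schema Shopify rejects.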

The second issue was conditional render paths. Sections that show different layouts based on a setting (say, layout: "stacked" | "split") required the codegen agent to write branching Liquid. It handled simple binary conditions fine. It started producing malformed Liquid when the conditions nested three levels deep. Those became hand-write cases.
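
One way to spot those cases mechanically is to measure how deep the conditionals nest. The sketch below counts opening and closing if, unless, and case tags with a regex. The function name and the threshold of three are assumptions drawn from the behavior described above. It ignores raw and comment blocks, so treat the result as a hint rather than a parse.

```python
import re

# Opening and closing conditional tags, with optional whitespace control ({%-).
CONDITIONAL_TAG = re.compile(
    r"\{%-?\s*(if|unless|case|endif|endunless|endcase)\b"
)

HAND_WRITE_DEPTH = 3  # assumed cut-off: three nested levels


def max_conditional_depth(liquid: str) -> int:
    """Return the deepest nesting of if/unless/case blocks in the template."""
    depth = 0
    deepest = 0
    for match in CONDITIONAL_TAG.finditer(liquid):
        if match.group(1).startswith("end"):
            depth = max(depth - 1, 0)  # tolerate a stray closing tag
        else:
            depth += 1
            deepest = max(deepest, depth)
    return deepest


sample = """
{% if section.settings.layout == 'split' %}
  {% if section.settings.image %}
    {% unless section.settings.hide_caption %}caption{% endunless %}
  {% endif %}
{% elsif section.settings.layout == 'stacked' %}
  stacked
{% endif %}
"""
depth = max_conditional_depth(sample)
print(depth, depth >= HAND_WRITE_DEPTH)
```

Branches such as elsif, else, and when do not change the depth, so a wide but shallow layout switch still counts as one level.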

Where the pipeline still breaks

Adjacent-section awareness. Some sections change their appearance based on what's above or below them in the template. The pipeline has no concept of this. Every section it produces assumes it's standalone.

Dynamic source blocks. Metaobjects with multiple content types - where a single block can render as an image, a video, or a rich text block depending on which fields are populated - require conditional logic the codegen agent produces inconsistently. I've stopped trying to pipeline these.

Conditional CSS class logic. When a section needs to apply different Tailwind or custom class combinations based on setting values, the codegen output is usually wrong on the first pass. It understands Liquid conditionals fine; the breakdown is in knowing which classes to apply and when.

The pattern across all three: the failures happen where the output depends on context outside the spec.

The pipeline works best on self-contained sections with well-scoped inputs and outputs.

What this pipeline is actually good for

First drafts and scaffolding on sections you could write yourself. That's the real answer.

If you're a Shopify developer who knows what you're doing, this pipeline cuts your time-to-first-draft by something like 80% on standard sections. The QA pass catches the easy mistakes. You still read every line before it ships.

If you're not a Shopify developer, it produces output you won't be able to review competently. The pipeline doesn't replace the expertise, it accelerates it. If you're unsure whether your current DTC tech stack is set up to support this kind of work at all, the DTC Stack Audit is where I'd start.

The moments to skip the pipeline: complex conditional sections, anything requiring adjacent-section context, and any section where the creative direction is still unresolved. Agents are bad at "I'm not sure yet." They'll produce something confident and wrong.
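
That skip list is simple enough to write down as a triage function. This is a sketch under assumptions: SectionTraits and its fields are hypothetical labels a developer would fill in by hand when scoping a section, and the depth threshold of three comes from the nested-conditional failures described earlier.

```python
from dataclasses import dataclass


@dataclass
class SectionTraits:
    """Hand-filled answers about a section before any agent runs."""
    conditional_depth: int = 0
    needs_adjacent_context: bool = False
    uses_dynamic_metaobject_blocks: bool = False
    creative_direction_settled: bool = True


def skip_reasons(traits: SectionTraits) -> list[str]:
    """Return reasons to hand-write the section; empty means run the pipeline."""
    reasons = []
    if traits.conditional_depth >= 3:
        reasons.append("conditionals nest three or more levels deep")
    if traits.needs_adjacent_context:
        reasons.append("appearance depends on neighboring sections")
    if traits.uses_dynamic_metaobject_blocks:
        reasons.append("blocks render differently per populated metaobject field")
    if not traits.creative_direction_settled:
        reasons.append("creative direction is still unresolved")
    return reasons


print(skip_reasons(SectionTraits()))
print(skip_reasons(SectionTraits(needs_adjacent_context=True)))
```

Returning reasons instead of a bare yes or no keeps the decision reviewable when someone later asks why a section was written by hand.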

I've used this across a full Shopify theme build and several standalone section additions since. The pipeline is now the default first step for any net-new section; the exceptions are specific and known. A bigger version of the same idea now runs on Drupal, where an AI prompt system I built turns a raw product title into a fully populated product page.

If you want the pipeline packaged with the DTC schema patterns and a set of section specs that cover the most common DTC section types, that's what the Shopify Theme Starter includes. The same four-agent structure, documented prompts, and a working first section set.

The agent orchestration pattern I use here also shows up in the agent council case study, which covers how the same crew-based approach applies to analytics pipeline work.

Does this work with any Shopify theme or just custom ones?

It works best with custom themes where you control the conventions. Adapting the pipeline to an existing purchased theme means updating the spec format and QA rules to match that theme's naming patterns. It's doable but adds a setup step.

Do I need a specific AI model for this?

I used Claude for all four agents. The spec and QA agents benefit from a model that's good at structured output - the spec in particular produces a JSON object the other agents consume, so schema accuracy matters. A weaker model at the spec step breaks everything downstream.
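
Whatever model you use, it is worth gating the spec before anything downstream reads it. Below is a minimal sketch of such a gate, assuming the spec arrives as a raw JSON string and that name and settings are the required top-level keys. Both the key list and the parse_spec helper are illustrative, not the real spec format.

```python
import json
from typing import Optional

# Assumed minimum shape of a spec: key name -> expected Python type.
REQUIRED_KEYS = {"name": str, "settings": list}


def parse_spec(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse model output; return (spec, []) if usable, else (None, errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"{key} should be {expected.__name__}")
    return (data if not errors else None), errors


print(parse_spec('{"name": "Hero", "settings": []}'))
print(parse_spec('{"name": "Hero"}'))
```

If the gate returns errors, the cheap response is to re-run the spec step with the error list attached rather than letting the schema and codegen stages build on a broken object.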

How do you handle versioning when agents update sections?

The same way you'd handle any code change: through git. Sections go through review before merge, agent-generated or not. I don't treat agent output as trusted code. It goes through the same pull request flow as anything hand-written.

What's the learning curve for setting up the pipeline?

If you're comfortable with the Claude API and have built a Shopify theme before, you can get a working version of the spec and schema agents running in a day. The codegen agent takes longer to tune because the output quality is sensitive to how you describe the section's Liquid conventions in the system prompt.

Is the 80% time reduction figure consistent?

On standard sections, yes. On sections that require the hand-write cases I described above, the pipeline is neutral to negative - it produces something that looks like it works but needs significant correction. Knowing which category a section falls into before you run the pipeline is most of the skill.

Sources and specifics

Pipeline built and shipped in Q1 2025 as part of a Shopify theme build for a DTC storefront.
Four-agent structure: spec, schema, codegen, QA - each agent consumes the output of the prior stage.
Approximately 80% first-draft time reduction measured across 20+ standard sections.
All output validated against Shopify Online Store 2.0 schema before handoff to theme review.
Complex failure cases (adjacent-section awareness, dynamic metaobject blocks) remain hand-write scenarios as of Q1 2025.