Twelve months ago I started writing Shopify code inside a loop where an agent wrote the first pass and I reviewed. Today roughly 70% of my Liquid production is that shape. This is the retrospective: what worked, what I kept getting wrong, and the rhythm that survived after the pipeline stopped feeling new.
What I shipped in the 12 months
Two DTC Shopify engagements, one AI-driven page-generation CLI, and a set of Claude Code skills I use day to day. The engagements were full theme builds for mid-market DTC brands; the CLI was the one I wrote up previously (see the agent-orchestrated Shopify build pipeline for the architecture). The skills are a running toolkit that grows as new patterns stabilize.
Across those, the agent did the first pass on roughly 150 Liquid sections, a similar number of schema JSONs, probably 80 metafield migration scripts, and most of my Admin API glue code. I reviewed everything. I rewrote maybe 25% of it. The rest shipped close to how the agent produced it.
What worked
The scaffolding loop for individual sections was the clear win. Given a well-specified brief (name, metafield contract, layout hint, variant logic, fallback behavior), a Claude Code agent produces a section schema plus Liquid template that compiles and passes basic theme-check inside a minute or two. Before this loop, the same work took me 30-45 minutes per section. Now it takes 10-15 including review.
Schema JSON in particular is the thing I stopped hand-writing. Shopify's section schema syntax is strict, repetitive, and easy to get wrong (nested block types, camelCase vs snake_case, the input-type enum). Agents handle it better than I do because they don't get bored at block definition #40.
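For context, this is the artifact being scaffolded. A minimal, illustrative section schema (the setting ids and block type here are placeholders, not from a shipped theme) shows the snake_case keys and input-type enum values that make hand-writing it tedious:

```liquid
{% schema %}
{
  "name": "Feature list",
  "settings": [
    { "type": "text", "id": "heading", "label": "Heading" },
    { "type": "image_picker", "id": "background_image", "label": "Background" }
  ],
  "blocks": [
    {
      "type": "feature",
      "name": "Feature",
      "settings": [
        { "type": "richtext", "id": "body", "label": "Body" }
      ]
    }
  ],
  "max_blocks": 6,
  "presets": [{ "name": "Feature list" }]
}
{% endschema %}
```

Every `blocks` entry needs a matching branch in the template's `{% for block in section.blocks %}` loop, and every setting `type` must come from Shopify's input-type enum (`text`, `richtext`, `image_picker`, and so on). That pairing is exactly the repetitive detail an agent stays patient through.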
Predictable architecture paid off more than I expected. The thinner my sections and the cleaner my metafield contracts, the more reliably the agent produces correct first drafts. The Shopify theme layering I run at 2M+ describes the architecture that makes agent-paired development viable; without it, the agent fails more than it succeeds.
Parallel work inside a single session. Once the loop was stable, I could have the agent scaffolding a new section in one branch while I finished reviewing the last one, using git worktrees to keep the contexts separate. That doubled my sustainable throughput on a long afternoon.
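The worktree setup is a couple of commands. This sketch builds a throwaway repo to show the shape; the paths and branch name are placeholders:

```shell
set -e
repo="$(mktemp -d)/theme"
git init -q "$repo"
git -C "$repo" -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "theme scaffold"
# The agent scaffolds the next section on its own branch, in its own checkout,
# while review continues in the main working tree.
git -C "$repo" worktree add -q "$repo-hero" -b section/hero
git -C "$repo" worktree list
```

When the section merges, `git worktree remove` cleans up the extra checkout.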
What didn't
Cross-section logic. Anything that touches more than one file (a new global snippet that three sections import, a shared JS utility for the cart drawer, a schema change that cascades across the theme) is where the agent regularly gets it wrong. Not subtly wrong. Confidently wrong. It'll produce a snippet that works in isolation but breaks the caller.
I spent a month trying to build prompts that handled cross-file reasoning automatically. Then I stopped. The loop now explicitly keeps cross-section work to me and narrows the agent's scope to one file at a time.
Metafield migrations on live data. I tried to let the agent write and run migration scripts against a staging copy of production metafields. It produced scripts that worked on the shapes it could see and silently broke on edge cases (null fields, weird legacy values, metafields set via app imports with unusual types). I now treat migration scripting as a "write with agent, verify on staging by hand, run in batches of 100 against prod with rollback" process, not a one-shot automation.
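The batched-with-rollback shape is easier to show than to describe. This is a sketch, not the actual migration code: the metafield dicts stand in for Admin API responses, and the normalization cases mirror the edge cases that broke the agent's one-shot scripts:

```python
BATCH_SIZE = 100  # run against prod in batches of 100, per the process above

def normalize_value(raw):
    """Defensively coerce the legacy shapes that silently broke the agent's scripts."""
    if raw is None:
        return None                 # null fields: skip, never overwrite
    if isinstance(raw, (int, float)):
        return str(raw)             # app imports sometimes store bare numbers
    return str(raw).strip()         # trim whitespace from hand-entered values

def plan_batches(metafields, batch_size=BATCH_SIZE):
    """Split the migration into rollback-sized batches, dropping null values."""
    todo = []
    for mf in metafields:
        value = normalize_value(mf.get("value"))
        if value is not None:
            todo.append((mf["id"], value))
    return [todo[i:i + batch_size] for i in range(0, len(todo), batch_size)]
```

Each batch is small enough to verify by hand on staging and to roll back in isolation if a prod run goes sideways.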
Context window thrash on large themes. The theme I shipped in Q4 was 40-50 sections across 80 Liquid files. Giving the agent full context was either too expensive or too lossy. I ended up curating a per-task context, often just the one section plus its metafield contract plus the theme's config schema, and that worked better than trying to cram everything in.
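Curation in practice is a small helper that gathers the task's file set. This sketch assumes an Online Store 2.0 layout (`sections/`, `config/`); the function name and arguments are illustrative:

```python
from pathlib import Path

def build_task_context(theme_root, section_name, extra_paths=()):
    """Return the minimal file set for a single-section task: the section
    itself, the theme's config schema, plus any explicitly named extras."""
    root = Path(theme_root)
    candidates = [
        root / "sections" / f"{section_name}.liquid",
        root / "config" / "settings_schema.json",
        *(root / p for p in extra_paths),  # e.g. a sibling section as a pattern reference
    ]
    return [p for p in candidates if p.is_file()]
```

Anything the brief doesn't name stays out of the prompt.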
Overconfidence in tests. Shopify's theme-check catches syntax errors and some policy issues, but it doesn't tell you whether the section renders correctly on a real product with the actual metafield values. I got burned twice by shipping theme-check-clean work that broke the rendered page on specific products. Now the loop includes a manual preview on at least two real product IDs before I consider the section done.
The actual dev loop rhythm I run now
A single section, end to end, runs about 45 to 90 minutes. Here's the shape it takes.
Brief phase, 10 minutes. I write the section brief myself: what it renders, what metafields it reads, what the layout variants are, what the fallback behavior is when metafields are empty. This is the deliberate part. Agents produce garbage from vague briefs.
Scaffold phase, 10 minutes. The agent produces the schema JSON, the Liquid template, and any tiny JS it needs. I read it. I'm not inspecting word-by-word; I'm checking the shape (schema is legal, Liquid iteration is correct, no obvious typos in metafield keys).
Review phase, 15-30 minutes. I run theme-check locally, then pull the section up in a preview storefront against two real products. I rewrite anything the agent botched, usually the parts where the metafield contract is ambiguous or where a layout variant produces ugly rendered output.
Validation phase, 5-10 minutes. Theme-check clean, visual preview on two products, commit with a descriptive message. The commit-per-task discipline matters; it keeps the git history interpretable later, when I need to trace a regression back to a specific section.
Integration phase, 15-20 minutes. If the section needed new metafield definitions on the shop, I add them via the admin or via a scripted migration. If it needed app-facing data (for search indexing, for analytics), I wire that last. This is the part I don't delegate to agents because it touches shop state, not just files.
That rhythm is also what my agent skills encode. If you want the skills themselves, the Shopify-specific skills I lean on daily is the BOFU version of this post and describes the actual prompts.
“Any task that requires reading three files to write one file is a task for a human. Narrow agent scope to single-file transformations until you've built the test coverage to let it range wider.”
What I'd do differently from here
Two changes, both of which I'm already rolling out.
First, write validation scripts alongside the skills. For each kind of section (product hero, collection card, metafield-driven content block), I now write a small validation script that loads the section against a fixture product and checks for basic rendered correctness (image loads, title is present, CTA links aren't undefined). That's lighter than a full test suite but catches the class of silent failure I got burned on.
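The shape of those scripts is simple. This is a sketch: the checks and the failure messages are illustrative, and in practice the HTML comes from a preview render against a fixture product rather than a string:

```python
def validate_rendered_section(html: str) -> list[str]:
    """Return failure messages; an empty list means the render passed."""
    failures = []
    if "<img" not in html or 'src=""' in html:
        failures.append("image missing or has empty src")
    if "<h1" not in html and "<h2" not in html:
        failures.append("title heading missing")
    if 'href=""' in html or 'href="undefined"' in html:
        failures.append("CTA link empty or undefined")
    if "undefined" in html:
        failures.append("unrendered value leaked into markup")
    return failures
```

Crude string checks, deliberately: the goal is catching silent failures in seconds, not pixel-perfect assertions.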
Second, stop resisting Claude Code's file-write autonomy on small tasks. For the first six months I gate-kept every file write through a manual review step. That was overcautious for scaffolding work; the agent is reliable enough on schema JSON and thin Liquid templates that I'd rather review in git diff after the write than read through a tool-call preamble beforehand. The commit-per-task habit makes the review-after model safe.
If I were starting fresh today, I'd also put the Admin API rate-limit pattern in place on day one rather than retrofitting it after the first melt. The leaky-bucket backoff for Shopify's Admin API covers that and is still the main integration pattern I use now.
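The core of that pattern fits in a few lines. A minimal sketch, not the full implementation from that post: `do_request` stands in for a real HTTP call, and responses are modeled as dicts with `status` and `headers`:

```python
import time

def call_with_backoff(do_request, max_retries=5, sleep=time.sleep):
    """Retry throttled Admin API calls, honoring the Retry-After header."""
    for _ in range(max_retries):
        resp = do_request()
        if resp["status"] != 429:
            return resp
        # Shopify's Admin API sends Retry-After (in seconds) when throttled
        sleep(float(resp["headers"].get("Retry-After", 2.0)))
    raise RuntimeError("Admin API still throttled after retries")
```

The full version also paces outgoing calls against the call-limit header so the bucket rarely overflows in the first place.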
What the next 12 months will test
Three bets the loop is making that I can't fully justify yet.
Multi-file scope is still hard. I'm betting agent reliability on cross-file changes improves enough in 2026-2027 to let the loop expand past single-file scaffolding. If not, the human stays in the critical path for anything structural, and the throughput ceiling is lower than the optimistic projection.
Skill portability is real. The skills I've written are in Claude Code's format. If the agent tooling landscape fragments, I may have to rewrite them for whatever replaces or augments Claude Code. I haven't seen signal of that yet, but the bet is there.
Mid-market DTC demand holds. This loop is efficient for theme work on established DTC brands. It's overkill for a $20K startup store and underscoped for a $20M Hydrogen build. The mid-market is where the economics work; if that market softens, the loop shifts.
For the AI-assisted Shopify page generation pipeline that grounded a lot of this, see the CLI that turns prompts into live Shopify pages. If you want to run this loop yourself, the skills pack for DTC Shopify audits and builds is the first drop of the toolkit.
Do I need Claude Code specifically, or does any agent work?
The loop is tool-agnostic in principle. Claude Code is what I use because its skills format, tool-use model, and file-write behavior suit the flow. Cursor, Aider, and similar tools can run the same loop with prompt adjustments. The specific agent matters less than the discipline of narrow scope per task and commit-per-task review.
How much faster is this than writing Shopify sections by hand?
On scaffolding work (new sections with a clear brief), about 3-4x on first draft. On review-heavy work, closer to 1.5-2x because the human time compresses less. Aggregate across a full theme build: roughly 2x throughput without quality regression.
What kinds of Shopify tasks should I NOT hand to an agent?
Cross-section logic, migrations that touch live shop data, anything that requires reading three files to produce one, and any task where a silent failure has merchant-visible consequences. Those stay in human hands until you have validation coverage that makes the risk bounded.
How do you handle the context window problem on large themes?
Per-task context curation. I feed the agent only the files relevant to the task at hand, usually one section, its metafield contract, and maybe a sibling section for pattern reference. Trying to load the whole theme hits limits and degrades response quality.
Is this a solo developer thing or can a team run this loop?
Teams can run it, but the coordination overhead shifts. Solo, the loop is personal discipline; a team needs shared conventions for skills, branch strategy, and review norms so agents don't produce divergent styles across the codebase. What works implicitly for one developer requires explicit documentation for several.
How often does the agent produce something I just throw away?
About 10% of the time on scaffolding, climbing to 30-40% if the brief is vague or the task crosses files. I treat throwaway drafts as cheap; the cost of the agent run is minutes, so "throw it away and redo with a sharper brief" is faster than "salvage a broken first draft."
Sources and specifics
- The retrospective covers work from April 2025 through April 2026.
- Two full DTC Shopify theme builds and one AI-driven Shopify page generation CLI grounded the timing estimates.
- Shopify theme-check referenced is version 2.x on Online Store 2.0 themes.
- The rate-limit handling pattern is described in more detail in the dedicated Admin API post and uses a leaky-bucket algorithm honoring Retry-After headers.
- Claude Code skills referenced are the first drop of the Skills Pack product and include tracking audit, Shopify scaffolding, metafield migration, and admin glue skills.
