
2026-04-23 / 13 MIN READ

Writing Claude Code skills that hold up in production

A tutorial on writing Claude Code skills that hold up in production: the file structure, the trigger conditions, and the review loop that keeps them honest.

Most "agent skill" tutorials stop at hello world. This one picks up where those end. Over the last twelve months I have written and shipped roughly forty Claude Code skills into production use, across my own practice and for an NDA engagement where skills orchestrate a C-suite decision flow. The structure that holds up in production is not the structure the hello-world tutorials teach. This is the production version. By the end, you will have a shippable skill with a trigger that fires reliably, a reference layout that scales past one skill, and a review loop that keeps the skill honest as models change.

[Sequence diagram: one skill invocation in eight messages across five lanes (User, Claude Code, SKILL.md, references/, Tools): 1. prompt, 2. match description, 3. load body, 4. load scoring.md, 5. return context, 6. run checks, 7. results, 8. JSON report written to ./]

Eight messages, one trigger match, five lanes; the SKILL.md stays short because references load on demand.

Prerequisites

Claude Code installed locally. If you are on the paid tier, the skill loader is built in. If you are on a free tier, you can still write skills, but you will not have the shared skill directory that the paid product gives you.

A real use case. Do not write a skill without one. Every skill I have cut was written for a hypothetical workflow I never actually ran. The ones that survive are the ones that solve a concrete task I repeat weekly.

A few reference documents the skill will pull in. These can be existing markdown files, decision frameworks, or code patterns. You do not need to have them written before you start, but you need a concrete idea of what a production invocation will read.

Basic familiarity with Markdown and YAML frontmatter. If you have shipped a Next.js blog or a static site, you already have this.

Step 1: pick a trigger condition that matches a real user prompt

Skill descriptions are not taglines. They are trigger conditions. When a user types a prompt, Claude Code scans the descriptions of available skills and decides which to load. A vague description fires on nothing or on everything, which is the same outcome: the skill does not help.

A bad description reads like marketing copy.

---
name: capi-helper
description: A helpful tool for all your CAPI needs.
---

That description will fire occasionally and miss most of the time, because "helpful tool" describes half of Claude Code.

A good description states the exact phrases or situations that should trigger it.

---
name: capi-leak-scan
description: Use when the user says "/capi-scan" or asks to audit a Shopify store's Meta Conversions API setup. Runs 14 checks across tracking, consent, and attribution. Outputs a structured JSON report.
---

The specificity in the second version does two things. It tells Claude exactly when to fire, and it tells Claude what the output shape should be. Both matter because the trigger and the output are two different failure modes.

Open a new file at .claude/skills/<your-skill-name>/SKILL.md. Write the YAML frontmatter first. Read the description aloud. Ask yourself: "What user prompt would trigger this? Is that prompt a real prompt I or a user would actually type?" If the answer is "probably not," rewrite the description.

Step 2: write the minimal SKILL.md body

The SKILL.md body is the minimum instruction set Claude needs to act on the trigger. It is not a reference manual, not a README, not a marketing page. Keep it under 200 lines. Larger bodies degrade trigger specificity because Claude has to hold more irrelevant context in working memory on every invocation.

A working minimal body has four sections.

The overview section. Two or three sentences explaining what this skill does, in operational terms. "Runs the CAPI Leak Scan diagnostic" beats "empowers your CAPI workflow."

The inputs section. What the user must provide. If the skill needs a store URL, a file path, or a specific flag, say so explicitly. If the skill can infer inputs from context, say what context it reads.

The action section. What the skill does when invoked. List the explicit tool calls, in order, with brief rationale for each. Keep this tight; if the skill needs ten steps, consider whether it should be two skills.

The output section. What the skill returns. Format, shape, where it writes files if it writes any, what naming convention it uses.

Here is the minimal body for a hypothetical audit skill:

## Overview

Runs a 14-check CAPI audit against a Shopify store URL. Produces a JSON
report with per-check status and remediation notes.

## Inputs

Store URL: required. Passed as first positional argument or inferred
from the most recent clipboard entry if not given.

## Action

1. Fetch the store homepage and extract theme metadata.
2. Run each of 14 checks from `references/checks/*.md` in parallel.
3. Score the results and generate a prioritized remediation list.
4. Write the full report to `./capi-scan-<date>.json`.

## Output

JSON report at the path above, plus a short summary printed to stdout
with the overall health score and the top 3 fixes to prioritize.

That is the entire body. No padding. No voice register. No preambles.
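
A cheap way to enforce the 200-line ceiling from this step is a line-count sweep over the skills directory. This is a sketch, not part of Claude Code itself; the `.claude/skills` path matches the layout used in this tutorial, and the `-f` guard makes the loop a no-op when no skills exist yet:

```shell
# Flag any SKILL.md body over the 200-line ceiling.
for skill in .claude/skills/*/SKILL.md; do
  [ -f "$skill" ] || continue
  lines=$(wc -l < "$skill")
  if [ "$lines" -gt 200 ]; then
    echo "$skill: $lines lines, consider splitting"
  fi
done
```

Run it from the project root before committing a skill; anything it prints is a candidate for the reference-file split in the next step.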

Step 3: split references into separate files

The top-level SKILL.md stays short because the heavy material lives in references/*.md. The skill body tells Claude when to load each reference file, and Claude only loads them when the conditions match.

Create a subdirectory next to SKILL.md:

.claude/skills/capi-leak-scan/
  SKILL.md
  references/
    checks/
      01-capi-firing.md
      02-dedup-config.md
      03-match-quality.md
      ... (14 total)
    scoring.md
    remediation.md
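
If you scaffold this layout often, a few shell lines do it. The skill and file names below are the ones from this example; swap in your own:

```shell
# Scaffold the skill directory layout shown above.
base=.claude/skills/capi-leak-scan
mkdir -p "$base/references/checks"
touch "$base/SKILL.md"
touch "$base/references/scoring.md" "$base/references/remediation.md"
# Check files get created as you write them, e.g.:
touch "$base/references/checks/01-capi-firing.md"
```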

In the SKILL.md action block, direct Claude to read specific references at specific points:

## Action

1. Fetch store homepage and read `references/scoring.md` to understand
   the check-weighting.
2. For each check file in `references/checks/`, run the check against
   the store and record the result.
3. Compile results and read `references/remediation.md` for the
   prioritized fix ordering.

This is the scaling pattern. The SKILL.md does not grow as the skill handles more cases. The references grow. Claude loads only what is relevant to the current invocation, which keeps working memory focused and reduces the trigger-drift problem.

I run this structure across most of the forty-ish skills I have in production. Several belong to the agent council I shipped for an NDA engagement: five persona skills, each with roughly 400 lines of reference context and a SKILL.md under 150 lines. The references are what make the personas feel real; the SKILL.md is what makes them fire reliably.

Step 4: write decision-gating, not prescriptive steps

Skills that read like fifteen sequential steps tend to fail mid-sequence. Every step is an opportunity for the model to drift, misinterpret, or skip. The fix is not more detailed steps. It is structural gating.

A bad pattern, prescriptive:

1. Run test A.
2. Run test B.
3. Run test C.
4. Run test D.
5. If any failed, print errors. Otherwise, print success.

A better pattern, gated:

## Action

Read the skill state.

If the store URL has not been provided:
  - Ask the user for the store URL.
  - Stop.

If the URL is provided and the store is unreachable:
  - Return the error to the user with a suggested fix.
  - Stop.

If the URL is reachable:
  - Run the 14 checks in `references/checks/`.
  - Proceed to scoring.

The gated version is easier for the model to follow because each branch has a single clear exit. It is also easier to debug when something goes wrong, because the failure mode maps to a specific gate, not a step number.


The gating pattern is also what lets you compose skills. A high-level skill can gate on "if input is a URL, call skill A; if input is a file, call skill B" and the sub-skills stay simple because each one only handles the case it was designed for.
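
A parent skill's action block can follow the same gated shape. The sub-skill names here (`capi-leak-scan`, `capi-export-audit`) are illustrative:

```markdown
## Action

If the input is a store URL:
  - Invoke the `capi-leak-scan` skill with the URL.
  - Stop.

If the input is a local export file:
  - Invoke the `capi-export-audit` skill with the file path.
  - Stop.

Otherwise:
  - Ask the user for a store URL or an export file, then stop.
```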

Step 5: test the trigger against real prompts

Before you commit the skill, run it against three or four plausible user prompts and confirm it triggers.

Open Claude Code in the directory where the skill lives. Type prompts like:

/capi-scan example-store.myshopify.com
Can you audit the CAPI setup on example-store.myshopify.com?
Run a CAPI leak scan on example-store.myshopify.com.

The skill should fire on all three. If it only fires on the slash-command version, the description is too narrow. If it fires on the first two but not the slash command, you have a different bug (slash commands require the skill name in the description to match).

Then run three or four adversarial prompts to confirm the skill does not fire when it should not:

Tell me about Meta Conversions API in general.
What does CAPI stand for?
Help me with my Shopify store.

The skill should not fire on those. The first is a conceptual question, the second is definitional, the third is too broad. If the skill fires on them, the description is too vague.

Check in a test log at references/triggers.md that records the expected-fire and expected-miss prompts. Future-you, reviewing the skill in six months, will thank you for this. I did not do this for the first year of skill authoring and paid for it in Q1 2026 when a model update changed trigger behavior on several skills and I had no baseline to diff against.
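
A minimal `references/triggers.md`, using the prompts from this step, can be as simple as two lists:

```markdown
# Trigger baseline: capi-leak-scan

## Expected fire
- /capi-scan example-store.myshopify.com
- Can you audit the CAPI setup on example-store.myshopify.com?
- Run a CAPI leak scan on example-store.myshopify.com.

## Expected miss
- Tell me about Meta Conversions API in general.
- What does CAPI stand for?
- Help me with my Shopify store.
```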

Step 6: version and review quarterly

Skills drift. The underlying models change. Tooling evolves. A skill that worked perfectly in Q1 may fire inconsistently by Q4 because Claude's behavior on ambiguous prompts has shifted.

Set a quarterly review cadence. Every twelve weeks, for each skill in your directory, do the same five checks:

  • Does the trigger still fire on the expected prompts?
  • Does it still miss on the adversarial prompts?
  • Do the references still load at the right points?
  • Does the output still match the documented shape?
  • Has the skill's use case changed (am I still running this workflow)?

The fifth question is the most important. Skills that do not get used should be retired. A cluttered skills directory degrades every other skill's trigger reliability because there is more surface area for the loader to disambiguate.
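
If you log the current quarter's trigger results in the same format as the checked-in baseline, the first two checks reduce to a diff. The `/tmp/triggers-current.md` path is an assumption for illustration:

```shell
# Compare this quarter's trigger results against the checked-in baseline.
baseline=.claude/skills/capi-leak-scan/references/triggers.md
current=/tmp/triggers-current.md
if diff -u "$baseline" "$current" > /dev/null 2>&1; then
  echo "trigger behavior unchanged"
else
  echo "trigger drift detected, review before shipping"
fi
```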

This review cadence mirrors the quarterly rhythm I run across the broader practice. Skills are leverage. Leverage that is not being used is drag.

Common mistakes

Using the description as a marketing tagline. "Unlock your CAPI potential" is a trigger that fires on nothing useful. Rewrite descriptions as literal user-prompt conditions.

Writing the SKILL.md as a generic helper. "A helpful assistant for X" collapses to a skill that never fires confidently. Write the skill body as a specific tool for a specific moment.

Embedding reference content directly in SKILL.md. If your skill body is 1,200 lines, you have three or four references masquerading as one file. Split them out. The SKILL.md should direct Claude to load references; it should not be them.

No test prompts checked in. The skill that silently started misfiring two months ago is the one that will waste your time debugging. Check in test prompts. Diff them quarterly.

Writing the skill before you have the use case. I wrote about a dozen skills in my first six months that solved hypothetical problems. None of them survived. The ones that survived were written against a real task I had done manually three or four times and knew I would do again.

What to try next

If you want production-grade skill templates you can adapt rather than author from scratch, the Claude Code Skills Pack packages the patterns I use in my own practice, including the trigger-condition library and the reference-file scaffolding described above.

If you want the multi-skill orchestration pattern (one parent skill calling a family of sub-skills), start with the agent council pattern I shipped for an NDA client. It is the canonical example of a skill family that works cohesively.

If you want the broader tooling context that makes this investment pay off, see the tooling stack these skills run on. The skills are only as useful as the environment they run in.

Frequently asked questions

How many skills is too many?

For a solo operator practice, I run about 40 in active use. Past that, the trigger disambiguation gets harder. The fix is not more skills; it is better-scoped existing ones. When I catch myself writing a new skill that could have been a reference added to an existing skill, I consolidate. Twenty solid skills beat fifty mediocre ones.

Should the skill body include example invocations?

One or two, at most, and only if they clarify behavior that the description could not. Most of the time the description carries enough trigger signal that examples are padding. The references/triggers.md test log is a better place for invocation examples because it stays out of the skill body's context on every fire.

What if my skill needs to make multiple tool calls in a specific order?

Use the gating pattern. Each tool call lives inside a conditional branch, with the exit criteria spelled out. The model handles ordered tool calls reliably when each one has a clear predicate; it drifts when the sequence is implicit. Write the predicate, not the sequence.

How do I know when to split one skill into two?

Two signals. The first: the SKILL.md body exceeds 300 lines. That usually means two use cases are collapsed into one file. The second: the trigger description has to use "or" to cover both cases. If the description says "Use when the user asks A or B," they are probably two skills. Split them, write more-specific descriptions for each, and let the loader disambiguate.
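
The second signal can be surfaced mechanically. This sketch assumes the one-line `description:` frontmatter style shown earlier, and it is a coarse heuristic: "or" between phrasings of one task is fine, so treat the output as candidates for review, not a verdict:

```shell
# List skills whose trigger description hedges with "or" (split candidates).
grep -H '^description:.* or ' .claude/skills/*/SKILL.md 2>/dev/null || true
```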

What models work best for production skills?

Claude 3.5 Sonnet and newer handle skill instructions reliably. The Opus tier is stronger for skill bodies that involve open-ended reasoning; Sonnet is fine for most production skills and is cheaper per invocation. I have not successfully used other provider models for complex skill work because the file-aware workflow Claude Code provides is specific to this product.

Can I share skills with a team?

Yes, via the Claude Code shared skill directory on paid tiers. Each skill becomes a directory that other team members can load into their own Claude Code. Voice-calibrated skills (persona drafting, brand voice) should stay client-scoped because they contain proprietary context. Infrastructure skills (auditing, scaffolding, dev loops) are safer to share.

Sources and specifics

  • Patterns grounded in approximately 40 Claude Code skills shipped to production across my own practice, including the agent council built for an NDA client in Q1 2026.
  • The 200-line SKILL.md ceiling is an operational rule of thumb, not a hard limit; past that the trigger reliability degrades measurably on ambiguous prompts.
  • The reference-file split pattern draws from the internal structure used across the bzk-article, bzk-business, and bzk-marketing skill families on this practice.
  • Quarterly review cadence matches the broader weekly/quarterly review rhythm described across my solo operating work.
  • Code snippets and YAML frontmatter shown are representative of Claude Code as of April 2026; the exact API may evolve as Anthropic iterates.

// related

Claude Code Skills Pack

If you want to go deeper on agentic builds, this pack covers the patterns I use every day. File ownership, parallel agents, tool contracts.

View the pack →