Claude Code Skills Tutorial · March 2026 · Andy

How to Build Custom Claude Code Skills (Step-by-Step)

Claude Code ships with a powerful skill system that most developers never touch. A custom skill turns a one-off prompt into a repeatable, shareable tool that activates automatically when the right context appears. This guide walks through building one from scratch — from scoping the task to writing triggers, output templates, and iron laws that actually hold up under real use.

Contents
  1. What Claude Code skills actually are
  2. When you should (and shouldn't) build a custom skill
  3. The five steps to building a skill
  4. Complete example: a PR description generator
  5. Testing and iterating your skill
  6. Common mistakes that break skills
  7. Tools to speed up the process

What Claude Code Skills Actually Are

A Claude Code skill is a SKILL.md file — a plain Markdown document that teaches the agent how to handle a specific type of task. Think of it as a procedure manual for one job. When you ask Claude Code to review a pull request, generate a changelog, or audit a configuration file, a skill can define exactly how that task should be performed: what triggers it, what the output looks like, and what the agent must never do.

Skills sit alongside your project's CLAUDE.md file but serve a different purpose. Your CLAUDE.md defines who the agent is — its identity, permissions, and cross-cutting rules. A SKILL.md defines what the agent does for one particular task. This separation matters because skills are modular. You can write a code review skill once, test it thoroughly, and drop it into any project that needs structured code reviews. The skill travels with the file, not with the configuration.

If you're unfamiliar with the SKILL.md format itself, the What Is a SKILL.md File? guide covers the anatomy in detail. This article focuses on the building process — how to go from "I keep doing this task manually" to a working, tested skill.

When You Should (and Shouldn't) Build a Custom Skill

Not every repeated task needs a skill. The investment pays off when three conditions are true simultaneously: the task recurs often enough to amortize the setup cost, the output follows a fixed structure, and that structure stays stable from run to run.

Good candidates for custom skills: code review, PR description generation, test plan writing, migration audits, dependency analysis, API documentation, changelog generation, and security reviews. Bad candidates: one-off research questions, creative writing tasks with no fixed structure, or tasks where the format changes every time.

The Five Steps to Building a Skill

Step 1: Write three example input/output pairs first
Before writing any SKILL.md syntax, produce three concrete examples of what the skill should do. Take a real input — an actual PR diff, a real config file, a genuine code snippet — and write out the exact output you want. Do this three times with different inputs. This forces you to discover what the output format actually needs to be and where the edge cases live. Most people skip this step, write the trigger conditions first, and spend twice as long debugging format inconsistencies later.
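Keeping those three pairs in a small fixture structure makes later testing easier. A minimal sketch — the names, diffs, and structure here are illustrative, not part of any SKILL.md format:

```python
# Hypothetical fixtures: three real inputs paired with what the output must contain.
# Collect these BEFORE writing any SKILL.md syntax.
EXAMPLE_PAIRS = [
    {
        "name": "one-file bugfix diff",
        "input": "diff --git a/auth.py b/auth.py\n- if token:\n+ if token and not expired:",
        "expected_sections": ["## Summary", "## Changes", "## Testing", "## Risk Assessment"],
    },
    {
        "name": "multi-file refactor diff",
        "input": "diff --git a/db.py b/db.py\n(large refactor diff here)",
        "expected_sections": ["## Summary", "## Changes", "## Testing", "## Risk Assessment"],
    },
    {
        "name": "test-only diff",
        "input": "diff --git a/test_auth.py b/test_auth.py\n(test additions here)",
        "expected_sections": ["## Summary", "## Changes", "## Testing", "## Risk Assessment"],
    },
]
```

If the three `expected_sections` lists end up differing, that is the signal you haven't settled on the output format yet.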
Step 2: Define the trigger conditions — both positive and negative
Every SKILL.md needs an ACTIVATE WHEN section that describes the conditions under which the skill should fire, and a DO NOT ACTIVATE WHEN section that names the cases where it should stay dormant. The negative conditions are more important than the positive ones. Without them, your skill will fire on adjacent tasks that look similar but need different handling. A code review skill that fires on "user pastes code" will also fire when someone pastes code to ask what it does — a completely different task.
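The logic behind good trigger conditions — artifact AND intent must both be present, and adjacent intents must be excluded — can be sketched as a predicate. Function and variable names here are illustrative, not part of the SKILL.md format:

```python
def should_activate(has_diff: bool, asks_for_description: bool,
                    asks_for_review: bool, asks_for_explanation: bool) -> bool:
    # Positive conditions: require the artifact AND the intent together.
    positive = has_diff and asks_for_description
    # Negative conditions: adjacent tasks that look similar but need different handling.
    negative = asks_for_review or asks_for_explanation
    return positive and not negative

# A pasted diff with a "what does this do?" question must NOT fire the skill.
print(should_activate(has_diff=True, asks_for_description=False,
                      asks_for_review=False, asks_for_explanation=True))  # False
```

Note what an OR in the positive clause would do: the skill would fire on every pasted diff, regardless of what the user asked for.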
Step 3: Build the output template with real headers and placeholders
Show the model what the output looks like — don't describe it. Use actual Markdown headers, label formats, and placeholder syntax. ### Performance Risks followed by - Risk: [description] / Severity: HIGH | MEDIUM | LOW is far more reliable than "list performance risks with severity labels." The model copies templates; it interprets descriptions, and interpretation introduces variance.
Step 4: Write iron laws from observed failures
Run your draft skill against ten varied inputs. Every wrong output becomes an iron law. "NEVER fabricate issues when the input is clean" comes from watching the model invent problems on a good PR. "NEVER skip the Severity label" comes from seeing it omit labels on low-severity items. Iron laws derived from actual failures are worth ten times the ones you write preemptively. Keep the list to 4-6 laws — more than that and the model starts ignoring the less prominent ones.
Step 5: Add one canonical example to the skill file
Include one complete input/output example directly in the SKILL.md. This anchors the model's understanding of the format more effectively than any amount of descriptive text. Choose an example that's representative but not trivial — it should demonstrate at least two output sections and show how severity labels, verdicts, or categories are applied in practice.
Token budget
Most effective skills land between 300 and 600 tokens. Below 300, there isn't enough structure to be useful. Above 600, you're usually repeating yourself or adding constraints that dilute the important ones. If your skill is over 700 tokens, ask whether any section can be cut without losing output quality; the answer is almost always yes.
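A quick way to check where a draft falls in that budget is a rough character-based estimate — roughly four characters per token for English prose; the exact count depends on the tokenizer, so treat this only as a sanity check:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose and Markdown.
    return max(1, len(text) // 4)

draft = "# SKILL: PR Description Generator\n" * 50  # stand-in for a real draft
tokens = estimate_tokens(draft)
if tokens > 700:
    print(f"~{tokens} tokens: look for sections to cut")
elif tokens < 300:
    print(f"~{tokens} tokens: probably not enough structure yet")
else:
    print(f"~{tokens} tokens: inside the effective range")
```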

Complete Example: A PR Description Generator

Here's a full SKILL.md for generating pull request descriptions from a diff. This is a real skill pattern — one of the most commonly needed Claude Code extensions.

# SKILL: PR Description Generator

DESCRIPTION: Generate a structured pull request description from a
git diff. Produces a summary, change breakdown, and test guidance.

ACTIVATE WHEN:
  - User provides a git diff AND asks for a PR description,
    PR summary, or "write the PR"
  - User runs /pr-describe or /describe-pr

DO NOT ACTIVATE WHEN:
  - User asks to review or critique a PR (that's code review)
  - User asks what a diff does (that's explanation)
  - User asks to modify or fix the code in the diff

OUTPUT FORMAT:
## Summary
[2-3 sentences: what this PR does and why]

## Changes
- **[area/file]**: [what changed and why]
- **[area/file]**: [what changed and why]

## Testing
- [ ] [specific test step based on the changes]
- [ ] [specific test step based on the changes]

## Risk Assessment
**Risk level:** LOW | MEDIUM | HIGH
**Reason:** [one sentence explaining the risk level]

IRON LAWS:
- NEVER invent changes not present in the diff.
- NEVER write vague summaries like "various improvements."
- NEVER skip Risk Assessment — every PR gets a risk level.
- NEVER list files without explaining what changed in them.
- ALWAYS derive test steps from the actual changes, not generic checklists.

EXAMPLE INPUT:
  diff showing a new rate limiter middleware added to an Express app

EXAMPLE OUTPUT:
  ## Summary
  Adds request rate limiting to the API gateway. Requests exceeding
  100/minute per IP receive a 429 response. Motivated by the traffic
  spike on March 3 that caused cascading timeouts.

  ## Changes
  - **middleware/rateLimiter.js**: New sliding-window rate limiter
    using Redis for counter storage. Configurable via env vars.
  - **app.js**: Rate limiter mounted before route handlers.
  - **config/defaults.js**: Added RATE_LIMIT_WINDOW and
    RATE_LIMIT_MAX with sensible defaults.

  ## Testing
  - [ ] Send 101 requests in 60s from one IP — verify 429 on 101st
  - [ ] Confirm rate limit resets after the window expires
  - [ ] Verify Redis connection failure degrades gracefully (allow traffic)

  ## Risk Assessment
  **Risk level:** MEDIUM
  **Reason:** New middleware in the request path; Redis dependency
  adds a failure mode that needs graceful degradation testing.

Notice the structure: tight description, paired trigger conditions, a template with real headers, iron laws from actual failure patterns, and one grounded example. The whole thing is under 450 tokens.

Testing and Iterating Your Skill

A skill isn't done when you write it. It's done when it handles ten different inputs correctly. Here's the testing workflow that catches the problems before they reach production.

Run ten varied inputs

Gather ten real examples that span the range of what the skill should handle. For the PR description skill above, that means: a one-file diff, a twenty-file diff, a diff with only test changes, a diff with a breaking change, a trivial typo fix, and a complex refactoring. Run each one and evaluate the output against the template.
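Evaluating each output against the template can be semi-automated. A minimal checker for the PR description skill above — the section list and risk-level pattern mirror its OUTPUT FORMAT, but the helper itself is a sketch, not part of Claude Code:

```python
import re

REQUIRED_SECTIONS = ("## Summary", "## Changes", "## Testing", "## Risk Assessment")

def template_violations(output: str) -> list[str]:
    """Return every way one generated PR description deviates from the template."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in output]
    if not re.search(r"\*\*Risk level:\*\* (LOW|MEDIUM|HIGH)\b", output):
        problems.append("missing or malformed risk level")
    return problems

sample = "## Summary\nAdds rate limiting.\n\n## Changes\n- **app.js**: mounted limiter\n"
print(template_violations(sample))
# Flags the missing Testing and Risk Assessment sections and the absent risk level.
```

Run this over all ten outputs and the failures cluster quickly: a section the model keeps dropping is a template problem; a label it keeps mangling is an iron-law candidate.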

Check trigger boundaries

Test five inputs that are adjacent to the skill's scope but should not trigger it. For the PR skill: paste a diff and ask "what does this do?" (explanation, not description). Paste a diff and ask "is this good?" (review, not description). If the skill fires on these, tighten the DO NOT ACTIVATE conditions.

Validate iron laws

For each iron law, construct an input specifically designed to trigger the failure it prevents. If the law says "NEVER invent changes not present in the diff," give it a minimal diff and check whether the output invents additional changes. If a law never triggers, it might be unnecessary — dead weight that dilutes the active constraints.
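These targeted checks can be scripted too. For the "NEVER write vague summaries" law, a sketch — only "various improvements" comes from the skill above; the other phrases are illustrative additions:

```python
VAGUE_PHRASES = ("various improvements", "misc changes", "general cleanup")

def breaks_vague_summary_law(output: str) -> bool:
    # True when the summary leans on filler instead of naming concrete changes.
    text = output.lower()
    return any(phrase in text for phrase in VAGUE_PHRASES)

print(breaks_vague_summary_law("## Summary\nVarious improvements to the API layer."))  # True
```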

Use the linter

The SKILL.md Linter scores your file on eight dimensions: trigger precision, NOT-condition presence, output format completeness, iron law specificity, example quality, scope clarity, synonym coverage, and token efficiency. It flags which sections are pulling the overall score down. A score above 7/10 on each dimension usually indicates a production-ready skill.
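The fix-lowest-dimension loop is easy to drive from a score breakdown. A sketch, assuming the eight dimensions come back as a name-to-score mapping — the report shape and the example scores here are hypothetical:

```python
scores = {  # hypothetical linter report for one skill file
    "trigger_precision": 8, "not_condition_presence": 6,
    "output_format_completeness": 9, "iron_law_specificity": 7,
    "example_quality": 8, "scope_clarity": 7,
    "synonym_coverage": 5, "token_efficiency": 8,
}

def next_fix(scores: dict[str, int]) -> str:
    # Fix the lowest-scoring dimension first, re-lint, repeat.
    return min(scores, key=scores.get)

def production_ready(scores: dict[str, int], threshold: int = 7) -> bool:
    # Every dimension must clear the threshold, not just the average.
    return all(v >= threshold for v in scores.values())

print(next_fix(scores))           # synonym_coverage
print(production_ready(scores))   # False
```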

Common Mistakes That Break Skills

| Mistake | Why it breaks | Fix |
| --- | --- | --- |
| No DO NOT ACTIVATE section | Skill fires on adjacent tasks | Name the 3 nearest tasks that look similar but aren't |
| Describing output format in prose | Model interprets rather than copies, so output varies | Use a template with real headers and placeholder syntax |
| Generic iron laws ("be accurate") | Too vague to prevent specific failures | Derive laws from observed failures on real inputs |
| No example in the file | Model has nothing to calibrate its format against | Add one complete input/output pair |
| Skill over 700 tokens | Key constraints get diluted in noise | Cut until quality stops degrading — usually 400-500 tokens |
| Using OR instead of AND in triggers | Skill fires on partial matches | Require artifact AND intent to both be present |

Tools to Speed Up the Process

Building skills from scratch is straightforward once you've done it a few times, but the first two or three take longer than they should. Two free tools can compress the learning curve significantly.

The SKILL.md Generator takes a plain-text description of what you want the skill to do and produces a complete, structured SKILL.md file. Describe the task, the inputs, and the output shape you want — it generates trigger conditions, output template, iron laws, and an example. Think of it as a first draft that gets you 80% of the way there. You'll still need to test and refine, but you skip the blank-page problem entirely.

The SKILL.md Linter is the testing counterpart. Paste your finished (or in-progress) skill file and get a scored breakdown of what's working and what isn't. It checks for missing NOT conditions, vague iron laws, prose-described output formats, synonym gaps in triggers, and token bloat. The score gives you a concrete target: fix the lowest-scoring dimension, re-lint, repeat until everything is above threshold.

Together, they turn skill development from a 30-minute writing exercise into a 10-minute generate-lint-refine loop. Both tools are free, no account required.


Custom Claude Code skills are the difference between using an AI assistant and building one that works exactly the way your team needs. The investment is small — a single well-written skill file — and the payoff compounds every time the task runs. Start with the task you find yourself prompting for most often, build the skill, test it against ten inputs, and refine until it holds. That's the whole process.