Claude Code ships with a powerful skill system that most developers never touch. A custom skill turns a one-off prompt into a repeatable, shareable tool that activates automatically when the right context appears. This guide walks through building one from scratch — from scoping the task to writing triggers, output templates, and iron laws that actually hold up under real use.
A Claude Code skill is a SKILL.md file — a plain Markdown document that teaches the agent how to handle a specific type of task. Think of it as a procedure manual for one job. When you ask Claude Code to review a pull request, generate a changelog, or audit a configuration file, a skill can define exactly how that task should be performed: what triggers it, what the output looks like, and what the agent must never do.
Skills sit alongside your project's CLAUDE.md file but serve a different purpose. Your CLAUDE.md defines who the agent is — its identity, permissions, and cross-cutting rules. A SKILL.md defines what the agent does for one particular task. This separation matters because skills are modular. You can write a code review skill once, test it thoroughly, and drop it into any project that needs structured code reviews. The skill travels with the file, not with the configuration.
If you're unfamiliar with the SKILL.md format itself, the What Is a SKILL.md File? guide covers the anatomy in detail. This article focuses on the building process — how to go from "I keep doing this task manually" to a working, tested skill.
Not every repeated task needs a skill. The investment pays off when three conditions are true simultaneously: the task recurs often enough to justify the setup, the output has a fixed structure you can template, and the rules for doing the task well stay stable from run to run.
Good candidates for custom skills: code review, PR description generation, test plan writing, migration audits, dependency analysis, API documentation, changelog generation, and security reviews. Bad candidates: one-off research questions, creative writing tasks with no fixed structure, or tasks where the format changes every time.
A literal template beats a prose description of the format. `### Performance Risks` followed by `- Risk: [description] / Severity: HIGH | MEDIUM | LOW` is far more reliable than "list performance risks with severity labels." The model copies templates; it interprets descriptions, and interpretation introduces variance.

Here's a full SKILL.md for generating pull request descriptions from a diff. This is a real skill pattern — one of the most commonly needed Claude Code extensions.
```markdown
# SKILL: PR Description Generator
DESCRIPTION: Generate a structured pull request description from a
git diff. Produces a summary, change breakdown, and test guidance.
ACTIVATE WHEN:
- User provides a git diff AND asks for a PR description,
  PR summary, or "write the PR"
- User runs /pr-describe or /describe-pr
DO NOT ACTIVATE WHEN:
- User asks to review or critique a PR (that's code review)
- User asks what a diff does (that's explanation)
- User asks to modify or fix the code in the diff
OUTPUT FORMAT:
## Summary
[2-3 sentences: what this PR does and why]
## Changes
- **[area/file]**: [what changed and why]
- **[area/file]**: [what changed and why]
## Testing
- [ ] [specific test step based on the changes]
- [ ] [specific test step based on the changes]
## Risk Assessment
**Risk level:** LOW | MEDIUM | HIGH
**Reason:** [one sentence explaining the risk level]
IRON LAWS:
- NEVER invent changes not present in the diff.
- NEVER write vague summaries like "various improvements."
- NEVER skip Risk Assessment — every PR gets a risk level.
- NEVER list files without explaining what changed in them.
- ALWAYS derive test steps from the actual changes, not generic checklists.
EXAMPLE INPUT:
diff showing a new rate limiter middleware added to an Express app
EXAMPLE OUTPUT:
## Summary
Adds request rate limiting to the API gateway. Requests exceeding
100/minute per IP receive a 429 response. Motivated by the traffic
spike on March 3 that caused cascading timeouts.
## Changes
- **middleware/rateLimiter.js**: New sliding-window rate limiter
  using Redis for counter storage. Configurable via env vars.
- **app.js**: Rate limiter mounted before route handlers.
- **config/defaults.js**: Added RATE_LIMIT_WINDOW and
  RATE_LIMIT_MAX with sensible defaults.
## Testing
- [ ] Send 101 requests in 60s from one IP — verify 429 on 101st
- [ ] Confirm rate limit resets after the window expires
- [ ] Verify Redis connection failure degrades gracefully (allow traffic)
## Risk Assessment
**Risk level:** MEDIUM
**Reason:** New middleware in the request path; Redis dependency
adds a failure mode that needs graceful degradation testing.
```
Notice the structure: tight description, paired trigger conditions, a template with real headers, iron laws from actual failure patterns, and one grounded example. The whole thing is under 450 tokens.
A skill isn't done when you write it. It's done when it handles ten different inputs correctly. Here's the testing workflow that catches the problems before they reach production.
Gather ten real examples that span the range of what the skill should handle. For the PR description skill above, that means: a one-file diff, a twenty-file diff, a diff with only test changes, a diff with a breaking change, a trivial typo fix, and a complex refactoring. Run each one and evaluate the output against the template.
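Evaluating ten outputs by eye gets tedious, and a small validator makes the check repeatable. Here's a minimal sketch in Python — the section names and vague-phrase list come from the skill above; the function name and everything else are illustrative assumptions, not part of any official tooling:

```python
import re

# Section headers required by the skill's OUTPUT FORMAT
REQUIRED_SECTIONS = ["## Summary", "## Changes", "## Testing", "## Risk Assessment"]
# Phrases the iron laws forbid (extend from failures you actually observe)
VAGUE_PHRASES = ["various improvements", "misc changes", "minor fixes"]

def check_pr_description(output: str) -> list[str]:
    """Return a list of template violations found in a generated description."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in output:
            problems.append(f"missing section: {section}")
    # Risk level must be one of the three allowed values
    if not re.search(r"\*\*Risk level:\*\*\s*(LOW|MEDIUM|HIGH)", output):
        problems.append("missing or malformed risk level")
    for phrase in VAGUE_PHRASES:
        if phrase in output.lower():
            problems.append(f"vague wording: {phrase!r}")
    return problems
```

Run each of the ten outputs through a check like this and a clean skill should produce an empty problem list every time; any recurring violation points at the section of the skill that needs tightening.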
Test five inputs that are adjacent to the skill's scope but should not trigger it. For the PR skill: paste a diff and ask "what does this do?" (explanation, not description). Paste a diff and ask "is this good?" (review, not description). If the skill fires on these, tighten the DO NOT ACTIVATE conditions.
For each iron law, construct an input specifically designed to trigger the failure it prevents. If the law says "NEVER invent changes not present in the diff," give it a minimal diff and check whether the output invents additional changes. If a law never triggers, it might be unnecessary — dead weight that dilutes the active constraints.
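The "never invent changes" law in particular lends itself to automation: compare the files the output names against the files the diff actually touches. A rough sketch, assuming the output follows the bolded `**file**:` convention from the template and the diff uses standard unified-diff headers:

```python
import re

def files_in_diff(diff: str) -> set[str]:
    """Collect file paths touched by a unified diff (from '+++ b/...' headers)."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", diff, flags=re.MULTILINE))

def invented_files(diff: str, description: str) -> set[str]:
    """Bolded file names in the description that never appear in the diff."""
    mentioned = set(re.findall(r"\*\*([^*]+)\*\*:", description))
    return {f for f in mentioned if f not in files_in_diff(diff)}
```

Feed it the minimal diff from your stress test: a non-empty result is a direct, mechanical violation of the iron law, which is far easier to act on than a subjective judgment about hallucination.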
The SKILL.md Linter scores your file on eight dimensions: trigger precision, NOT-condition presence, output format completeness, iron law specificity, example quality, scope clarity, synonym coverage, and token efficiency. It flags which sections are pulling the overall score down. A score above 7/10 on each dimension usually indicates a production-ready skill.
| Mistake | Why it breaks | Fix |
|---|---|---|
| No DO NOT ACTIVATE section | Skill fires on adjacent tasks | Name the 3 nearest tasks that look similar but aren't |
| Describing output format in prose | Model interprets rather than copies, output varies | Use a template with real headers and placeholder syntax |
| Generic iron laws ("be accurate") | Too vague to prevent specific failures | Derive laws from observed failures on real inputs |
| No example in the file | Model guesses at the expected format and tone | Add one complete input/output pair |
| Skill over 700 tokens | Key constraints get diluted in noise | Cut until quality stops degrading — usually 400-500 tokens |
| Using OR instead of AND in triggers | Skill fires on partial matches | Require artifact AND intent to both be present |
Building skills from scratch is straightforward once you've done it a few times, but the first two or three take longer than they should. Two free tools can compress the learning curve significantly.
The SKILL.md Generator takes a plain-text description of what you want the skill to do and produces a complete, structured SKILL.md file. Describe the task, the inputs, and the output shape you want — it generates trigger conditions, output template, iron laws, and an example. Think of it as a first draft that gets you 80% of the way there. You'll still need to test and refine, but you skip the blank-page problem entirely.
The SKILL.md Linter is the testing counterpart. Paste your finished (or in-progress) skill file and get a scored breakdown of what's working and what isn't. It checks for missing NOT conditions, vague iron laws, prose-described output formats, synonym gaps in triggers, and token bloat. The score gives you a concrete target: fix the lowest-scoring dimension, re-lint, repeat until everything is above threshold.
Together, they turn skill development from a 30-minute writing exercise into a 10-minute generate-lint-refine loop. Both tools are free, no account required.
Custom Claude Code skills are the difference between using an AI assistant and building one that works exactly the way your team needs. The investment is small — a single well-written skill file — and the payoff compounds every time the task runs. Start with the task you find yourself prompting for most often, build the skill, test it against ten inputs, and refine until it holds. That's the whole process.