AI Agents SKILL.md Research March 2026 · Andy

How to Design AI Agent Skills That Don't Fail

I tested 20 AI agent skills against real prompts, measuring completeness, quality, and token efficiency at every budget level. Here's what the data shows — and what it means for designing skills that actually work in production.

In this article
  1. What a SKILL.md file actually is
  2. The failure modes we actually saw
  3. Token saturation: the data
  4. What a high-scoring SKILL.md looks like
  5. A practical process
  6. The quick test

What a SKILL.md File Actually Is

A SKILL.md file is the instruction set for a specific agent capability. It tells Claude (or any LLM) how to handle a particular type of task: when to activate, what modes to run in, what outputs to produce, and what rules it must never break.

The difference between a good skill and a bad one isn't creativity — it's specificity. A skill that says "research things when the user asks" produces inconsistent results. A skill with precise trigger conditions, an exact output template, and iron laws derived from real failure modes produces the same quality output whether you run it at 8am or 3am on a Friday.

The Failure Modes We Actually Saw

Running 20 skills against structured test inputs, we found four consistent failure patterns:

Truncation at low token budgets
The most common failure. The skill starts well, then cuts off mid-output because it ran out of tokens. This isn't a quality problem — the reasoning was fine. It's a budget problem. Fix: test each skill at multiple token budgets to find the saturation point before deploying.

Vague trigger conditions
Skills with triggers like "when the user asks to research something" fire on the wrong prompts and miss the right ones. They get invoked for simple factual questions they weren't designed for, and ignored for complex synthesis tasks they were built to handle.

Missing iron laws
Skills without explicit constraints let the model fill in the blanks — sometimes correctly, often not, especially at the edges. A research skill without "NEVER state facts without citing sources" will eventually hallucinate a citation under pressure.

Generic output format
A skill that says "return a structured report" gets back whatever the model thinks "structured" means that session. A skill with an exact template — section headers, example content, character limits — gets that format every time.
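The truncation fix above (testing each skill at multiple token budgets) can be sketched as a small helper that locates the saturation point from per-budget quality scores. The scores and the epsilon threshold here are illustrative assumptions, not measured values:

```python
def find_saturation_point(scores, epsilon=0.02):
    """Return the smallest token budget beyond which quality stops improving.

    `scores` maps token budget -> quality score (0..1). A budget saturates
    when no larger budget improves quality by more than `epsilon`.
    """
    budgets = sorted(scores)
    for i, budget in enumerate(budgets):
        # If no larger budget beats this one by more than epsilon,
        # this budget is the saturation point.
        if all(scores[b] - scores[budget] <= epsilon for b in budgets[i + 1:]):
            return budget
    return budgets[-1]

# Illustrative: quality plateaus at 300 tokens for a debugging-style skill
scores = {100: 0.55, 200: 0.78, 300: 0.91, 500: 0.92, 800: 0.92}
print(find_saturation_point(scores))  # → 300
```

Running this once per skill, before deployment, is what turns "it truncates sometimes" into a known budget number.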

Token Saturation: The Data

Here's what we found testing token budgets across 12 skills. "Saturation point" is where quality plateaus — adding more tokens produces no measurable improvement.

| Skill | Saturation point | Model | Notes |
| --- | --- | --- | --- |
| SQL query | 200 tok | GLM-5 | Focused output — fast and cheap |
| Debugging | 300 tok | GLM-5 | Diagnosis + fix fits in 300 |
| Code review | 500 tok | GLM-5 | Needs room for structured findings |
| Bash scripting | 500 tok | GLM-5 | Simple scripts don't need more |
| Social media repurposer | 500 tok | GLM-5 | Below this: analysis truncates before output |
| Data analysis | 800 tok | GLM-5 | Needs explanation depth |
| Dialogue writer | 1,200 tok | GLM-5 | Voice + subtext needs space |
| World building | 3,000+ tok | Claude | GLM-5 hits structural ceiling |
| Scene writing | 3,000+ tok | Claude | Claude required |
| Plot structure | No ceiling | Claude | Complex reasoning needs full context |

This routing is where the 77% token cost reduction comes from: deterministic, output-bounded skills (debugging, SQL, bash) go to small budgets, while creative and reasoning-heavy skills go to larger budgets or premium models.

The key insight: saturation is a property of the skill's output, not the model. A debugging skill produces a focused diagnosis — that fits in 300 tokens and doesn't benefit from more. A world-building skill produces a rich fictional system — it needs thousands of tokens to reach full quality.
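That routing split can be sketched as a lookup over measured saturation points. The skill names, numbers, and model names below are taken from the table above; the 1,200-token cutoff is an illustrative assumption:

```python
# Saturation points per skill, in tokens; None = no ceiling observed.
SATURATION = {
    "sql_query": 200,
    "debugging": 300,
    "code_review": 500,
    "data_analysis": 800,
    "world_building": 3000,
    "plot_structure": None,
}

def route(skill, cutoff=1200):
    """Pick (model, token budget) from a skill's measured saturation point."""
    sat = SATURATION[skill]
    if sat is None:
        return ("Claude", None)   # no ceiling: premium model, full context
    if sat <= cutoff:
        return ("GLM-5", sat)     # output-bounded: cheap model, tight budget
    return ("Claude", sat)        # reasoning-heavy: premium model, large budget

print(route("debugging"))       # → ('GLM-5', 300)
print(route("world_building"))  # → ('Claude', 3000)
```

The design choice worth noting: the budget assigned to a cheap-model skill is its saturation point, not a round number, so no tokens are spent past the plateau.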

What a High-Scoring SKILL.md Looks Like

Scoring every skill against a quality rubric, the difference between 40/100 and 94/100 usually came down to three things:

1. Specific trigger conditions vs. vague ones

Low score:

trigger: research topics when asked

High score:

TRIGGER WHEN:
- User asks to research a specific topic with source citations required
- User asks to compare multiple sources or synthesize conflicting information
- User asks for a structured report with confidence scoring on claims
- User is working on a fact-checked document and needs verified sources

DO NOT TRIGGER WHEN:
- User asks a factual question answerable from memory (e.g. "what year was X founded")
- User asks to write or edit code
- User asks for creative writing or brainstorming

The second version tells the model exactly where the edges are. It fires on the right prompts and doesn't fire on the wrong ones. The first version leaves that judgment to the model — which gets it wrong about 30% of the time on edge cases.

2. An exact output template with a real example

Low score:

output_format: structured report with sections

High score:

## Summary
[3 sentences: what the research found, main conclusion, confidence level]

## Key Findings
- Finding 1 [Source: Author, Publication, Year — URL]
- Finding 2 [Source: ...]

## Confidence: High / Medium / Low
[Reasoning — sample size, source quality, consensus level]

## Sources
1. [Full citation with URL]
2. [...]

This template is reproduced verbatim in the output. No ambiguity about what "structured" means.
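Because the template is verbatim, conformance is mechanically checkable. A minimal sketch, assuming the section names from the template above (the regex and the per-finding source check are illustrative):

```python
import re

REQUIRED_SECTIONS = ["## Summary", "## Key Findings", "## Confidence:", "## Sources"]

def validate_output(text):
    """Return (missing template sections, findings lacking a source tag)."""
    missing = [s for s in REQUIRED_SECTIONS if s not in text]
    # Per the template, every bulleted finding must carry a [Source: ...] tag.
    findings = re.findall(r"^- .*$", text, flags=re.MULTILINE)
    unsourced = [f for f in findings if "[Source:" not in f]
    return missing, unsourced
```

A check like this can run after every skill invocation and fail fast instead of shipping a malformed report.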

3. Iron laws that address actual failure modes

Generic iron laws ("NEVER be inaccurate") don't constrain behavior — they're already implied by "be a good assistant." Specific iron laws target the ways this skill fails. For the research skill above, that looks like:

- NEVER state a fact without citing its source
- NEVER invent a citation — if no source was found, say so
- NEVER add sections beyond the output template

Each of these comes from a real failure mode observed during testing. They're not principles — they're guardrails.

A Practical Process

  1. Write the output template first. Before writing a single word of the skill file, produce three example outputs manually. This forces clarity about format, depth, and what "done" looks like.
  2. Write trigger conditions from examples, not abstractions. Take 10 real prompts the skill should handle, and 10 it shouldn't. Write trigger conditions that correctly classify all 20.
  3. Test at three token budgets. Run the skill at your expected budget, half that, and double that. Where does quality plateau? That's your saturation point.
  4. Derive iron laws from failure analysis. Run the skill 5 times. Note the 3 worst outputs. Turn each failure into an iron law.
  5. Add at least 2 complete worked examples. Not placeholders — real inputs with realistic expected outputs. These serve as your regression tests.

The Quick Test

If you have a SKILL.md and want to know if it's likely to work, ask three questions:

  1. Would the trigger conditions correctly classify both the prompts the skill should handle and the ones it shouldn't?
  2. Is the output format an exact, verbatim template rather than a description of one?
  3. Does each iron law trace back to a failure you actually observed?

Quality check — can you answer yes to all three?

If any answer is "no," that's where quality is leaking. The SKILL.md Linter checks these automatically — it flags missing sections, vague trigger conditions, and iron laws that read as generic rather than skill-specific.
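A minimal version of checks like those might look like the sketch below. This is not the SKILL.md Linter's implementation; the required headings, the generic-law phrases, and the "when asked" heuristic are illustrative assumptions:

```python
REQUIRED_HEADINGS = ("TRIGGER WHEN", "DO NOT TRIGGER WHEN")
GENERIC_LAWS = ("never be inaccurate", "always be helpful", "be a good assistant")

def lint_skill(text):
    """Flag the three quality leaks: missing sections, generic laws, vague triggers."""
    issues = []
    for heading in REQUIRED_HEADINGS:
        if heading not in text:
            issues.append(f"missing section: {heading}")
    lowered = text.lower()
    for law in GENERIC_LAWS:
        if law in lowered:
            issues.append(f"generic iron law: {law!r}")
    # A trigger block that hinges on "when asked" is almost always too vague.
    if "TRIGGER WHEN" in text and "when asked" in lowered:
        issues.append("vague trigger: 'when asked'")
    return issues
```

Run against the low-scoring examples from earlier in this article, a check like this flags them immediately; the high-scoring versions pass clean.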