AI Agents SKILL.md Research March 2026 · Andy

How to Design AI Agent Skills That Don't Fail

I tested 20 AI agent skills against real prompts, measuring completeness, quality, and token efficiency at every budget level. Here's what the data shows — and what it means for designing skills that actually work in production.

In this article

What a SKILL.md file actually is
The four failure modes we saw
Token saturation: the data
What a high-scoring SKILL.md looks like
A practical process
The quick test

What a SKILL.md File Actually Is

A SKILL.md file is the instruction set for a specific agent capability. It tells Claude (or any LLM) how to handle a particular type of task: when to activate, what modes to run in, what outputs to produce, and what rules it must never break.

The difference between a good skill and a bad one isn't creativity — it's specificity. A skill that says "research things when the user asks" produces inconsistent results. A skill with precise trigger conditions, an exact output template, and iron laws derived from real failure modes produces the same quality output whether you run it at 8am or 3am on a Friday.

The Failure Modes We Actually Saw

Running 20 skills against structured test inputs, we found four consistent failure patterns:

Truncation at low token budgets

The most common failure. The skill starts well, then cuts off mid-output because it ran out of tokens. This isn't a quality problem — the reasoning was fine. It's a budget problem. Fix: test each skill at multiple token budgets to find the saturation point before deploying.

Vague trigger conditions

Skills with triggers like "when the user asks to research something" fire on the wrong prompts and miss the right ones. They get invoked for simple factual questions they weren't designed for, and ignored for complex synthesis tasks they were built to handle.

Missing iron laws

Skills without explicit constraints let the model fill in the blanks — sometimes correctly, often not, especially at the edges. A research skill without "NEVER state facts without citing sources" will eventually hallucinate a citation under pressure.

Generic output format

A skill that says "return a structured report" gets back whatever the model thinks "structured" means that session. A skill with an exact template — section headers, example content, character limits — gets that format every time.

Token Saturation: The Data

Here's what we found testing token budgets across 12 skills; the table lists ten of them. "Saturation point" is the budget at which quality plateaus: adding more tokens produces no measurable improvement.

Skill	Saturation Point	Model	Notes
SQL query	200 tok	GLM-5	Focused output — fast and cheap
Debugging	300 tok	GLM-5	Diagnosis + fix fits in 300
Code review	500 tok	GLM-5	Needs room for structured findings
Bash scripting	500 tok	GLM-5	Simple scripts don't need more
Social media repurposer	500 tok	GLM-5	Below this: analysis truncates before output
Data analysis	800 tok	GLM-5	Needs explanation depth
Dialogue writer	1,200 tok	GLM-5	Voice + subtext needs space
World building	3,000+ tok	Claude	GLM-5 hits structural ceiling
Scene writing	3,000+ tok	Claude	Claude required
Plot structure	No ceiling	Claude	Complex reasoning needs full context

Routing by saturation point is what produces the 77% token cost reduction: deterministic, output-bounded skills (debugging, SQL, bash) go to small budgets, while creative and reasoning-heavy skills go to larger budgets or premium models.
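The routing idea can be sketched as a plain lookup table. This is a minimal illustration, not the setup used in the tests: `Route`, `ROUTES`, `DEFAULT_ROUTE`, and `route` are hypothetical names, the budgets are copied from the table above with "3,000+" entered as its lower bound, and the fallback for unknown skills is an assumption.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    model: str
    max_tokens: Optional[int]  # None means no fixed ceiling


# Budgets copied from the saturation table; "3,000+" is entered as 3000.
ROUTES = {
    "sql_query": Route("GLM-5", 200),
    "debugging": Route("GLM-5", 300),
    "code_review": Route("GLM-5", 500),
    "bash_scripting": Route("GLM-5", 500),
    "social_media_repurposer": Route("GLM-5", 500),
    "data_analysis": Route("GLM-5", 800),
    "dialogue_writer": Route("GLM-5", 1200),
    "world_building": Route("Claude", 3000),
    "scene_writing": Route("Claude", 3000),
    "plot_structure": Route("Claude", None),
}

# Assumption: a skill that has not been measured falls back to the larger
# model with no cap, so it is never truncated by accident.
DEFAULT_ROUTE = Route("Claude", None)


def route(skill_name: str) -> Route:
    """Return the model and token budget for a skill."""
    return ROUTES.get(skill_name, DEFAULT_ROUTE)
```

Keeping the table in one place means a new saturation measurement changes a single entry rather than every call site.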

The key insight: saturation is a property of the skill's output, not the model. A debugging skill produces a focused diagnosis — that fits in 300 tokens and doesn't benefit from more. A world-building skill produces a rich fictional system — it needs thousands of tokens to reach full quality.
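One way to turn "quality plateaus" into a number is sketched below. It assumes you already have a rubric score for each budget you tested; the function name `saturation_point`, the 2-point tolerance, and the sample scores are all illustrative and not taken from the test data.

```python
from typing import Mapping


def saturation_point(scores: Mapping[int, float], tolerance: float = 2.0) -> int:
    """Return the smallest tested budget from which every larger budget
    scores within `tolerance` points of the largest budget's score."""
    if not scores:
        raise ValueError("need at least one (budget, score) measurement")
    budgets = sorted(scores)
    floor = scores[budgets[-1]] - tolerance
    point = budgets[-1]
    # Walk down from the largest budget until quality drops below the plateau.
    for budget in reversed(budgets):
        if scores[budget] < floor:
            break
        point = budget
    return point


# Made-up rubric scores (0-100) for a debugging-style skill.
example = {100: 40, 200: 71, 300: 90, 500: 91, 800: 90}
print(saturation_point(example))  # 300
```

If the function returns the largest budget you tested, quality may not have plateaued yet, so it is worth testing a higher budget before trusting the result.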

What a High-Scoring SKILL.md Looks Like

When we scored every skill against a quality rubric, the difference between 40/100 and 94/100 usually came down to three things:

1. Specific trigger conditions vs. vague ones

Low score:

trigger: research topics when asked

High score:

TRIGGER WHEN:
- User asks to research a specific topic with source citations required
- User asks to compare multiple sources or synthesize conflicting information
- User asks for a structured report with confidence scoring on claims
- User is working on a fact-checked document and needs verified sources

DO NOT TRIGGER WHEN:
- User asks a factual question answerable from memory (e.g. "what year was X founded")
- User asks to write or edit code
- User asks for creative writing or brainstorming

The second version tells the model exactly where the edges are. It fires on the right prompts and doesn't fire on the wrong ones. The first version leaves that judgment to the model — which gets it wrong about 30% of the time on edge cases.

2. An exact output template with a real example

Low score:

output_format: structured report with sections

High score:

## Summary
[3 sentences: what the research found, main conclusion, confidence level]

## Key Findings
- Finding 1 [Source: Author, Publication, Year — URL]
- Finding 2 [Source: ...]

## Confidence: High / Medium / Low
[Reasoning — sample size, source quality, consensus level]

## Sources
1. [Full citation with URL]
2. [...]

The model reproduces this template's headers and layout verbatim in its output, so there is no ambiguity about what "structured" means.
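An exact template also makes the output checkable by machine. The sketch below is one possible check, assuming the four headers from the template above; `REQUIRED_HEADERS` and `check_template` are hypothetical names, and it only verifies that the headers are present and in order, not that the content under them is good.

```python
from typing import List

REQUIRED_HEADERS = ["## Summary", "## Key Findings", "## Confidence", "## Sources"]


def check_template(output: str) -> List[str]:
    """Return the required headers that are missing or out of order."""
    lines = [line.strip() for line in output.splitlines()]
    problems = []
    position = 0
    for header in REQUIRED_HEADERS:
        # Prefix match, because the confidence header carries its rating
        # on the same line ("## Confidence: High").
        for index in range(position, len(lines)):
            if lines[index].startswith(header):
                position = index + 1
                break
        else:
            problems.append(header)
    return problems
```

An empty list means every section appeared in the expected order. A truncated output typically shows up as a missing `## Sources` header.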

3. Iron laws that address actual failure modes

Generic iron laws ("NEVER be inaccurate") don't constrain behavior — they're already implied by "be a good assistant." Specific iron laws target the ways this skill fails:

NEVER state a statistic without a citation — if you can't cite it, flag it as unverified
NEVER present a single source as representing consensus — minimum 3 independent sources required for any factual claim
ALWAYS rate confidence per claim, not just overall — "High confidence on X, Low confidence on Y"
NEVER truncate the sources list — if running out of tokens, abbreviate the findings, not the citations
ALWAYS flag when the topic is too broad to research in one session, and ask to narrow it

Each of these comes from a real failure mode observed during testing. They're not principles — they're guardrails.
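Because these laws are specific, some of them can be spot-checked in code. The sketch below covers only the first law, and crudely: it assumes a "statistic" is any line containing a digit, that citations use the `[Source:` marker from the template, and that unverified claims carry the word "unverified". The name `uncited_statistics` is hypothetical.

```python
import re
from typing import List

# Assumption: any digit counts as a statistic. This over-flags (years,
# list numbers), which is acceptable for a check a human reviews afterwards.
HAS_NUMBER = re.compile(r"\d")


def uncited_statistics(findings: List[str]) -> List[str]:
    """Return finding lines that contain a number but neither a
    "[Source:" citation nor an explicit "unverified" flag."""
    flagged = []
    for line in findings:
        has_stat = HAS_NUMBER.search(line) is not None
        is_covered = "[Source:" in line or "unverified" in line.lower()
        if has_stat and not is_covered:
            flagged.append(line)
    return flagged
```

A check like this does not replace the iron law in the skill file. It gives you a cheap signal for whether the law is being followed across many runs.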

A Practical Process

1. Write the output template first. Before writing a single word of the skill file, produce three example outputs manually. This forces clarity about format, depth, and what "done" looks like.
2. Write trigger conditions from examples, not abstractions. Take 10 real prompts the skill should handle, and 10 it shouldn't. Write trigger conditions that correctly classify all 20.
3. Test at three token budgets. Run the skill at your expected budget, half that, and double that. Where does quality plateau? That's your saturation point.
4. Derive iron laws from failure analysis. Run the skill 5 times. Note the 3 worst outputs. Turn each failure into an iron law.
5. Add at least 2 complete worked examples. Not placeholders: real inputs with realistic expected outputs. These serve as your regression tests.
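The trigger-condition step lends itself to a small harness: label each example prompt, record whether the skill fired, and list the disagreements. In the sketch below, `misclassified` and `keyword_stub` are hypothetical names, the prompts are invented, and the keyword function is only a stand-in for the real decision, which would come from running each prompt through the agent.

```python
from typing import Callable, List, Tuple

LabeledPrompt = Tuple[str, bool]  # (prompt, should_trigger)


def misclassified(
    cases: List[LabeledPrompt],
    would_trigger: Callable[[str], bool],
) -> List[LabeledPrompt]:
    """Return the cases where the trigger decision disagrees with the label."""
    return [
        (prompt, expected)
        for prompt, expected in cases
        if would_trigger(prompt) != expected
    ]


def keyword_stub(prompt: str) -> bool:
    # Stand-in only: replace with a call that records whether the skill fired.
    text = prompt.lower()
    return "sources" in text or "cite" in text


cases = [
    ("Research battery recycling and cite your sources", True),
    ("Compare these three sources on remote work", True),
    ("Synthesize the conflicting reports on remote work", True),
    ("What year was X founded", False),
    ("Refactor this function", False),
]

for prompt, expected in misclassified(cases, keyword_stub):
    print(f"expected trigger={expected}: {prompt}")
```

Each printed line is a prompt the trigger conditions get wrong, which tells you which condition to tighten or add before rerunning the full set of 20.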

The Quick Test

If you have a SKILL.md and want to know if it's likely to work, ask three questions:

Quality check — can you answer yes to all three?

Can you read the trigger conditions and immediately know which of 10 random prompts it would activate on?
Can you read the output format and produce the expected output yourself without any other context?
Do the iron laws address things that could actually go wrong with this specific skill, rather than reading as generic goodness statements?

If any answer is "no," that's where quality is leaking. The SKILL.md Linter checks these automatically — it flags missing sections, vague trigger conditions, and iron laws that read as generic rather than skill-specific.