We tested 20 AI agent skills against real prompts, measuring completeness, quality, and token efficiency at each token budget. Here's what the data shows, and what it means for designing skills that actually work in production.
A SKILL.md file is the instruction set for a specific agent capability. It tells Claude (or any LLM) how to handle a particular type of task: when to activate, what modes to run in, what outputs to produce, and what rules it must never break.
The difference between a good skill and a bad one isn't creativity; it's specificity. A skill that says "research things when the user asks" produces inconsistent results. A skill with precise trigger conditions, an exact output template, and iron laws (the rules it must never break) derived from real failure modes produces the same quality output whether you run it at 8am or 3am on a Friday.
Running 20 skills against structured test inputs, we found four consistent failure patterns:
Here's what we found testing token budgets across 12 skills. "Saturation point" is where quality plateaus — adding more tokens produces no measurable improvement.
| Skill | Saturation Point | Model | Notes |
|---|---|---|---|
| SQL query | 200 tok | GLM-5 | Focused output — fast and cheap |
| Debugging | 300 tok | GLM-5 | Diagnosis + fix fits in 300 |
| Code review | 500 tok | GLM-5 | Needs room for structured findings |
| Bash scripting | 500 tok | GLM-5 | Simple scripts don't need more |
| Social media repurposer | 500 tok | GLM-5 | Below this: analysis truncates before output |
| Data analysis | 800 tok | GLM-5 | Needs explanation depth |
| Dialogue writer | 1,200 tok | GLM-5 | Voice + subtext needs space |
| World building | 3,000+ tok | Claude | GLM-5 hits structural ceiling |
| Scene writing | 3,000+ tok | Claude | Claude required |
| Plot structure | No ceiling | Claude | Complex reasoning needs full context |
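One way to locate a saturation point from test runs is sketched below. It assumes you already have one quality score per token budget for a skill; the function name, the `tolerance` parameter, and the example scores are illustrative and are not taken from the measurements above.

```python
from typing import Sequence


def saturation_point(
    budgets: Sequence[int],
    scores: Sequence[float],
    tolerance: float = 1.0,
) -> int:
    """Return the smallest budget from which every larger budget stays
    within `tolerance` points of the best score observed."""
    if not budgets or len(budgets) != len(scores):
        raise ValueError("budgets and scores must be non-empty and equal length")

    pairs = sorted(zip(budgets, scores))
    threshold = max(score for _, score in pairs) - tolerance

    # Walk down from the largest budget while quality stays on the plateau.
    point = pairs[-1][0]
    for budget, score in reversed(pairs):
        if score < threshold:
            break
        point = budget
    return point


# Made-up scores for illustration only.
example_budgets = [100, 200, 300, 500, 800]
example_scores = [40.0, 70.0, 88.0, 89.0, 89.0]
print(saturation_point(example_budgets, example_scores))  # 300
```

The tolerance matters: with noisy rubric scores, a tolerance of zero pushes the reported saturation point toward the largest budget tested.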
The 77% token cost reduction comes from routing deterministic, output-bounded skills (debugging, SQL, bash) to small budgets while routing creative and reasoning-heavy skills to larger budgets or premium models.
The key insight: saturation is mostly a property of the skill's output, not the model. A debugging skill produces a focused diagnosis that fits in 300 tokens and doesn't benefit from more. A world-building skill produces a rich fictional system and needs thousands of tokens to reach full quality.
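The routing described here can be written as a plain lookup table. The sketch below copies budgets and models from the table; the skill keys, the `Route` type, and the fallback for unknown skills are assumptions made for illustration, and "3,000+" is treated as a 3,000-token floor.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Route:
    model: str
    max_tokens: Optional[int]  # None means no cap


# Values come from the saturation table; the key names are hypothetical.
ROUTES: Dict[str, Route] = {
    "sql-query": Route("GLM-5", 200),
    "debugging": Route("GLM-5", 300),
    "code-review": Route("GLM-5", 500),
    "bash-scripting": Route("GLM-5", 500),
    "social-media-repurposer": Route("GLM-5", 500),
    "data-analysis": Route("GLM-5", 800),
    "dialogue-writer": Route("GLM-5", 1200),
    "world-building": Route("Claude", 3000),  # table says 3,000+
    "scene-writing": Route("Claude", 3000),   # table says 3,000+
    "plot-structure": Route("Claude", None),  # table says no ceiling
}

# Assumption: unknown skills get the most permissive route.
DEFAULT_ROUTE = Route("Claude", None)


def route(skill: str) -> Route:
    """Pick a model and token budget for a skill."""
    return ROUTES.get(skill, DEFAULT_ROUTE)


print(route("debugging"))  # Route(model='GLM-5', max_tokens=300)
```

Defaulting unknown skills to the uncapped route trades cost for safety; a stricter setup might raise an error instead so that every new skill gets measured before it ships.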
Scoring every skill against a quality rubric, the difference between 40/100 and 94/100 usually came down to three things:
Low score:
trigger: research topics when asked
High score:
TRIGGER WHEN:
- User asks to research a specific topic with source citations required
- User asks to compare multiple sources or synthesize conflicting information
- User asks for a structured report with confidence scoring on claims
- User is working on a fact-checked document and needs verified sources
DO NOT TRIGGER WHEN:
- User asks a factual question answerable from memory (e.g. "what year was X founded")
- User asks to write or edit code
- User asks for creative writing or brainstorming
The second version tells the model exactly where the edges are. It fires on the right prompts and doesn't fire on the wrong ones. The first version leaves that judgment to the model — which gets it wrong about 30% of the time on edge cases.
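A trigger's error rate can be measured with a small labeled set of prompts. The sketch below is one possible harness: `trigger_error_rate` and the keyword-based `toy_fires` predicate are hypothetical stand-ins (in practice the predicate would wrap whatever decides skill activation), and the prompts are invented examples.

```python
from typing import Callable, Iterable, Tuple


def trigger_error_rate(
    cases: Iterable[Tuple[str, bool]],
    fires: Callable[[str], bool],
) -> float:
    """Fraction of prompts where the trigger decision differs from the label."""
    labeled = list(cases)
    if not labeled:
        raise ValueError("need at least one labeled prompt")
    wrong = sum(1 for prompt, expected in labeled if fires(prompt) != expected)
    return wrong / len(labeled)


def toy_fires(prompt: str) -> bool:
    """Toy stand-in for a vague trigger: fires on loose keyword matches."""
    text = prompt.lower()
    return "research" in text or "source" in text


# (prompt, should the research skill fire?)
CASES = [
    ("Research lithium supply chains and cite your sources", True),
    ("Compare these three sources on sleep and memory", True),
    ("What year was X founded?", False),
    ("Write a function that parses source files", False),  # edge case
]

print(trigger_error_rate(CASES, toy_fires))  # 0.25
```

The last case is the kind of edge that explicit DO NOT TRIGGER WHEN conditions are meant to catch: it mentions "source" but is a coding request.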
Low score:
output_format: structured report with sections
High score:
## Summary
[3 sentences: what the research found, main conclusion, confidence level]
## Key Findings
- Finding 1 [Source: Author, Publication, Year — URL]
- Finding 2 [Source: ...]
## Confidence: High / Medium / Low
[Reasoning — sample size, source quality, consensus level]
## Sources
1. [Full citation with URL]
2. [...]
The model reproduces this template's headings verbatim in its output and fills in the bracketed parts, so there is no ambiguity about what "structured" means.
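Because the template is exact, conformance can be checked mechanically. The following is a minimal sketch, assuming the four headings from the template above are required in that order; the function name is hypothetical.

```python
from typing import List, Sequence

# Heading prefixes taken from the high-score template.
REQUIRED_HEADINGS = ["## Summary", "## Key Findings", "## Confidence", "## Sources"]


def missing_or_misordered(
    output: str,
    required: Sequence[str] = REQUIRED_HEADINGS,
) -> List[str]:
    """Return required headings that are absent or appear out of order."""
    lines = [line.strip() for line in output.splitlines()]
    problems: List[str] = []
    cursor = 0  # only search after the previously matched heading
    for heading in required:
        found = next(
            (i for i in range(cursor, len(lines)) if lines[i].startswith(heading)),
            None,
        )
        if found is None:
            problems.append(heading)
        else:
            cursor = found + 1
    return problems


report = "## Summary\n...\n## Key Findings\n- ...\n## Confidence: High\n..."
print(missing_or_misordered(report))  # ['## Sources']
```

Matching on prefixes rather than whole lines lets `## Confidence: High / Medium / Low` pass with any of the three levels filled in.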
Generic iron laws ("NEVER be inaccurate") don't constrain behavior — they're already implied by "be a good assistant." Specific iron laws target the ways this skill fails:
Each of these comes from a real failure mode observed during testing. They're not principles — they're guardrails.
If you have a SKILL.md and want to know if it's likely to work, ask three questions:
If any answer is "no," that's where quality is leaking. The SKILL.md Linter checks these automatically — it flags missing sections, vague trigger conditions, and iron laws that read as generic rather than skill-specific.
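To make the kinds of checks concrete, here is a rough sketch of what such a lint pass could look like. It is not the SKILL.md Linter's actual implementation: the `IRON LAWS:` marker, the single-line `trigger:` heuristic, and the generic-law regex are all assumptions chosen for illustration.

```python
import re
from typing import List

# Assumed section markers; "IRON LAWS:" in particular is a hypothetical name.
REQUIRED_MARKERS = ["TRIGGER WHEN:", "DO NOT TRIGGER WHEN:", "IRON LAWS:"]

# Heuristic for laws that are generic rather than skill-specific.
GENERIC_LAW = re.compile(
    r"\bNEVER be (inaccurate|wrong|unhelpful|rude)\b", re.IGNORECASE
)


def lint_skill(text: str) -> List[str]:
    """Return human-readable warnings for a SKILL.md body."""
    warnings: List[str] = []
    stripped_lines = [line.strip() for line in text.splitlines()]

    # Compare whole lines so "DO NOT TRIGGER WHEN:" cannot satisfy "TRIGGER WHEN:".
    present = set(stripped_lines)
    for marker in REQUIRED_MARKERS:
        if marker not in present:
            warnings.append(f"missing section: {marker}")

    for line in stripped_lines:
        if line.lower().startswith("trigger:"):
            warnings.append(f"vague trigger (one line, no boundaries): {line}")
        if GENERIC_LAW.search(line):
            warnings.append(f"generic iron law: {line}")
    return warnings


draft = "trigger: research topics when asked\nIRON LAWS:\n- NEVER be inaccurate\n"
for warning in lint_skill(draft):
    print(warning)
```

Pattern matching like this only catches the obvious cases; judging whether a law is truly skill-specific would likely need a human reviewer or a model in the loop.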