AI Agents
SKILL.md
Validation
March 2026 · Andy
Is Your SKILL.md Any Good? How to Lint and Validate AI Agent Skills
Most SKILL.md files fail quietly. They pass a visual check — sections present, frontmatter looks right, examples included — and then produce inconsistent output in production. The problem isn't obvious until you've run the same skill fifty times and noticed the variance. Here's what a linter actually checks, and why.
What Makes a SKILL.md Bad
The most common problems we found testing 20+ skills across different complexity levels:
No frontmatter
Skills without a proper name, description, and version block are hard to index, hard to route, and often get confused with adjacent skills when multiple are active in the same context. This is also the first thing any automated tooling checks.
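For calibration, a minimal frontmatter block might look like the following. The field values are illustrative; match whatever schema your tooling actually indexes on:

```yaml
---
name: research-synthesizer
description: Produces cited, structured research reports from multiple sources
version: 1.2.0
---
```

Three short lines, but they are what routing and indexing hang off of when several skills are active at once.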
Vague trigger conditions
"Activate when the user asks to research something" covers too broad a surface. It fires on simple factual questions the skill wasn't designed for, and sometimes misses complex synthesis tasks it was built to handle. Triggers without explicit negative conditions — what the skill should not activate on — are especially prone to misfire.
No worked examples
The output format description can be perfect on paper and still get misinterpreted. A skill without at least one complete example — real input, full expected output — gives the model nothing to calibrate against. The example is the spec. Without it, "structured report" means something different each session.
No error recovery
What happens when a required input is missing? When the source data is ambiguous? When the user's request contradicts itself? Skills without explicit error handling let the model improvise — which means inconsistent behavior at the exact moments that matter most.
Token bloat
Adding more instructions past the saturation point doesn't improve output quality — it increases latency and, in some models, starts degrading quality by diluting the most relevant instructions. Each skill has a natural saturation point based on its output complexity. Padding past it is waste.
Generic iron laws
"NEVER be inaccurate" is already implied by wanting good output. Iron laws that actually improve consistency are derived from specific failure modes: what this skill actually gets wrong under pressure, at edge cases, with adversarial inputs. Generic laws signal the skill was never stress-tested.
The 8-Point Quality Rubric
In structured evaluations of skill files, quality breaks down along 8 dimensions: 5 technical and 3 for output quality. Each dimension scores one point, and the production-ready threshold is 5 of the 8.
Technical dimensions (5 points)
01 / TECHNICAL
Trigger precision
Are activation conditions specific enough to correctly classify ambiguous prompts? Includes both positive and negative trigger examples.
02 / TECHNICAL
Output completeness
Does the skill define every section, format, and constraint of the expected output — not just describe it?
03 / TECHNICAL
Iron law specificity
Do the constraints address actual failure modes for this skill, or are they generic goodness statements?
04 / TECHNICAL
Error recovery coverage
Are the most likely failure modes explicitly handled — missing inputs, ambiguous data, contradictory requests?
05 / TECHNICAL
Example quality
Are examples complete, realistic, and useful as regression tests — or placeholder-level?
Output quality dimensions (3 points)
06 / OUTPUT
Concision
Does the skill produce outputs at the right depth for the task — not padded, not truncated?
07 / OUTPUT
Consistency
Does the same input produce structurally similar output across runs — not wildly different formats?
08 / OUTPUT
Edge case handling
Does the skill behave correctly on boundary inputs, not just the obvious center case?
Score threshold
A score of 5/8 is the minimum for a production-ready skill. A low score on the output-quality dimensions usually means the output template is underspecified: the skill will produce different formats across runs even with identical inputs.
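The threshold is simple enough to express as code. A sketch, with pass/fail scoring per dimension; the dimension names are shorthand for the rubric above, not an actual linter's field names:

```python
# One point per dimension, pass/fail. Names mirror the 8-point rubric.
TECHNICAL = ("trigger_precision", "output_completeness", "iron_laws",
             "error_recovery", "example_quality")
OUTPUT = ("concision", "consistency", "edge_cases")

def production_ready(scores: dict[str, bool]) -> bool:
    """Apply the 5/8 threshold: a skill passes if at least 5 of the
    8 dimensions pass. Missing dimensions count as failures."""
    total = sum(scores.get(d, False) for d in TECHNICAL + OUTPUT)
    return total >= 5
```

In practice you would also want to report *which* dimensions failed, not just the boolean, since the fix differs per dimension.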
Token Saturation Data
One of the more actionable findings: saturation is a property of the skill's output structure, not the model, and not a setting you tune arbitrarily. Skills with bounded outputs reach a quality plateau early. Skills requiring narrative depth or structural reasoning don't saturate at low budgets.
| Skill | Saturation Point | Model Ceiling | Notes |
| --- | --- | --- | --- |
| SQL query | ~200 tok | Any | Focused output saturates early |
| Debugging | ~300 tok | Any | Cause + fix + explanation fits in 300 |
| Code review | ~500 tok | Any | Structured findings need the extra room |
| Social media repurposer | ~500 tok | Any | Below this: analysis truncates before output |
| Data analysis | ~800 tok | Any | Explanation depth requires budget |
| Plot structure | No ceiling observed | Claude required | Smaller models hit structural ceiling |
The practical implication: running a debugging skill at 2000 tokens doesn't make the output better. It costs more and can dilute the signal. The 77% token cost reduction we measured came from routing output-bounded skills (SQL, debugging, bash scripting) to appropriate budgets rather than defaulting to maximum context.
Skills requiring narrative coherence — world-building, complex research synthesis, plot structure — don't saturate at low budgets. They need room to develop reasoning. Conflating these two categories is a common cause of "the skill is fast but shallow" complaints.
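A minimal sketch of that budget routing, using the saturation numbers from the table above. The headroom factor and the function shape are assumptions for illustration, not measured values:

```python
# Saturation points per skill, from the table above (tokens).
# Skills without an observed ceiling are deliberately absent.
SATURATION = {
    "sql_query": 200,
    "debugging": 300,
    "code_review": 500,
    "social_repurposer": 500,
    "data_analysis": 800,
}

def output_budget(skill: str, model_max: int = 4096, headroom: float = 1.25) -> int:
    """Return a max-token budget: the saturation point plus a small
    headroom margin, capped at the model maximum. Skills with no known
    ceiling (narrative, synthesis) get the full budget."""
    point = SATURATION.get(skill)
    if point is None:
        return model_max  # no observed ceiling: don't starve it
    return min(int(point * headroom), model_max)
```

Routing this way is where the cost reduction comes from: bounded skills stop paying for context they cannot use, while narrative skills keep their full budget.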
What a Linter Checks Automatically
Running through a checklist manually works once. It doesn't scale as a skills library grows past a few files. A linter checks systematically:
Checks run on every SKILL.md
- Frontmatter completeness — name, description, version, and at least one mode defined
- Trigger section structure — positive trigger conditions, negative conditions (what NOT to activate on), and at least one trigger example
- Output template specificity — a concrete format with section headers, not just a description of what the output should contain
- Iron law count and specificity — fewer than 3 is usually underspecified; generic laws flag lower than skill-specific ones
- Example count and completeness — at least one complete example with realistic input and full expected output
- Error recovery presence — at least one explicit handling for a missing or ambiguous input scenario
- Token footprint estimate — flags skills likely to saturate before producing complete output at common budget settings
The linter produces a score per dimension with specific line flags — it doesn't rewrite the skill, it tells you exactly where the gaps are so you can fix them intentionally.
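As a sketch of what one such check looks like, here is a frontmatter pass in Python. The field list and regex approach are illustrative assumptions, not the actual linter's implementation:

```python
import re

REQUIRED_FRONTMATTER = ("name", "description", "version")

def lint_frontmatter(text: str) -> list[str]:
    """Flag missing frontmatter fields in a SKILL.md string.
    Returns a list of human-readable issues; an empty list is a pass."""
    issues = []
    # Frontmatter is a leading block delimited by '---' lines.
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        return ["no frontmatter block found"]
    block = match.group(1)
    for field in REQUIRED_FRONTMATTER:
        if not re.search(rf"^{field}\s*:", block, re.MULTILINE):
            issues.append(f"frontmatter missing '{field}'")
    return issues
```

Each of the other checks follows the same pattern: find the section, test it against a structural requirement, emit a specific flag rather than a pass/fail verdict.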
Common Fixes After Linting
On vague triggers
Write 10 example prompts the skill should handle. Write 5 it shouldn't. Derive your trigger conditions from those examples, not from abstract descriptions of the skill's purpose. The negative examples are usually more clarifying than the positive ones.
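That positive/negative prompt set doubles as a regression harness. A sketch, where `router` stands in for whatever trigger classifier you use (a hypothetical callable taking a prompt and returning whether the skill fires):

```python
def trigger_regression(should_fire: list[str], should_not: list[str],
                       router) -> list[str]:
    """Run a trigger classifier over labeled prompts and report misfires.
    `router` is any callable prompt -> bool. Empty result = all pass."""
    failures = []
    for prompt in should_fire:
        if not router(prompt):
            failures.append(f"MISSED: {prompt}")
    for prompt in should_not:
        if router(prompt):
            failures.append(f"MISFIRED: {prompt}")
    return failures
```

Run it every time you edit the trigger section; a tightened positive condition often silently breaks a prompt that used to fire.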
On missing iron laws
Run the skill on the hardest version of each type of input it handles. Note the worst output you got. Turn each failure into an iron law: "NEVER [the thing that went wrong] — instead [the correct behavior]." If you haven't stress-tested the skill, you don't yet know what the iron laws should be.
On absent examples
The example doesn't need to be long. It needs to be complete. A one-paragraph input with a two-section output is more useful than a half-page description of what the output should look like. The example is your regression test — treat it that way.
On token bloat
Look for duplicate instructions, hedged phrasing ("you should generally try to..."), and sections that restate earlier points in different words. Cut them. Precision beats volume — a 600-token skill that's specific outperforms a 2000-token skill that's vague.
What a high-scoring trigger section looks like
Low score:
trigger: research topics when asked
High score:
TRIGGER WHEN:
- User asks to research a specific topic with source citations required
- User asks to compare multiple sources or synthesize conflicting information
- User needs a structured report with confidence scoring on claims
DO NOT TRIGGER WHEN:
- User asks a factual question answerable from memory ("what year was X founded")
- User asks to write or edit code
- User asks for brainstorming or creative writing
TRIGGER EXAMPLES:
- "Research the current state of open-source LLM benchmarks and give me 3 sources"
- "Compare what Anthropic and OpenAI say about system prompt security"
NOT TRIGGER EXAMPLES:
- "What is Python used for?"
- "Write me a blog post about AI"
The second version tells the model exactly where the edges are. It fires on the right prompts and ignores the wrong ones. The first leaves that judgment to the model — which gets it wrong on roughly 30% of edge cases.
Try the SKILL.md Linter
Paste any SKILL.md file into the linter to get a score across all 8 dimensions with specific flags per issue. If you're starting from scratch, the SKILL.md Generator builds a high-scoring foundation from a plain-text description — including the sections the linter checks for.