AI Agents SKILL.md Validation March 2026 · Andy

Is Your SKILL.md Any Good? How to Lint and Validate AI Agent Skills

Most SKILL.md files fail quietly. They pass a visual check — sections present, frontmatter looks right, examples included — and then produce inconsistent output in production. The problem isn't obvious until you've run the same skill fifty times and noticed the variance. Here's what a linter actually checks, and why.

In this article
  1. What makes a SKILL.md bad
  2. The 8-point quality rubric
  3. Token saturation data
  4. What a linter checks automatically
  5. Common fixes after linting
  6. Try the SKILL.md Linter

What Makes a SKILL.md Bad

The most common problems we found testing 20+ skills across different complexity levels:

No frontmatter
Skills without a proper name, description, and version block are hard to index, hard to route, and often get confused with adjacent skills when multiple are active in the same context. This is also the first thing any automated tooling checks.
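This is also the easiest check to automate. A minimal sketch in Python; the required field names follow this article's convention (name, description, version) and are an assumption, not an official schema:

```python
import re

# Fields this article treats as required -- an assumption, not an official schema.
REQUIRED_FIELDS = {"name", "description", "version"}

def check_frontmatter(skill_md: str) -> list[str]:
    """Return a list of frontmatter problems, empty if the block passes."""
    match = re.match(r"---\n(.*?)\n---\n", skill_md, re.DOTALL)
    if not match:
        return ["no frontmatter block found"]
    # Collect top-level keys without a YAML dependency: lines shaped like "key: value"
    keys = {line.split(":", 1)[0].strip()
            for line in match.group(1).splitlines() if ":" in line}
    return [f"missing field: {field}" for field in sorted(REQUIRED_FIELDS - keys)]
```

A skill that fails this check fails before any of the subtler dimensions matter, which is why tooling runs it first.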
Vague trigger conditions
"Activate when the user asks to research something" covers too broad a surface. It fires on simple factual questions the skill wasn't designed for, and sometimes misses complex synthesis tasks it was built to handle. Triggers without explicit negative conditions — what the skill should not activate on — are especially prone to misfire.
No worked examples
The output format description can be perfect on paper and still get misinterpreted. A skill without at least one complete example — real input, full expected output — gives the model nothing to calibrate against. The example is the spec. Without it, "structured report" means something different each session.
No error recovery
What happens when a required input is missing? When the source data is ambiguous? When the user's request contradicts itself? Skills without explicit error handling let the model improvise — which means inconsistent behavior at the exact moments that matter most.
Token bloat
Adding more instructions past the saturation point doesn't improve output quality — it increases latency and, in some models, starts degrading quality by diluting the most relevant instructions. Each skill has a natural saturation point based on its output complexity. Padding past it is waste.
Generic iron laws
"NEVER be inaccurate" is already implied by wanting good output. Iron laws that actually improve consistency are derived from specific failure modes: what this skill actually gets wrong under pressure, at edge cases, with adversarial inputs. Generic laws signal the skill was never stress-tested.

The 8-Point Quality Rubric

In structured evaluations of skill files, quality breaks down along 8 dimensions: 5 technical and 3 for output quality, each worth one point. The production-ready threshold is 5/8.

Technical dimensions (5 points)

01 / TECHNICAL
Trigger precision
Are activation conditions specific enough to correctly classify ambiguous prompts? Includes both positive and negative trigger examples.
02 / TECHNICAL
Output completeness
Does the skill define every section, format, and constraint of the expected output — not just describe it?
03 / TECHNICAL
Iron law specificity
Do the constraints address actual failure modes for this skill, or are they generic goodness statements?
04 / TECHNICAL
Error recovery coverage
Are the most likely failure modes explicitly handled — missing inputs, ambiguous data, contradictory requests?
05 / TECHNICAL
Example quality
Are examples complete, realistic, and useful as regression tests — or placeholder-level?

Output quality dimensions (3 points)

06 / OUTPUT
Concision
Does the skill produce outputs at the right depth for the task — not padded, not truncated?
07 / OUTPUT
Consistency
Does the same input produce structurally similar output across runs — not wildly different formats?
08 / OUTPUT
Edge case handling
Does the skill behave correctly on boundary inputs, not just the obvious center case?
Score threshold
A score of 5/8 is the minimum for a production-ready skill. Low scores on the output-quality dimensions usually mean the output template is underspecified: the skill will produce different formats across runs even with identical inputs.
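The rubric maps naturally onto a small data structure. A sketch, with dimension names taken from the rubric above and the article's 5/8 threshold read as points across all eight dimensions (the class and method names are illustrative):

```python
from dataclasses import dataclass, field

# Dimension names from the 8-point rubric above.
TECHNICAL = ["trigger_precision", "output_completeness", "iron_law_specificity",
             "error_recovery", "example_quality"]
OUTPUT = ["concision", "consistency", "edge_case_handling"]

@dataclass
class RubricScore:
    points: dict[str, int] = field(default_factory=dict)  # 0 or 1 per dimension

    def total(self) -> int:
        return sum(self.points.get(d, 0) for d in TECHNICAL + OUTPUT)

    def production_ready(self) -> bool:
        # Reading the article's production threshold as 5 of the 8 points.
        return self.total() >= 5
```

Keeping each dimension as a named key makes the linter's per-dimension flags trivial to attach later.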

Token Saturation Data

One of the more actionable findings: saturation is a property of the skill's output structure, not the model, and not a setting you tune arbitrarily. Skills with bounded outputs reach a quality plateau early. Skills requiring narrative depth or structural reasoning don't saturate at low budgets.

Skill                     Saturation point       Model            Notes
SQL query                 ~200 tok               Any              Focused output saturates early
Debugging                 ~300 tok               Any              Cause + fix + explanation fits in 300
Code review               ~500 tok               Any              Structured findings need the extra room
Social media repurposer   ~500 tok               Any              Below this: analysis truncates before output
Data analysis             ~800 tok               Any              Explanation depth requires budget
Plot structure            No ceiling observed    Claude required  Smaller models hit structural ceiling

The practical implication: running a debugging skill at 2000 tokens doesn't make the output better. It costs more and can dilute the signal. The 77% token cost reduction we measured came from routing output-bounded skills (SQL, debugging, bash scripting) to appropriate budgets rather than defaulting to maximum context.

Skills requiring narrative coherence — world-building, complex research synthesis, plot structure — don't saturate at low budgets. They need room to develop reasoning. Conflating these two categories is a common cause of "the skill is fast but shallow" complaints.
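Routing by saturation point can be a plain lookup with a safe default for unbounded skills. A sketch; the saturation numbers mirror the table above, while the headroom factor and the cap for unbounded skills are assumptions to tune:

```python
# Saturation points from the table above; None means no ceiling observed.
SATURATION = {
    "sql_query": 200,
    "debugging": 300,
    "code_review": 500,
    "social_media_repurposer": 500,
    "data_analysis": 800,
    "plot_structure": None,
}

DEFAULT_MAX_BUDGET = 4000  # assumed cap for unbounded or unmeasured skills

def token_budget(skill: str, headroom: float = 1.2) -> int:
    """Budget the skill at its saturation point plus a little headroom."""
    point = SATURATION.get(skill)
    if point is None:  # unbounded skill, or one we haven't measured yet
        return DEFAULT_MAX_BUDGET
    return int(point * headroom)
```

The point of the lookup is the asymmetry: bounded skills get tight budgets, and anything unknown falls through to the generous default rather than getting truncated.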

What a Linter Checks Automatically

Running through a checklist manually works once. It doesn't scale as a skills library grows past a few files. A linter checks systematically:

Checks run on every SKILL.md:
- Frontmatter block with name, description, and version
- Trigger conditions with both positive and negative examples
- At least one complete worked example: real input, full expected output
- Explicit error recovery for missing inputs, ambiguous data, and contradictory requests
- Iron laws tied to specific failure modes rather than generic goodness statements
- Token count relative to the skill's saturation point

The linter produces a score per dimension with specific line flags. It doesn't rewrite the skill; it tells you exactly where the gaps are so you can fix them intentionally.

Common Fixes After Linting

On vague triggers
Write 10 example prompts the skill should handle. Write 5 it shouldn't. Derive your trigger conditions from those examples, not from abstract descriptions of the skill's purpose. The negative examples are usually more clarifying than the positive ones.
On missing iron laws
Run the skill on the hardest version of each type of input it handles. Note the worst output you got. Turn each failure into an iron law: "NEVER [the thing that went wrong] — instead [the correct behavior]." If you haven't stress-tested the skill, you don't yet know what the iron laws should be.
On absent examples
The example doesn't need to be long. It needs to be complete. A one-paragraph input with a two-section output is more useful than a half-page description of what the output should look like. The example is your regression test — treat it that way.
On token bloat
Look for duplicate instructions, hedged phrasing ("you should generally try to..."), and sections that restate earlier points in different words. Cut them. Precision beats volume — a 600-token skill that's specific outperforms a 2000-token skill that's vague.
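The duplicate-instruction hunt can start as a crude word-overlap pass before any manual read. A sketch using Jaccard similarity over word sets; the 0.7 threshold is a guess to tune, not a measured value:

```python
def near_duplicates(instructions: list[str], threshold: float = 0.7) -> list[tuple[int, int]]:
    """Flag index pairs of instructions whose word overlap (Jaccard) meets the threshold."""
    word_sets = [set(s.lower().split()) for s in instructions]
    pairs = []
    for i in range(len(word_sets)):
        for j in range(i + 1, len(word_sets)):
            union = word_sets[i] | word_sets[j]
            if union and len(word_sets[i] & word_sets[j]) / len(union) >= threshold:
                pairs.append((i, j))
    return pairs
```

A flagged pair isn't automatically a cut; it's a prompt to decide which of the two phrasings earns its tokens.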

What a high-scoring trigger section looks like

Low score:

trigger: research topics when asked

High score:

TRIGGER WHEN:
- User asks to research a specific topic with source citations required
- User asks to compare multiple sources or synthesize conflicting information
- User needs a structured report with confidence scoring on claims

DO NOT TRIGGER WHEN:
- User asks a factual question answerable from memory ("what year was X founded")
- User asks to write or edit code
- User asks for brainstorming or creative writing

TRIGGER EXAMPLES:
- "Research the current state of open-source LLM benchmarks and give me 3 sources"
- "Compare what Anthropic and OpenAI say about system prompt security"

NOT TRIGGER EXAMPLES:
- "What is Python used for?"
- "Write me a blog post about AI"

The second version tells the model exactly where the edges are. It fires on the right prompts and ignores the wrong ones. The first leaves that judgment to the model — which gets it wrong on roughly 30% of edge cases.
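Those positive and negative examples also double as a regression suite for whatever routes prompts to the skill. A sketch where should_trigger is a keyword stand-in for the real routing predicate; in practice it would wrap your actual trigger-matching logic:

```python
# Stand-in predicate -- an assumption for illustration; the real one
# would implement the TRIGGER WHEN / DO NOT TRIGGER conditions.
def should_trigger(prompt: str) -> bool:
    p = prompt.lower()
    return "research" in p or "compare" in p

SHOULD_FIRE = [
    "Research the current state of open-source LLM benchmarks and give me 3 sources",
    "Compare what Anthropic and OpenAI say about system prompt security",
]
SHOULD_NOT_FIRE = [
    "What is Python used for?",
    "Write me a blog post about AI",
]

def trigger_regression() -> list[str]:
    """Return a list of routing failures; empty means the trigger spec holds."""
    failures = [f"missed: {p}" for p in SHOULD_FIRE if not should_trigger(p)]
    failures += [f"misfired: {p}" for p in SHOULD_NOT_FIRE if should_trigger(p)]
    return failures
```

Run this every time the trigger section changes; a misfire on the negative examples is exactly the 30% edge-case failure the prose version invites.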

Try the SKILL.md Linter

Paste any SKILL.md file into the linter to get a score across all 8 dimensions with specific flags per issue. If you're starting from scratch, the SKILL.md Generator builds a high-scoring foundation from a plain-text description — including the sections the linter checks for.

Validate an existing skill or generate a new one from scratch.

SKILL.md Linter → SKILL.md Generator