AI Agents
SKILL.md
Validation
March 2026 · Andy
Is Your SKILL.md Any Good? How to Lint and Validate AI Agent Skills
Most SKILL.md files fail quietly. They pass a visual check — sections present, frontmatter looks right, examples included — and then produce inconsistent output in production. The problem isn't obvious until you've run the same skill fifty times and noticed the variance. Here's what a linter actually checks, and why.
What Makes a SKILL.md Bad
The most common problems we found testing 20+ skills across different complexity levels:
No frontmatter
Skills without a proper name, description, and version block are hard to index, hard to route, and often get confused with adjacent skills when multiple are active in the same context. This is also the first thing any automated tooling checks.
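For calibration, a minimal frontmatter block might look like the following. The field values are illustrative; match whatever schema your tooling actually indexes on:

```yaml
---
name: research-synthesizer
description: Produces cited, structured research reports from multiple sources
version: 1.2.0
---
```

Three short lines, but they are what routing and indexing hang off of when several skills are active at once.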
Vague trigger conditions
"Activate when the user asks to research something" covers too broad a surface. It fires on simple factual questions the skill wasn't designed for, and sometimes misses complex synthesis tasks it was built to handle. Triggers without explicit negative conditions — what the skill should not activate on — are especially prone to misfire.
No worked examples
The output format description can be perfect on paper and still get misinterpreted. A skill without at least one complete example — real input, full expected output — gives the model nothing to calibrate against. The example is the spec. Without it, "structured report" means something different each session.
No error recovery
What happens when a required input is missing? When the source data is ambiguous? When the user's request contradicts itself? Skills without explicit error handling let the model improvise — which means inconsistent behavior at the exact moments that matter most.
Token bloat
Adding more instructions past the saturation point doesn't improve output quality — it increases latency and, in some models, starts degrading quality by diluting the most relevant instructions. Each skill has a natural saturation point based on its output complexity. Padding past it is waste.
Generic iron laws
"NEVER be inaccurate" is already implied by wanting good output. Iron laws that actually improve consistency are derived from specific failure modes: what this skill actually gets wrong under pressure, at edge cases, with adversarial inputs. Generic laws signal the skill was never stress-tested.
The 8-Point Quality Rubric
In structured evaluations of skill files, quality breaks down along 8 dimensions: 5 technical and 3 for output quality. Each dimension scores one point, and the production-ready threshold is 5 of the 8.
Technical dimensions (5 points)
01 / TECHNICAL
Trigger precision
Are activation conditions specific enough to correctly classify ambiguous prompts? Includes both positive and negative trigger examples.
02 / TECHNICAL
Output completeness
Does the skill define every section, format, and constraint of the expected output — not just describe it?
03 / TECHNICAL
Iron law specificity
Do the constraints address actual failure modes for this skill, or are they generic goodness statements?
04 / TECHNICAL
Error recovery coverage
Are the most likely failure modes explicitly handled — missing inputs, ambiguous data, contradictory requests?
05 / TECHNICAL
Example quality
Are examples complete, realistic, and useful as regression tests — or placeholder-level?
Output quality dimensions (3 points)
06 / OUTPUT
Concision
Does the skill produce outputs at the right depth for the task — not padded, not truncated?
07 / OUTPUT
Consistency
Does the same input produce structurally similar output across runs — not wildly different formats?
08 / OUTPUT
Edge case handling
Does the skill behave correctly on boundary inputs, not just the obvious center case?
Score threshold
A score of 5/8 is the minimum for a production-ready skill. A low score on the output-quality dimensions usually means the output template is underspecified: the skill will produce different formats across runs even with identical inputs.
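The threshold is simple enough to express as code. A sketch, with pass/fail scoring per dimension; the dimension names are shorthand for the rubric above, not an actual linter's field names:

```python
# One point per dimension, pass/fail. Names mirror the 8-point rubric.
TECHNICAL = ("trigger_precision", "output_completeness", "iron_laws",
             "error_recovery", "example_quality")
OUTPUT = ("concision", "consistency", "edge_cases")

def production_ready(scores: dict[str, bool]) -> bool:
    """Apply the 5/8 threshold: a skill passes if at least 5 of the
    8 dimensions pass. Missing dimensions count as failures."""
    total = sum(scores.get(d, False) for d in TECHNICAL + OUTPUT)
    return total >= 5
```

In practice you would also want to report *which* dimensions failed, not just the boolean, since the fix differs per dimension.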
Token Saturation Data
One of the more actionable findings: saturation is a property of the skill's output structure, not the model, and not a setting you tune arbitrarily. Skills with bounded outputs reach a quality plateau early. Skills requiring narrative depth or structural reasoning don't saturate at low budgets.
| Skill | Saturation Point | Model Ceiling | Notes |
| --- | --- | --- | --- |
| SQL query | ~200 tok | Any | Focused output saturates early |
| Debugging | ~300 tok | Any | Cause + fix + explanation fits in 300 |
| Code review | ~500 tok | Any | Structured findings need the extra room |
| Social media repurposer | ~500 tok | Any | Below this: analysis truncates before output |
| Data analysis | ~800 tok | Any | Explanation depth requires budget |
| Plot structure | No ceiling observed | Claude required | Smaller models hit structural ceiling |
The practical implication: running a debugging skill at 2000 tokens doesn't make the output better. It costs more and can dilute the signal. The 77% token cost reduction we measured came from routing output-bounded skills (SQL, debugging, bash scripting) to appropriate budgets rather than defaulting to maximum context.
Skills requiring narrative coherence — world-building, complex research synthesis, plot structure — don't saturate at low budgets. They need room to develop reasoning. Conflating these two categories is a common cause of "the skill is fast but shallow" complaints.
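A minimal sketch of that budget routing, using the saturation numbers from the table above. The headroom factor and the function shape are assumptions for illustration, not measured values:

```python
# Saturation points per skill, from the table above (tokens).
# Skills without an observed ceiling are deliberately absent.
SATURATION = {
    "sql_query": 200,
    "debugging": 300,
    "code_review": 500,
    "social_repurposer": 500,
    "data_analysis": 800,
}

def output_budget(skill: str, model_max: int = 4096, headroom: float = 1.25) -> int:
    """Return a max-token budget: the saturation point plus a small
    headroom margin, capped at the model maximum. Skills with no known
    ceiling (narrative, synthesis) get the full budget."""
    point = SATURATION.get(skill)
    if point is None:
        return model_max  # no observed ceiling: don't starve it
    return min(int(point * headroom), model_max)
```

Routing this way is where the cost reduction comes from: bounded skills stop paying for context they cannot use, while narrative skills keep their full budget.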
What a Linter Checks Automatically
Running through a checklist manually works once. It doesn't scale as a skills library grows past a few files. A linter checks systematically:
Checks run on every SKILL.md
- Frontmatter completeness — name, description, version, and at least one mode defined
- Trigger section structure — positive trigger conditions, negative conditions (what NOT to activate on), and at least one trigger example
- Output template specificity — a concrete format with section headers, not just a description of what the output should contain
- Iron law count and specificity — fewer than 3 is usually underspecified; generic laws flag lower than skill-specific ones
- Example count and completeness — at least one complete example with realistic input and full expected output
- Error recovery presence — at least one explicit handling for a missing or ambiguous input scenario
- Token footprint estimate — flags skills likely to saturate before producing complete output at common budget settings
The linter produces a score per dimension with specific line flags — it doesn't rewrite the skill, it tells you exactly where the gaps are so you can fix them intentionally.
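As a sketch of what one such check looks like, here is a frontmatter pass in Python. The field list and regex approach are illustrative assumptions, not the actual linter's implementation:

```python
import re

REQUIRED_FRONTMATTER = ("name", "description", "version")

def lint_frontmatter(text: str) -> list[str]:
    """Flag missing frontmatter fields in a SKILL.md string.
    Returns a list of human-readable issues; an empty list is a pass."""
    issues = []
    # Frontmatter is a leading block delimited by '---' lines.
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        return ["no frontmatter block found"]
    block = match.group(1)
    for field in REQUIRED_FRONTMATTER:
        if not re.search(rf"^{field}\s*:", block, re.MULTILINE):
            issues.append(f"frontmatter missing '{field}'")
    return issues
```

Each of the other checks follows the same pattern: find the section, test it against a structural requirement, emit a specific flag rather than a pass/fail verdict.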
Common Fixes After Linting
On vague triggers
Write 10 example prompts the skill should handle. Write 5 it shouldn't. Derive your trigger conditions from those examples, not from abstract descriptions of the skill's purpose. The negative examples are usually more clarifying than the positive ones.
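That positive/negative prompt set doubles as a regression harness. A sketch, where `router` stands in for whatever trigger classifier you use (a hypothetical callable taking a prompt and returning whether the skill fires):

```python
def trigger_regression(should_fire: list[str], should_not: list[str],
                       router) -> list[str]:
    """Run a trigger classifier over labeled prompts and report misfires.
    `router` is any callable prompt -> bool. Empty result = all pass."""
    failures = []
    for prompt in should_fire:
        if not router(prompt):
            failures.append(f"MISSED: {prompt}")
    for prompt in should_not:
        if router(prompt):
            failures.append(f"MISFIRED: {prompt}")
    return failures
```

Run it every time you edit the trigger section; a tightened positive condition often silently breaks a prompt that used to fire.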
On missing iron laws
Run the skill on the hardest version of each type of input it handles. Note the worst output you got. Turn each failure into an iron law: "NEVER [the thing that went wrong] — instead [the correct behavior]." If you haven't stress-tested the skill, you don't yet know what the iron laws should be.
On absent examples
The example doesn't need to be long. It needs to be complete. A one-paragraph input with a two-section output is more useful than a half-page description of what the output should look like. The example is your regression test — treat it that way.
On token bloat
Look for duplicate instructions, hedged phrasing ("you should generally try to..."), and sections that restate earlier points in different words. Cut them. Precision beats volume — a 600-token skill that's specific outperforms a 2000-token skill that's vague.
What a high-scoring trigger section looks like
Low score:
trigger: research topics when asked
High score:
TRIGGER WHEN:
- User asks to research a specific topic with source citations required
- User asks to compare multiple sources or synthesize conflicting information
- User needs a structured report with confidence scoring on claims
DO NOT TRIGGER WHEN:
- User asks a factual question answerable from memory ("what year was X founded")
- User asks to write or edit code
- User asks for brainstorming or creative writing
TRIGGER EXAMPLES:
- "Research the current state of open-source LLM benchmarks and give me 3 sources"
- "Compare what Anthropic and OpenAI say about system prompt security"
NOT TRIGGER EXAMPLES:
- "What is Python used for?"
- "Write me a blog post about AI"
The second version tells the model exactly where the edges are. It fires on the right prompts and ignores the wrong ones. The first leaves that judgment to the model — which gets it wrong on roughly 30% of edge cases.
Try the SKILL.md Linter
Paste any SKILL.md file into the linter to get a score across all 8 dimensions with specific flags per issue. If you're starting from scratch, the SKILL.md Generator builds a high-scoring foundation from a plain-text description — including the sections the linter checks for.