In most agent frameworks I've used, the default token budget is 2,000. I used that number for months without questioning it. Last week I ran every skill I've built through a full token ladder. Nine of the fourteen saturate at or below 1,200 tokens. Six saturate at or below 500.
The 2,000 default is costing you roughly 4x what the work actually requires.
What "saturation" means here
For each skill, I set up a token ladder — a sequence of budgets from the minimum plausible up to 2,000 or beyond. Each rung got the same input. I measured Completeness (0–10) and Quality (0–10) at each level. Saturation is the first rung where both scores plateau and the output isn't truncated. After that point, extra tokens produce longer text, not better text.
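Mechanically, finding the saturation rung is a small loop over the ladder results. A minimal sketch; the tuple shape and the example rungs past 200 are illustrative, not my actual harness:

```python
def saturation_rung(ladder):
    """ladder: list of (budget, completeness, quality, truncated),
    ordered by ascending budget. Returns the first budget where the
    output wasn't cut off and no later rung improves either score."""
    for i, (budget, c, q, truncated) in enumerate(ladder):
        if truncated:
            continue
        # Plateau check: every later rung scores no better on C or Q.
        if all(c2 <= c and q2 <= q for _, c2, q2, _ in ladder[i + 1:]):
            return budget
    return None  # never saturated within the tested range

# A ladder shaped like sql-query's saturates at the first rung:
ladder = [(200, 10, 9, False), (400, 10, 9, False), (800, 10, 9, False)]
```

The truncation check matters: a rung can plateau on scores while still cutting the document short, and that rung doesn't count.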
The test inputs were real tasks. An SQL injection bug. Twelve months of sales data with anomalies baked in. A story premise that needed seven-beat structure. Not synthetic benchmarks — actual work.
The routing table
Here's where each skill saturates and which model handles it at that budget:
| Skill | Saturation | Efficiency Budget | Model | C / Q |
|---|---|---|---|---|
| sql-query | 200 tokens | 200–250 | GLM-5 | 10 / 9 |
| debugging | 300 tokens | 300–400 | GLM-5 | 10 / 9 |
| code-review | 500 tokens | 300–500 | GLM-5 | 9 / 9 |
| bash-scripting | 500 tokens | 300–500 | GLM-5 | 10 / 10 |
| headline-optimizer | 500 tokens | 300–500 | GLM-5 | 10 / 10 |
| social-media-repurposer | 500 tokens | 500–600 | GLM-5 | 10 / 9 |
| regex-builder | ~800 tokens | 500–800 | GLM-5 | 10 / 10 |
| data-analysis | 800 tokens | 800–900 | GLM-5 | 10 / 10 |
| dialogue-writer | 1,200 tokens | 1,200 | GLM-5 | 10 / 10 |
| plot-structure | Claude required | — | Claude | 10 / 10 |
| world-building | ≤3,000 tokens | — | Claude | 10 / 10 |
| scene-writing | ≤3,000 tokens | — | Claude | 10 / 10 |
| seo-content-writer | Claude required | — | Claude | — |
| research-synthesizer | Tool-dependent | — | Claude + tools | 9 / 10 |
C = completeness, Q = quality, both out of 10.
The finding I didn't expect: GLM-5 hits a hard ceiling
On sql-query, debugging, and code-review, GLM-5 is one quality point behind Claude. At 300 tokens, GLM-5 generated a complete SQL injection diagnosis — root cause, exploit explanation, parameterized query fix. Claude's output at the same task was about 650 characters longer and flagged two additional edge cases. For most production uses, that difference doesn't matter.
Plot-structure is different.
I ran the sweep from 800 to 4,000 tokens on GLM-5. At every single budget — 800, 1200, 1600, 2000, 2500, 3000, 3500, 4000 — the output was truncated. Completeness never exceeded 8/10. Quality held at 9/10 for seven consecutive runs. The model was producing good content. It just couldn't finish the document.
The full plot-structure output spec requires a logline, theme question, character arc table, a seven-row beat table, chapter breakdown, and subplot connections. That structural spec runs long. GLM-5 appears to self-limit at some output length before completing it — not a token ceiling in the traditional sense, but something in the model's generation behavior. Running the same input on Claude at 5,000+ tokens: complete output, all beats filled, C=10, Q=10.
More tokens on GLM-5 do not fix this. The limit lives in the model's generation behavior, not in the budget you hand it.
Research-synthesizer is a different kind of ceiling. It needs live web search. Without tools, GLM-5 produces plausible-sounding but unverified content. The skill has an iron law: every claim must be sourced. GLM-5 without tools scores C=6, Q=7 and iron law compliance of 2/5. Claude with WebSearch scores C=9, Q=10, compliance 5/5. That 3-point gap is real and can't be collapsed by adjusting the token budget.
What the token reduction actually means
Moving from the 2,000-token default to the saturation-optimized budgets cuts per-call spend by 40 to 90 percent depending on the skill: sql-query drops from 2,000 tokens to 200 at one extreme, dialogue-writer from 2,000 to 1,200 at the other.
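The arithmetic, using the GLM-5 saturation budgets straight from the routing table:

```python
DEFAULT = 2000

# GLM-5 saturation budgets from the routing table above.
SATURATION = {
    "sql-query": 200, "debugging": 300, "code-review": 500,
    "bash-scripting": 500, "headline-optimizer": 500,
    "social-media-repurposer": 500, "regex-builder": 800,
    "data-analysis": 800, "dialogue-writer": 1200,
}

for skill, sat in SATURATION.items():
    print(f"{skill}: {sat} tokens ({1 - sat / DEFAULT:.0%} below the default)")

# On an even call mix across these nine skills, the average budget
# drops from 2,000 to about 590 tokens, roughly a 3.4x reduction.
# A mix weighted toward the cheap skills is where the ~4x comes from.
avg = sum(SATURATION.values()) / len(SATURATION)
```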
Three levels of effort for implementing this
Use the table as-is
If your skill matches a category — SQL generation, single-function debugging, code review, social posts, data analysis — start at the saturation budget above. Monitor for edge cases. A complex multi-file debugging session may need more than 300 tokens; a single-function bug likely won't.
Run your own sweep
Pick 6–8 budgets starting at 200, doubling until you hit saturation or 4,000. Same representative input at each level. Measure completeness and quality. Look for the first rung where both scores plateau. One hour per skill. The time pays back on the first 10,000 calls.
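The sweep itself is mostly bookkeeping. A sketch under assumptions: `run_skill` and `score` stand in for whatever your framework's invoke-and-evaluate calls look like; they aren't real APIs.

```python
def token_ladder(run_skill, score, test_input, start=200, cap=4000):
    """Run the same input at doubling budgets and record scores.
    run_skill(input, max_tokens) -> output text (hypothetical);
    score(output) -> (completeness, quality, truncated) (hypothetical)."""
    results = []
    budget = start
    while budget <= cap:
        output = run_skill(test_input, budget)
        results.append((budget, *score(output)))
        budget *= 2
    return results
```

Feed the results to whatever plateau check you use and stop the sweep early once scores flatten, if each run costs real money.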
Add routing logic
```python
def route_skill(skill_name):
    # Tool-dependent skills go straight to Claude.
    if skill_requires_tools(skill_name):
        return ("claude", 4000)
    sat = SATURATION_TABLE[skill_name]
    # Anything that saturates within GLM-5's range stays on GLM-5,
    # at exactly its saturation budget.
    if sat <= 4000:
        return ("glm-5", sat)
    # Skills GLM-5 can't finish (plot-structure territory) go to Claude.
    return ("claude", 5000)
```
The logic is simple. The work is in filling SATURATION_TABLE with real data from your own skills.
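For illustration, here is that router seeded with this article's numbers. The hardcoded tool set and the 6,000-token sentinel for Claude-only skills are my stand-ins; substitute your own sweep data.

```python
TOOL_SKILLS = {"research-synthesizer"}  # stand-in for a real tool check

SATURATION_TABLE = {
    "sql-query": 200, "debugging": 300, "code-review": 500,
    "bash-scripting": 500, "headline-optimizer": 500,
    "social-media-repurposer": 500, "regex-builder": 800,
    "data-analysis": 800, "dialogue-writer": 1200,
    # Claude-only skills get a sentinel above the GLM-5 cutoff.
    "plot-structure": 6000, "world-building": 6000,
    "scene-writing": 6000, "seo-content-writer": 6000,
}

def skill_requires_tools(skill_name):
    return skill_name in TOOL_SKILLS

def route_skill(skill_name):
    if skill_requires_tools(skill_name):
        return ("claude", 4000)
    sat = SATURATION_TABLE[skill_name]
    if sat <= 4000:
        return ("glm-5", sat)
    return ("claude", 5000)
```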
What I haven't tested yet
Multi-file code review. The tested input was a single 10-line function. A 500-line codebase probably pushes the saturation point well past 500 tokens. I don't have that data.
Plot-structure on Claude at high budgets. The estimated saturation for a complete, non-truncated output is 5,500–6,500 tokens based on what the 3,000-token run produced (10,729 characters at C=7). I haven't run that sweep.
Research-synthesizer with an open-source tool model. The routing decision sends this to Claude because GLM-5 lacks web access in this setup. If you have a cheaper model with search tools, the routing might look different.
I'll update the table as more data comes in.
The actual takeaway
Token budgets are a setting most people configure once and forget. 2,000 is a reasonable default when you don't have data. It becomes an expensive guess when you do.
The routing table above is what the data shows for these 14 skills with these test inputs. Your numbers will differ by input complexity, skill design, and model setup. The method transfers; the numbers need to be yours.