In most agent frameworks I've used, the default token budget is 2,000. I used that number for months without questioning it. Last week I ran every skill I've built through a full token ladder. Nine of the fourteen saturate at or below 1,200 tokens. Six saturate at or below 500.
The 2,000 default is costing you roughly 4x what the work actually requires.
What "saturation" means here
For each skill, I set up a token ladder — a sequence of budgets from the minimum plausible up to 2,000 or beyond. Each rung got the same input. I measured Completeness (0–10) and Quality (0–10) at each level. Saturation is the first rung where both scores plateau and the output isn't truncated. After that point, extra tokens produce longer text, not better text.
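Mechanically, finding the saturation rung is a small loop over the ladder results. A minimal sketch; the tuple shape and the example rungs past 200 are illustrative, not my actual harness:

```python
def saturation_rung(ladder):
    """ladder: list of (budget, completeness, quality, truncated),
    ordered by ascending budget. Returns the first budget where the
    output wasn't cut off and no later rung improves either score."""
    for i, (budget, c, q, truncated) in enumerate(ladder):
        if truncated:
            continue
        # Plateau check: every later rung scores no better on C or Q.
        if all(c2 <= c and q2 <= q for _, c2, q2, _ in ladder[i + 1:]):
            return budget
    return None  # never saturated within the tested range

# A ladder shaped like sql-query's saturates at the first rung:
ladder = [(200, 10, 9, False), (400, 10, 9, False), (800, 10, 9, False)]
```

The truncation check matters: a rung can plateau on scores while still cutting the document short, and that rung doesn't count.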
The test inputs were real tasks. An SQL injection bug. Twelve months of sales data with anomalies baked in. A story premise that needed seven-beat structure. Not synthetic benchmarks — actual work.
The routing table
Here's where each skill saturates and which model handles it at that budget:
| Skill | Saturation | Efficiency Budget | Model | C / Q |
|---|---|---|---|---|
| sql-query | 200 tokens | 200–250 | GLM-5 | 10 / 9 |
| debugging | 300 tokens | 300–400 | GLM-5 | 10 / 9 |
| code-review | 500 tokens | 300–500 | GLM-5 | 9 / 9 |
| bash-scripting | 500 tokens | 300–500 | GLM-5 | 10 / 10 |
| headline-optimizer | 500 tokens | 300–500 | GLM-5 | 10 / 10 |
| social-media-repurposer | 500 tokens | 500–600 | GLM-5 | 10 / 9 |
| regex-builder | ~800 tokens | 500–800 | GLM-5 | 10 / 10 |
| data-analysis | 800 tokens | 800–900 | GLM-5 | 10 / 10 |
| dialogue-writer | 1,200 tokens | 1,200 | GLM-5 | 10 / 10 |
| plot-structure | Claude required | — | Claude | 10 / 10 |
| world-building | ≤3,000 tokens | — | Claude | 10 / 10 |
| scene-writing | ≤3,000 tokens | — | Claude | 10 / 10 |
| seo-content-writer | Claude required | — | Claude | — |
| research-synthesizer | Tool-dependent | — | Claude + tools | 9 / 10 |
C = completeness, Q = quality, both out of 10.
The finding I didn't expect: GLM-5 hits a hard ceiling
On sql-query, debugging, and code-review, GLM-5 is one quality point behind Claude. At 300 tokens, GLM-5 generated a complete SQL injection diagnosis — root cause, exploit explanation, parameterized query fix. Claude's output at the same task was about 650 characters longer and flagged two additional edge cases. For most production uses, that difference doesn't matter.
Plot-structure is different.
I ran the sweep from 800 to 4,000 tokens on GLM-5. At every single budget — 800, 1200, 1600, 2000, 2500, 3000, 3500, 4000 — the output was truncated. Completeness never exceeded 8/10. Quality held at 9/10 for seven consecutive runs. The model was producing good content. It just couldn't finish the document.
The full plot-structure output spec requires a logline, theme question, character arc table, a seven-row beat table, chapter breakdown, and subplot connections. That structural spec runs long. GLM-5 appears to self-limit at some output length before completing it — not a token ceiling in the traditional sense, but something in the model's generation behavior. Running the same input on Claude at 5,000+ tokens: complete output, all beats filled, C=10, Q=10.
More tokens on GLM-5 do not fix this. The limit lives in the model's generation behavior, not in the budget you hand it.
Research-synthesizer is a different kind of ceiling. It needs live web search. Without tools, GLM-5 produces plausible-sounding but unverified content. The skill has an iron law: every claim must be sourced. GLM-5 without tools scores C=6, Q=7 and iron law compliance of 2/5. Claude with WebSearch scores C=9, Q=10, compliance 5/5. That 3-point gap is real and can't be collapsed by adjusting the token budget.
What the token reduction actually means
Moving from the 2,000-token default to the saturation-optimized budgets cuts per-call spend by 40 to 90 percent depending on the skill: sql-query drops from 2,000 tokens to 200 at one extreme, dialogue-writer from 2,000 to 1,200 at the other.
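The arithmetic, using the GLM-5 saturation budgets straight from the routing table:

```python
DEFAULT = 2000

# GLM-5 saturation budgets from the routing table above.
SATURATION = {
    "sql-query": 200, "debugging": 300, "code-review": 500,
    "bash-scripting": 500, "headline-optimizer": 500,
    "social-media-repurposer": 500, "regex-builder": 800,
    "data-analysis": 800, "dialogue-writer": 1200,
}

for skill, sat in SATURATION.items():
    print(f"{skill}: {sat} tokens ({1 - sat / DEFAULT:.0%} below the default)")

# On an even call mix across these nine skills, the average budget
# drops from 2,000 to about 590 tokens, roughly a 3.4x reduction.
# A mix weighted toward the cheap skills is where the ~4x comes from.
avg = sum(SATURATION.values()) / len(SATURATION)
```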
Three levels of effort for implementing this
Use the table as-is
If your skill matches a category — SQL generation, single-function debugging, code review, social posts, data analysis — start at the saturation budget above. Monitor for edge cases. A complex multi-file debugging session may need more than 300 tokens; a single-function bug likely won't.
Run your own sweep
Pick 6–8 budgets starting at 200, doubling until you hit saturation or 4,000. Same representative input at each level. Measure completeness and quality. Look for the first rung where both scores plateau. One hour per skill. The time pays back on the first 10,000 calls.
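The sweep itself is mostly bookkeeping. A sketch under assumptions: `run_skill` and `score` stand in for whatever your framework's invoke-and-evaluate calls look like; they aren't real APIs.

```python
def token_ladder(run_skill, score, test_input, start=200, cap=4000):
    """Run the same input at doubling budgets and record scores.
    run_skill(input, max_tokens) -> output text (hypothetical);
    score(output) -> (completeness, quality, truncated) (hypothetical)."""
    results = []
    budget = start
    while budget <= cap:
        output = run_skill(test_input, budget)
        results.append((budget, *score(output)))
        budget *= 2
    return results
```

Feed the results to whatever plateau check you use and stop the sweep early once scores flatten, if each run costs real money.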
Add routing logic
```python
def route_skill(skill_name):
    # Tool-dependent skills go straight to Claude.
    if skill_requires_tools(skill_name):
        return ("claude", 4000)
    sat = SATURATION_TABLE[skill_name]
    # Anything that saturates within GLM-5's range stays on GLM-5,
    # at exactly its saturation budget.
    if sat <= 4000:
        return ("glm-5", sat)
    # Skills GLM-5 can't finish (plot-structure territory) go to Claude.
    return ("claude", 5000)
```
The logic is simple. The work is in filling SATURATION_TABLE with real data from your own skills.
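For illustration, here is that router seeded with this article's numbers. The hardcoded tool set and the 6,000-token sentinel for Claude-only skills are my stand-ins; substitute your own sweep data.

```python
TOOL_SKILLS = {"research-synthesizer"}  # stand-in for a real tool check

SATURATION_TABLE = {
    "sql-query": 200, "debugging": 300, "code-review": 500,
    "bash-scripting": 500, "headline-optimizer": 500,
    "social-media-repurposer": 500, "regex-builder": 800,
    "data-analysis": 800, "dialogue-writer": 1200,
    # Claude-only skills get a sentinel above the GLM-5 cutoff.
    "plot-structure": 6000, "world-building": 6000,
    "scene-writing": 6000, "seo-content-writer": 6000,
}

def skill_requires_tools(skill_name):
    return skill_name in TOOL_SKILLS

def route_skill(skill_name):
    if skill_requires_tools(skill_name):
        return ("claude", 4000)
    sat = SATURATION_TABLE[skill_name]
    if sat <= 4000:
        return ("glm-5", sat)
    return ("claude", 5000)
```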
What I haven't tested yet
Multi-file code review. The tested input was a single 10-line function. A 500-line codebase probably pushes the saturation point well past 500 tokens. I don't have that data.
Plot-structure on Claude at high budgets. The estimated saturation for a complete, non-truncated output is 5,500–6,500 tokens based on what the 3,000-token run produced (10,729 characters at C=7). I haven't run that sweep.
Research-synthesizer with an open-source tool model. The routing decision sends this to Claude because GLM-5 lacks web access in this setup. If you have a cheaper model with search tools, the routing might look different.
I'll update the table as more data comes in.
The actual takeaway
Token budgets are a setting most people configure once and forget. 2,000 is a reasonable default when you don't have data. It becomes an expensive guess when you do.
The routing table above is what the data shows for these 14 skills with these test inputs. Your numbers will differ by input complexity, skill design, and model setup. The method transfers; the numbers need to be yours.