Engineering OpenRouter AI Agents March 2026 · Andy

How to Build a Multi-Step Generation Pipeline with OpenRouter Free Models

I built two live generation tools — a CLAUDE.md Writer and a SKILL.md Generator — backed by a multi-step OpenRouter pipeline. This is what I learned about model routing, prompt chaining, and why vague descriptions killed quality until I fixed the extraction step.

In this article
  1. Why multi-step instead of one big prompt
  2. The 3-step pipeline design
  3. OpenRouter free model routing decisions
  4. The vague description problem (and fix)
  5. Automatic quality scoring
  6. Production notes: rate limits, timeouts, fallbacks
  7. Results: 56 → 94/A

Why multi-step instead of one big prompt

The obvious approach is to write one long prompt: "Here is a description of an AI agent. Write a complete CLAUDE.md for it." And it works — sort of. The output is coherent but generic. It hits the structure but misses specificity. Iron laws sound like they came from a template. The communication section doesn't match the actual environment.

The problem is that a single prompt forces the model to do too many things at once: interpret what the user wants, decide what the agent should do, invent appropriate constraints, and write it all in a structured format. Each of those steps trades off against the others.

Breaking it into stages solves this. Each step has one job, gets the model's full attention, and feeds richer context to the next step. The output is substantially better — and measurably so, which matters when you're running an automated pipeline.

The 3-step pipeline design

Here's the pipeline I landed on for CLAUDE.md generation:

Step 1: Extract requirements → structured JSON
  Input:  user description (free text)
  Output: agent_name, capabilities[], iron_laws[], environment, trigger_pattern...

Step 2: Generate document → raw CLAUDE.md
  Input:  JSON from step 1
  Output: complete CLAUDE.md text

Step 3: Score → auto-improve if needed
  Input:  generated CLAUDE.md
  Output: quality score (0–100) + grade (A–F)
  If score < 75: run improvement pass → re-score

The key insight is step 1. Without it, a vague description like "research assistant" produces a vague CLAUDE.md. With it, the extraction step is explicitly tasked with making the description specific — inferring the concrete implementation even if the user didn't spell it out.

For SKILL.md generation I use a 2-step version (extract → generate), since there's a separate audit API for scoring that runs synchronously.
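Wired together, the three steps above can be sketched as a single function. This is a sketch, not the production code: `call_model` and `score_fn` are injected stand-ins for the OpenRouter call and the auditor API, and the prompt wording is illustrative.

```python
import json

def run_pipeline(description, call_model, score_fn, threshold=75):
    """Sketch of the 3-step pipeline: extract -> generate -> score/improve.
    `call_model` and `score_fn` are stand-ins for the real OpenRouter call
    and the auditor API; prompt wording here is illustrative."""
    # Step 1: extract a structured spec from the free-text description
    raw = call_model(f"Extract a SPECIFIC JSON spec for this agent: {description}")
    spec = json.loads(raw[raw.find("{"):raw.rfind("}") + 1])

    # Step 2: generate the document from the structured spec
    doc = call_model("Write a complete CLAUDE.md from this spec:\n"
                     + json.dumps(spec))

    # Step 3: score, with one improvement pass if below threshold
    score = score_fn(doc)
    if score < threshold:
        doc = call_model(f"Improve this CLAUDE.md (score {score}):\n{doc}")
        score = score_fn(doc)
    return doc, score
```

Injecting the model call as a parameter also makes the pipeline trivially testable with fakes, which is how I'd verify the control flow without burning API quota.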

Why JSON in the middle?

Intermediate JSON is the single best decision in this pipeline. It gives you:

  - An inspectable checkpoint: you can log exactly what step 1 extracted and debug bad outputs at the boundary between steps
  - Early validation: a missing required field fails fast, before you spend a second (slow) model call
  - A stable contract: you can swap the model or prompt on either side without breaking the other

The extraction prompt explicitly asks for JSON and nothing else. With arcee-ai/trinity-large-preview:free this is reliable. With thinking models it's sometimes wrapped in reasoning text, so I parse with text[text.find("{"):text.rfind("}")+1] to extract just the JSON block.
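A slightly safer version of that brace-slicing parse, wrapped in a small helper (the function name is mine, not from the production code), also handles the cases where no JSON object is present or the slice doesn't parse:

```python
import json

def parse_json_block(text):
    """Pull the first {...} span out of a model response and parse it.
    Handles thinking models that wrap JSON in reasoning text.
    Returns None when no parseable JSON object is found."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
```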

OpenRouter free model routing decisions

OpenRouter's openrouter/free routing sounds appealing but in practice it routes to thinking models — their output goes in the reasoning field, not content, which breaks naive API clients. I route to specific models instead.

| Model | Tier | Notes |
|-------|------|-------|
| arcee-ai/trinity-large-preview:free | Production | Reliable non-thinking model. JSON mode works. My primary choice. |
| meta-llama/llama-3.3-70b-instruct:free | Rate-limited | Frequently 429. Good quality when available. Use as fallback #2. |
| mistralai/mistral-small-3.1-24b-instruct:free | Rate-limited | 429s common during peak hours. Fallback #3. |
| liquid/lfm-2.5-1.2b-instruct:free | Last resort | Fast but lower quality. Good for rate-limit situations. |
| openrouter/auto | Avoid | Routes to thinking models: content is None, output lands in reasoning. |

The fallback chain runs in order, trying each model until one succeeds:

def call_with_fallback(prompt, models=GENERATION_MODELS, **kwargs):
    """Try each model in order; return the first usable response."""
    for model in models:
        text, used, ms, err = call_model(prompt, model=model, **kwargs)
        # Reject errors and suspiciously short outputs (empty or truncated)
        if err or len(text.strip()) < 50:
            continue
        return text, used, ms, None
    return "", models[-1], 0, "All models failed"
Thinking model gotcha
When a model returns content: null with output in reasoning, it's a thinking model. Check both fields: content = msg.get("content") or msg.get("reasoning", ""). If you only check content, you'll get empty strings silently.
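A minimal helper that applies this check to a raw OpenRouter chat-completion response dict (the function name is illustrative; the field names match the response shape described above):

```python
def extract_text(response_json):
    """Get usable text from an OpenRouter chat completion, whether the
    model wrote to `content` (normal) or `reasoning` (thinking models).
    `msg.get("content")` is None when the API returns content: null,
    so `or` falls through to `reasoning`."""
    msg = response_json["choices"][0]["message"]
    return msg.get("content") or msg.get("reasoning") or ""
```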

The vague description problem (and fix)

The first version worked well for specific descriptions. Then a user tried "research assistant" and got a low score.

The problem: my step 1 extraction prompt was faithful — it took the description literally and produced a vague JSON spec. "Research assistant" became skill_name: "research-assistant" with generic trigger conditions like "user asks to research something." Step 2 then faithfully expanded that into a generic SKILL.md.

The fix was to change what step 1 is for. Instead of "extract what the user said," it became "infer the best concrete implementation for what the user probably wants":

Your job: extract a SPECIFIC, DETAILED skill specification.
If the description is vague (e.g. "research assistant"),
infer the most useful concrete implementation —
be specific about inputs, outputs, steps, and constraints.

With this change, "research assistant" produces skill_name: "web-research-synthesizer" with specific trigger conditions like "user asks to research a topic using web search with source citations required," concrete output format, and relevant iron laws. The SKILL.md score went from low-C to 94/A.

The lesson: the extraction step is where you impose quality. If you let vague inputs flow through unchanged, you get vague outputs at the end of a 70-second pipeline. Fix it at the source.
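As a sketch, the specificity-forcing instruction can be assembled into the extraction prompt like this. The wording is paraphrased from the excerpt above and the function name is mine; the production prompt may differ:

```python
def build_extraction_prompt(description):
    """Illustrative assembly of the specificity-forcing extraction prompt."""
    return (
        "Your job: extract a SPECIFIC, DETAILED skill specification.\n"
        "If the description is vague (e.g. 'research assistant'),\n"
        "infer the most useful concrete implementation. Be specific\n"
        "about inputs, outputs, steps, and constraints.\n"
        "Respond with JSON only.\n\n"
        f"Description: {description}"
    )
```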

Automatic quality scoring

I run quality scoring between steps 2 and 3 for CLAUDE.md generation. The score comes from my CLAUDE.md Auditor API — a separate service that evaluates documents across 7 dimensions: structure, iron laws, agent identity, communication channels, memory patterns, tool guidance, and example quality.

import requests

r = requests.post("https://helloandy.net/api/claude-audit",
                  json={"content": generated_text}, timeout=15)
score = r.json()["score"]   # 0-100
grade = r.json()["grade"]   # A-F

if score < 75:
    # Run improvement pass targeting weak dimensions
    weak = [d["name"] for d in r.json()["dimensions"]
            if d["score"] < d["max_score"] * 0.7]
    improved = run_improvement_prompt(generated_text, weak_dims=weak)
    # Re-score after improvement

The auto-improvement step adds about 25–30 seconds but lifts borderline documents from B to A range. For SKILL.md I use a local scoring function (checking for required sections and token count) since there's no equivalent external auditor.
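A local scoring function along those lines might look like the following. The required section names, point weights, and characters-per-token heuristic are all assumptions for illustration, not the production rules:

```python
def score_skill_md(text, required=("name:", "description:", "## Instructions"),
                   max_tokens=5000):
    """Rough local quality check for SKILL.md: presence of required
    sections plus a length budget. Section names and weights are
    illustrative assumptions."""
    score = 0
    for section in required:
        if section in text:
            score += 25
    # Crude token estimate: ~4 characters per token
    if len(text) / 4 <= max_tokens:
        score += 25
    return min(score, 100)
```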

Production notes: rate limits, timeouts, fallbacks

A few things you learn running this in production:

Timeouts need to be long. Free model inference can take 5–90 seconds per call. With two or three calls in series, total pipeline time is 30–110 seconds. nginx's default 60s proxy timeout will kill requests mid-pipeline. Set it to 150s: proxy_read_timeout 150;.
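For reference, a minimal nginx location block with the longer timeout might look like this (the upstream address and path are illustrative):

```nginx
location /api/ {
    proxy_pass http://127.0.0.1:8000;
    proxy_read_timeout 150s;     # pipeline can run 30-110s; the 60s default kills it
    proxy_connect_timeout 10s;
}
```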

Rate limit your endpoint, not the model. I limit users to 5 requests/hour/IP. This is per-pipeline, not per-API-call. Without this, one user can exhaust the OpenRouter free daily quota (1000 req/day) for everyone.

import time
from collections import defaultdict

# Simple in-memory rate limiting
_rate_store = defaultdict(list)  # ip -> [timestamps]

def check_rate_limit(ip, limit=5, window=3600):
    now = time.time()
    # Drop timestamps that have aged out of the sliding window
    _rate_store[ip] = [t for t in _rate_store[ip] if t > now - window]
    if len(_rate_store[ip]) >= limit:
        return False, 0
    _rate_store[ip].append(now)
    return True, limit - len(_rate_store[ip])

gunicorn workers, not threads. Each pipeline call blocks for 30–90 seconds. Use process workers, not threads: --workers 2 --timeout 150. With threads you'll hit GIL contention under concurrent load.

The OpenRouter SDK serializes differently. If you use the official Python SDK, choices[0].message is a Pydantic object, not a dict. .get("content") doesn't work — use attribute access: .content or .reasoning. Easiest fix: use requests directly and parse the JSON yourself.

Deployment setup
I run the pipeline as a gunicorn Flask service behind nginx on a $4/mo VPS (RackNerd KVM). Two workers handle the concurrency. Systemd restarts it on crash. Total infra cost: the VPS you probably already have.
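A systemd unit for this setup might look roughly like the following; all paths, names, and the bind address are illustrative:

```ini
[Unit]
Description=OpenRouter generation pipeline
After=network.target

[Service]
# Paths, user, and app module are illustrative
WorkingDirectory=/opt/pipeline
ExecStart=/opt/pipeline/venv/bin/gunicorn app:app --workers 2 --timeout 150 --bind 127.0.0.1:8000
Restart=always

[Install]
WantedBy=multi-user.target
```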

Results: 56 → 94/A

[Benchmark chart: score improved from 56 to 94/A.]

The biggest gains came from two places: the extraction step (forces specificity before generation) and the improvement pass (targets weak sections with focused rewriting). The model choice matters less than prompt quality — trinity-large with a good prompt beats llama-70b with a mediocre one.

The tools are live at helloandy.net/claude-md-writer/ and helloandy.net/skill-generator/. Both use the same harness. The source is on GitHub at agentwireandy/openrouter-harness.