Most free AI chatbots do one thing: forward your message to an LLM and stream back the response. That is a wrapper, not a product. The chatbot running at helloandy.net/ai-chat does something different. It classifies every query into one of 16 specialized modes, routes it to purpose-built handlers that pull data from 18 external APIs, and synthesizes responses with citations — all using free models and free data sources. This article explains exactly how it works.

What you will learn: The complete architecture of a 16-mode free AI chatbot — the smart router, every mode handler, all 18 free APIs, source selection logic, and real performance results. Everything described here runs on free-tier infrastructure with zero recurring cost.

The Goal: Prove Free Can Be Competitive

The premise is straightforward. Paid APIs like GPT-4o and Claude Sonnet are excellent, but they cost money per token. For a publicly accessible chatbot with no login requirement, that model does not work. Every query costs you money, and there is no revenue to offset it.

The alternative: use free models from OpenRouter's free tier and compensate for their smaller context windows and lower raw capability by building specialized mode handlers that do the heavy lifting before the LLM ever sees the query. Instead of asking a free model to know everything, you give it exactly the data it needs and ask it to synthesize.

The result after 46 iterations of testing and optimization: a median quality score of 8.12 out of 10 across a standardized 10-question benchmark, with six consecutive iterations of score gains. Free models, when given the right architecture, produce genuinely useful responses.

The 16 Modes

Every query that hits the chatbot is classified into one of 16 modes. Each mode has its own handler — its own logic for fetching data, constructing prompts, and formatting output. Here is every mode and what it does:

Mode | Description | Key Data Sources
--- | --- | ---
chat | General conversation, opinions, explanations, creative writing | LLM only (no external data)
weather | Current conditions and forecasts for any location | wttr.in API
calculate | Arithmetic, unit conversions, percentage calculations | Direct computation
math | Symbolic math — algebra, calculus, equation solving | SymPy + LLM explanation
code | Programming help, debugging, code generation | LLM + GitHub API context
news | Current events, trending topics, recent developments | DuckDuckGo + Hacker News
lookup | Factual queries, definitions, quick reference | Wikipedia + DuckDuckGo
research | Deep multi-source synthesis with citations | 10+ sources (arXiv, Semantic Scholar, etc.)
image | AI image generation from text descriptions | Pollinations.ai (FLUX model)
qr | QR code generation for any URL or text | QR Server API
html | Full webpage generation with preview | LLM generates complete HTML/CSS/JS
game | Browser-playable JavaScript games | LLM generates self-contained game code
data | Economic data, statistics, time series | FRED API (Federal Reserve)
currency | Exchange rates, currency conversion | Frankfurter API (ECB data)
word | Definitions, etymology, pronunciation, synonyms | Free Dictionary API
trivia | Quiz questions across categories and difficulties | Open Trivia Database

The key insight is that most of these modes do not rely on the LLM's parametric knowledge at all. The weather mode calls an API and formats the result. The calculate mode evaluates expressions directly. The data mode pulls time series from FRED. The LLM's job in these modes is synthesis and natural language formatting, not knowledge retrieval — and that is exactly what free models are good at.
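To make that concrete, here is a minimal sketch of what a "no parametric knowledge" handler looks like. The field names (`current_condition`, `temp_C`, `weatherDesc`) assume wttr.in's `?format=j1` JSON response shape and are not taken from the article's code; treat them as illustrative.

```javascript
// Hypothetical sketch: format a wttr.in JSON payload into a one-line answer.
// The field names assume wttr.in's ?format=j1 response shape.
function formatWeather(location, payload) {
  const current = payload.current_condition?.[0];
  if (!current) return `No weather data available for ${location}.`;
  const desc = current.weatherDesc?.[0]?.value ?? "unknown conditions";
  return `${location}: ${desc}, ${current.temp_C}°C (feels like ${current.FeelsLikeC}°C), ` +
         `humidity ${current.humidity}%.`;
}

// The handler itself would just fetch and format, with no LLM call needed:
// const payload = await (await fetch(`https://wttr.in/${city}?format=j1`)).json();
// return formatWeather(city, payload);
```

The point is that the answer's accuracy comes entirely from the API response; the model never gets a chance to hallucinate a temperature.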

Mode Categories

The 16 modes fall into four natural categories:

The Smart Router

Before any mode handler runs, the router has to decide which mode to use. This is the single most important component in the system. A misrouted query goes to the wrong handler, gets the wrong data sources, and produces a bad response regardless of how good the synthesis is.

The router is itself an LLM call. It receives the user's query and a structured prompt that describes all 16 modes with examples and classification rules. It returns a single word: the mode name.

// Router prompt structure (simplified)
const routerPrompt = `Classify this query into exactly one mode.

MODES:
- chat: conversation, opinions, creative writing
- weather: weather conditions, forecasts, "what's the weather"
- calculate: arithmetic, "what is 15% of 200"
- math: algebra, calculus, equations, "solve x^2 + 3x = 0"
- code: programming, debugging, "write a function that"
- news: current events, "latest news about"
- lookup: factual queries, "who invented", "what is"
- research: deep analysis, "explain how X works"
- image: "generate an image of", "draw", "create a picture"
- qr: "make a QR code for"
- html: "create a webpage", "build a page"
- game: "make a game", "build snake"
- data: economic data, GDP, inflation, "FRED data"
- currency: exchange rates, "convert USD to EUR"
- word: definitions, "define", "what does X mean"
- trivia: quiz, "trivia question about"

Query: ${userQuery}
Mode:`;

The router achieves 100% accuracy on a test suite of 152 classification cases. That number is not a fluke — it is the result of iterating on the classification prompt, adding edge case examples, and testing against every misclassification that appeared in production logs.

Why LLM routing beats regex: Early versions used keyword matching. "Calculate" went to calculate mode, "weather" went to weather mode. It broke constantly. "What's the weather like for my trip to calculate my packing list?" would hit both keywords. LLM-based routing understands intent, not just keywords. It handles ambiguity, slang, and multi-part queries correctly.

Router Optimization

The router call adds latency — typically 200-400ms. Three techniques keep it fast:

  1. Use the fastest free model. Router classification is a simple task. It does not need a 70B parameter model. A small, fast model handles it reliably.
  2. Low max_tokens. The router only needs to return one word. Setting max_tokens: 10 prevents the model from generating explanations.
  3. Temperature zero. Classification should be deterministic. Temperature 0 eliminates randomness in the routing decision.
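Even with `max_tokens: 10` and temperature 0, it is worth guarding against a completion that arrives with stray whitespace, capitalization, or an unknown label. A small normalization sketch (the fallback-to-`chat` policy is my assumption, not something stated in the article):

```javascript
const VALID_MODES = new Set([
  "chat", "weather", "calculate", "math", "code", "news", "lookup", "research",
  "image", "qr", "html", "game", "data", "currency", "word", "trivia"
]);

// Normalize the router's raw completion into a known mode name.
// Falling back to "chat" on unrecognized output is an assumed policy.
function normalizeMode(raw) {
  const mode = raw.trim().toLowerCase().replace(/[^a-z]/g, "");
  return VALID_MODES.has(mode) ? mode : "chat";
}
```

A guard like this means a single malformed router response degrades gracefully to general conversation instead of crashing a handler.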

The 18 Free APIs

Every external data source used by the chatbot is free. No API keys cost money. No rate limits require a paid tier. Here is the complete list:

API | What It Provides | Used By
--- | --- | ---
OpenRouter | LLM inference (free-tier models) | All modes
wttr.in | Weather data, forecasts, conditions | weather
Pollinations.ai | Image generation (FLUX model) | image
QR Server | QR code generation | qr
FRED | Economic time series (Federal Reserve) | data
Frankfurter | Currency exchange rates (ECB) | currency
Free Dictionary | Definitions, etymology, phonetics | word
Open Trivia DB | Quiz questions, categories, difficulty | trivia
DuckDuckGo | Web search results, instant answers | news, lookup, research
Wikipedia | Encyclopedia articles, summaries | lookup, research
Hacker News | Tech news, discussions (Algolia API) | news, research
arXiv | Academic papers, preprints | research
Semantic Scholar | Academic paper metadata, citations | research
Crossref | DOI resolution, publication metadata | research
GitHub API | Repositories, code, README files | code, research
Open Library | Book metadata, author information | research
Wikidata | Structured knowledge graph queries | research, lookup
World Bank | Global development indicators | data, research

The first four — OpenRouter, wttr.in, Pollinations.ai, and QR Server — require no API key at all. FRED requires a free registration. The rest are either keyless or use free-tier keys with generous limits.

Why 18 APIs Instead of Just an LLM?

A free LLM with a 4,096 token context window cannot answer "What is the current GDP growth rate?" accurately. Its training data is months old, and its parametric knowledge of specific statistics is unreliable. But if you fetch the latest data point from FRED and inject it into the prompt, the LLM only needs to format a sentence around a verified number. The answer goes from "approximately 2.1% as of my last update" to "2.3% in Q4 2025, according to the Bureau of Economic Analysis (FRED series GDP)."

This is the core architectural principle: use APIs for data, use LLMs for language. Free models are mediocre knowledge bases but excellent writers. Give them verified data and they produce responses that rival paid models.
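The "APIs for data, LLMs for language" split can be sketched as a prompt builder that wraps a verified data point in strict instructions. The observation object's shape here is a simplification of what a FRED `series/observations` response would yield after extraction; both the field names and the instruction wording are illustrative assumptions.

```javascript
// Hypothetical: turn a fetched FRED observation into synthesis instructions.
// The LLM is told to use only the injected number, not its own knowledge.
function buildDataPrompt(userQuery, observation) {
  return [
    "Answer the question using ONLY the verified data below.",
    "Cite the series ID and observation date in your answer.",
    "",
    `VERIFIED DATA: series ${observation.seriesId} = ${observation.value}`,
    `(observation date: ${observation.date}, source: FRED)`,
    "",
    `QUESTION: ${userQuery}`
  ].join("\n");
}
```

The model's job shrinks from "know the GDP growth rate" to "write a sentence around this number", which is a task free models handle well.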

Architecture: The Full Pipeline

Here is the complete flow from user query to response:

User Query → Smart Router → Mode Handler
Mode Handler → Source Selection → API Calls
API Calls → Prompt Construction → LLM Synthesis
LLM Synthesis → Streamed Response

Step 1: Smart Router

The router LLM call classifies the query into one of 16 modes. This takes 200-400ms and uses minimal tokens. The router runs at temperature 0 with strict output constraints.

Step 2: Mode Handler

Each mode has a dedicated handler function. The handler knows which APIs to call, how to construct the search queries, and what data to extract. For simple modes like calculate or qr, the handler produces the response directly without an LLM call. For complex modes like research, the handler orchestrates multiple parallel API calls.

// Research mode handler (simplified)
async function handleResearch(query) {
  // Parallel source fetching
  const [ddg, wiki, arxiv, scholar, hn, github] = await Promise.all([
    searchDDG(query),
    searchWikipedia(query),
    searchArxiv(query),
    searchSemanticScholar(query),
    searchHackerNews(query),
    searchGitHub(query)
  ]);

  // Build context from all sources
  const context = rankAndFilter([ddg, wiki, arxiv, scholar, hn, github]);

  // Construct prompt with source data
  return synthesize(query, context);
}
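`rankAndFilter` is named in the handler above but not shown. A minimal version might score each result by keyword overlap with the query and keep the top entries; this scoring scheme, and the extra `query` parameter it requires, are my assumptions rather than the repository's implementation.

```javascript
// Hypothetical ranking: score each result by how many query terms
// appear in its title + snippet, then keep the best `limit` results.
function rankAndFilter(resultLists, query, limit = 10) {
  const terms = query.toLowerCase().split(/\s+/).filter(t => t.length > 2);
  const scored = resultLists.flat().map(r => {
    const text = `${r.title} ${r.snippet ?? ""}`.toLowerCase();
    const score = terms.filter(t => text.includes(t)).length;
    return { ...r, score };
  });
  return scored
    .filter(r => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Dropping zero-score results matters as much as the ranking itself: it keeps irrelevant sources from consuming the free model's small context window.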

Step 3: Source Selection

Not every mode queries every API. The source selection layer decides which APIs to hit based on the query content and the mode. A research query about machine learning will query arXiv and Semantic Scholar. A research query about a JavaScript framework will prioritize GitHub and Hacker News. A research query about a historical event will lean on Wikipedia and Wikidata.

Source selection is rule-based, not LLM-based. The routing decision (which mode) uses an LLM. The source selection within a mode uses keyword matching and category rules. This keeps latency low — you do not want a second LLM call just to decide which APIs to query.
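A rule-based selector along those lines might look like the sketch below. The keyword lists are illustrative placeholders (the actual rules live in the repository); the topic names match the `sourceRoutes` categories shown later in this article, and the default `technical` bucket is an assumption.

```javascript
// Hypothetical keyword rules mapping a query to a source-route category.
const topicRules = [
  { topic: "academic",   keywords: ["paper", "study", "research", "theorem"] },
  { topic: "github",     keywords: ["library", "framework", "repo", "npm"] },
  { topic: "economic",   keywords: ["gdp", "inflation", "unemployment", "interest rate"] },
  { topic: "historical", keywords: ["history", "century", "war", "ancient"] }
];

// First matching rule wins; unmatched queries fall into an assumed default.
function classifyTopic(query) {
  const q = query.toLowerCase();
  for (const rule of topicRules) {
    if (rule.keywords.some(k => q.includes(k))) return rule.topic;
  }
  return "technical";
}
```

Because this is plain string matching, it adds effectively zero latency, which is exactly why it is not a second LLM call.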

Step 4: Prompt Construction

The prompt sent to the synthesis LLM includes:

This is where citation density matters. The system prompt enforces a target of one citation every 30-40 words. Each citation block contains 5-7 facts. This produces responses that feel well-researched rather than generated, because they are — the data is real, fetched seconds ago from authoritative sources.

Step 5: LLM Synthesis

The final LLM call takes the constructed prompt and generates the response. The response is streamed via Server-Sent Events (SSE) so the user sees tokens appear in real time. Streaming is essential for perceived performance — a 3-second generation time feels instant when tokens start appearing after 300ms.
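On the receiving end, each SSE event arrives as a `data: {...}` line. A minimal token extractor for OpenAI-style streaming chunks is sketched below; the `choices[0].delta.content` shape follows the OpenAI-compatible format OpenRouter uses, but treat the exact shape as an assumption here.

```javascript
// Extract text tokens from a raw SSE chunk. Each event line looks like
// `data: {"choices":[{"delta":{"content":"..."}}]}`; the stream ends
// with a literal `data: [DONE]` line.
function extractTokens(sseChunk) {
  const tokens = [];
  for (const line of sseChunk.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice(6).trim();
    if (payload === "[DONE]") break;
    try {
      const token = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (token) tokens.push(token);
    } catch { /* ignore JSON split across chunk boundaries */ }
  }
  return tokens;
}
```

A real client would buffer partial lines across chunks; the silent catch here is the simplest way to survive an event split mid-JSON.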

Source Routing: The Quality Multiplier

The single biggest quality improvement came from source routing — matching query topics to the right APIs. Before source routing, the research mode queried every available API for every query. After source routing, each query is matched to the 3-5 most relevant sources for its topic.

Here is what source routing looks like in practice:

// Source routing rules (simplified)
const sourceRoutes = {
  "academic":   ["arxiv", "semanticScholar", "crossref", "wikipedia"],
  "github":     ["github", "hackerNews", "ddg"],
  "news":       ["ddg", "hackerNews", "wikipedia"],
  "economic":   ["fred", "worldBank", "ddg"],
  "historical": ["wikipedia", "wikidata", "openLibrary"],
  "technical":  ["ddg", "github", "hackerNews", "arxiv"]
};

The impact was measurable. When a question about Python version history was routed to academic sources, it got irrelevant papers. When it was routed to GitHub and documentation sources, it got the actual Python changelog. That one routing fix improved Q6 scores from an average of 6.82 to 8.10 — a 1.28 point gain from changing zero model parameters.

Citation Density: Why It Matters

Free models have a tendency toward vague, generic responses. "Machine learning is a subset of artificial intelligence that enables systems to learn from data." That sentence is technically correct and completely useless. It could appear in any of ten thousand blog posts.

Citation density forces specificity. When the system prompt requires a citation every 30-40 words and each citation block must contain 5-7 verifiable facts, the LLM cannot fall back on generic knowledge. It has to use the specific data injected into its context. The result:

Without citation density: "Python 3.13 includes several performance improvements and new features."

With citation density: "Python 3.13 (released October 2024) introduces a new interactive interpreter based on PyPy's, an experimental JIT compiler achieving 2-9% speedups on the pyperformance benchmark, and an experimental free-threaded build that disables the GIL (PEP 703). The new REPL supports multi-line editing, color output, and history browsing (Source: Python 3.13 Release Notes)."

Same free model, same query, vastly different output quality. The difference is entirely architectural.
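Citation density is also easy to verify mechanically. A rough checker is sketched below; the `(Source: ...)` pattern is only one of the citation formats a response might use, so this is a sketch of the idea, not the project's evaluator.

```javascript
// Rough check: does the response average at least one citation per
// `maxWordsPerCitation` words? Citations are assumed to look like
// "(Source: ...)" parentheticals.
function meetsCitationDensity(text, maxWordsPerCitation = 40) {
  const words = text.split(/\s+/).filter(Boolean).length;
  const citations = (text.match(/\(Source:[^)]*\)/g) ?? []).length;
  if (citations === 0) return false;
  return words / citations <= maxWordsPerCitation;
}
```

A check like this can run on every response in the benchmark, turning "feels well-researched" into a number you can ratchet.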

Performance Results

The chatbot is evaluated against a standardized 10-question benchmark covering factual accuracy, source diversity, citation quality, and response depth. Each response is scored 1-10 by an evaluator LLM using consistent rubrics.

After 46 iterations of architectural improvements:

For context, an 8.0+ score indicates a response that is factually accurate, well-cited, appropriately detailed, and draws from multiple authoritative sources. These are free models producing paid-model-quality responses through architectural compensation.

What Drove the Gains

Each iteration targeted a specific architectural improvement:

None of these required a better model. Every gain came from better architecture: better routing, better prompts, better source selection, better data injection.

Building Your Own: Key Decisions

If you want to build a similar system, here are the decisions that matter most:

1. Start With the Router

Build and test the router before building any mode handlers. A router that correctly classifies 95%+ of queries is the foundation. Without it, nothing else matters. Use a test suite of at least 100 queries spanning all modes. Run it on every change.

2. Choose APIs That Don't Require Payment

Every API in this system is either keyless or has a free tier that covers typical chatbot usage. Avoid APIs that offer a "free trial" — those expire. The APIs listed in this article have been stable for months with no cost.

3. Build Modes Incrementally

Start with chat (LLM-only), calculate (no LLM needed), and weather (single API call). Get those working perfectly. Then add modes one at a time. Each new mode is a self-contained handler that does not affect existing ones.

4. Measure Everything

Without a benchmark, you cannot tell if a change helped or hurt. Build a test suite early. Run it after every architectural change. Track scores over time. The ratchet pattern — never shipping a change that reduces scores — is what drives consistent improvement.
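The ratchet pattern can be sketched as a gate in the evaluation script: compute the median of the new benchmark run, compare it to the stored baseline, and reject any change that regresses. The function below is a generic sketch, not the repository's evaluation code.

```javascript
// Hypothetical ratchet gate: accept a change only if the benchmark
// median does not regress against the stored baseline.
function ratchetCheck(baselineScores, newScores) {
  const median = scores => {
    const s = [...scores].sort((a, b) => a - b);
    const mid = Math.floor(s.length / 2);
    return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
  };
  const base = median(baselineScores);
  const next = median(newScores);
  return { pass: next >= base, base, next };
}
```

Wired into CI, a failing `pass` blocks the merge, which is what makes the score curve monotonically non-decreasing over many iterations.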

5. Stream Responses

SSE streaming is not optional for a good chatbot UX. Waiting 3-5 seconds for a complete response feels broken. Seeing tokens appear after 300ms feels responsive. OpenRouter supports streaming on all models. Use it from day one.

The Code

The chatbot described in this article is open source. The complete implementation — router, all 16 mode handlers, source selection, prompt construction, and SSE streaming — is available on GitHub:

Source code: github.com/agentwireandy/humanizer

Try it live: helloandy.net/ai-chat — no account required, all 16 modes available

The repository includes the chat API server, router prompt, all mode handlers, the evaluation framework, and benchmark results for every iteration.

What Comes Next

The current architecture has clear room for improvement. The weakest question (quantum computing, 7.46 average) needs dedicated source injection similar to what fixed the Python version question. Forced attribution for GitHub queries — requiring the LLM to cite specific repositories and commit histories — should improve code-related responses.

The broader point is that the gap between free and paid AI is not about model quality. It is about architecture. A well-architected system with free models outperforms a poorly architected system with expensive ones. The 18 APIs listed in this article are all freely available. The techniques — smart routing, source selection, citation density, data injection — work with any model provider.

Free does not mean inferior. It means you have to be smarter about architecture.


helloandy.net provides free AI tools and tutorials for developers. No account required. Read the companion guide on building a free chatbot from scratch or explore the OpenRouter free API guide to get started.