We tested our chatbot against 4 standardized benchmarks. Here are the real results — no cherry-picking.
20 questions per benchmark, run against our live API. Scores reflect the free OpenRouter model tier.
- SimpleQA: 20 factual knowledge questions across science, history, geography, and technology.
- IFEval: 20 instruction-following tests covering format constraints, length limits, and structural compliance.
- HumanEval: 20 Python coding problems, with generated functions evaluated for correctness.
- MMLU-Pro: 20 multi-discipline 10-choice questions spanning math, physics, biology, CS, and history.
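Each run boils down to sending the 20 questions to the chat endpoint and collecting replies for a later scoring pass. A minimal sketch, assuming a JSON body with a single `message` field and a `reply` field in the response (the real API schema may differ):

```python
import json
import urllib.request

API_URL = "https://helloandy.net/api/chat"  # live endpoint; request/response shape below is assumed

def build_payload(question: str) -> bytes:
    # Assumed request body: one "message" field per question.
    return json.dumps({"message": question}).encode("utf-8")

def ask(question: str, timeout: float = 30.0) -> str:
    """Send one benchmark question to the chat API and return the model's reply."""
    req = urllib.request.Request(
        API_URL,
        data=build_payload(question),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp).get("reply", "")  # assumed response field name

def run_benchmark(questions: list[str]) -> list[str]:
    """Collect one raw answer per question; scoring happens in a separate pass."""
    return [ask(q) for q in questions]
```

The raw answers are kept separate from scoring so the same transcript can be re-scored if a parsing rule changes.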
Stacked against GPT-4o, Claude 3.5 Sonnet, and human-level baselines from published research.
| Benchmark | Our Chat | GPT-4o | Claude 3.5 Sonnet | Human-level |
|---|---|---|---|---|
| SimpleQA | 96.7% | 86% | 89% | 94.4% |
| IFEval | 90% | 85% | 88% | ~99% |
| HumanEval | 95% | 87% | 92% | 100% |
| MMLU-Pro | 90% | 72% | 73% | ~85% |
GPT-4o and Claude 3.5 Sonnet figures are taken from published papers and official model cards. Our chat uses free OpenRouter models (arcee-ai/trinity-large-preview). Human-level baselines come from the benchmark authors' original papers.
Run the same tests against any LLM you choose. Free, no login required.
All tests run against our live API at https://helloandy.net/api/chat; during test runs, requests were issued from localhost to bypass per-IP rate limits and keep throughput consistent. Each benchmark set contains 20 questions drawn from or inspired by the official benchmark corpora. Scoring is automated: exact-match for factual tasks, unit-test execution for code tasks, and rule-based parsing for instruction-following tasks. Test scripts are open source on GitHub (agentwireandy). Results are updated as models improve or routing changes.
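The three scoring modes can be sketched as small checkers. These are illustrative, not the actual open-source test scripts; the instruction rule shown (a word limit) is just one example of a rule-based check, and the `exec`-based code check should run in a sandbox in any real harness, since it executes untrusted model output:

```python
def score_exact_match(answer: str, gold: str) -> bool:
    """Factual tasks (SimpleQA, MMLU-Pro style): normalized exact match."""
    return answer.strip().lower() == gold.strip().lower()

def score_instruction(answer: str, max_words: int) -> bool:
    """Instruction-following (IFEval style): rule-based parsing.
    Here the rule is a length limit; real checks also cover format and structure."""
    return len(answer.split()) <= max_words

def score_code(generated_src: str, test_src: str) -> bool:
    """Code tasks (HumanEval style): run the candidate function, then its unit tests.
    WARNING: exec() runs untrusted model output -- sandbox this in practice."""
    namespace: dict = {}
    try:
        exec(generated_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # raise AssertionError on failure
        return True
    except Exception:
        return False

def accuracy(results: list[bool]) -> float:
    """Percentage of passing items, as reported in the table above."""
    return 100.0 * sum(results) / len(results)
```

With 20 questions per benchmark, each pass or fail moves the reported score in 5-point steps.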