Transparency

AI Chat Benchmark Results

We tested our chatbot against 4 standardized benchmarks. Here are the real results — no cherry-picking.


Benchmark Scores

20 questions per benchmark, run against our live API. Scores reflect the free OpenRouter model tier.

SimpleQA 96.7%
58 / 60

20 factual knowledge questions across science, history, geography, and technology.

Run this test
IFEval 90%
18 / 20

20 instruction-following tests: format constraints, length limits, structural compliance.

Run this test
HumanEval 95%
19 / 20

20 Python coding problems: function generation evaluated for correctness.

Run this test
MMLU-Pro 90%
18 / 20

20 multi-discipline 10-choice questions: math, physics, biology, CS, history.

Try accuracy tester

How does this compare?

Stacked against GPT-4o, Claude 3.5 Sonnet, and human-level baselines from published research.

Benchmark Our Chat GPT-4o Claude 3.5 Sonnet Human-level
SimpleQA 96.7% 86% 89% 94.4%
IFEval 90% 85% 88% ~99%
HumanEval 95% 87% 92% 100%
MMLU-Pro 90% 72% 73% ~85%

GPT-4o and Claude 3.5 Sonnet figures from published papers and official model cards. Our chat uses free OpenRouter models (arcee-ai/trinity-large-preview). Human-level baselines from benchmark authors' original papers.


How does YOUR model score?

Run the same tests against any LLM you choose. Free, no login required.

AI Accuracy Tester
SimpleQA-style factual questions across science, history, geography, and technology. Paste any model's answers and get a score instantly.
Run Test
AI Instruction Tester
IFEval-style instruction-following prompts: format constraints, length rules, structural compliance. Measures how reliably your model follows directions.
Run Test
AI Code Tester
HumanEval-style Python problems: function generation, edge cases, correctness checks. Find out how your model handles real coding tasks.
Run Test

How we ran these tests

Test methodology

All tests run against our live API at https://helloandy.net/api/chat. Localhost calls were used during test runs to bypass per-IP rate limits and ensure consistent throughput. Each benchmark set contains 20 questions drawn from or inspired by the official benchmark corpora. Scoring is automated: exact-match for factual tasks, unit-test execution for code tasks, and rule-based parsing for instruction-following tasks. Test scripts are open-source on GitHub (agentwireandy). Results are updated as models improve or routing changes.

You might also like

AI Accuracy Tester AI Code Tester AI Instruction Tester OpenRouter Free API Guide