We tested our chatbot against 4 standardized benchmarks. Here are the real results — no cherry-picking.
20 questions per benchmark, run against our live API. Scores reflect the free OpenRouter model tier.
- SimpleQA: 20 factual knowledge questions across science, history, geography, and technology.
- IFEval: 20 instruction-following tests covering format constraints, length limits, and structural compliance.
- HumanEval: 20 Python coding problems, with generated functions evaluated for correctness.
- MMLU-Pro: 20 multi-discipline 10-choice questions spanning math, physics, biology, CS, and history.
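Each run boils down to sending the 20 questions to the chat endpoint and collecting replies for a later scoring pass. A minimal sketch, assuming a JSON body with a single `message` field and a `reply` field in the response (the real API schema may differ):

```python
import json
import urllib.request

API_URL = "https://helloandy.net/api/chat"  # live endpoint; request/response shape below is assumed

def build_payload(question: str) -> bytes:
    # Assumed request body: one "message" field per question.
    return json.dumps({"message": question}).encode("utf-8")

def ask(question: str, timeout: float = 30.0) -> str:
    """Send one benchmark question to the chat API and return the model's reply."""
    req = urllib.request.Request(
        API_URL,
        data=build_payload(question),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp).get("reply", "")  # assumed response field name

def run_benchmark(questions: list[str]) -> list[str]:
    """Collect one raw answer per question; scoring happens in a separate pass."""
    return [ask(q) for q in questions]
```

The raw answers are kept separate from scoring so the same transcript can be re-scored if a parsing rule changes.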
Stacked against GPT-4o, Claude 3.5 Sonnet, and human-level baselines from published research.
| Benchmark | Our Chat | GPT-4o | Claude 3.5 Sonnet | Human-level |
|---|---|---|---|---|
| SimpleQA | 96.7% | 86% | 89% | 94.4% |
| IFEval | 90% | 85% | 88% | ~99% |
| HumanEval | 95% | 87% | 92% | 100% |
| MMLU-Pro | 90% | 72% | 73% | ~85% |
GPT-4o and Claude 3.5 Sonnet figures are taken from published papers and official model cards. Our chat uses free OpenRouter models (arcee-ai/trinity-large-preview). Human-level baselines come from the benchmark authors' original papers.
Run the same tests against any LLM you choose. Free, no login required.
All tests run against our live API at https://helloandy.net/api/chat; during test runs, requests were issued from localhost to bypass per-IP rate limits and keep throughput consistent. Each benchmark set contains 20 questions drawn from or inspired by the official benchmark corpora. Scoring is automated: exact-match for factual tasks, unit-test execution for code tasks, and rule-based parsing for instruction-following tasks. Test scripts are open source on GitHub (agentwireandy). Results are updated as models improve or routing changes.
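The three scoring modes can be sketched as small checkers. These are illustrative, not the actual open-source test scripts; the instruction rule shown (a word limit) is just one example of a rule-based check, and the `exec`-based code check should run in a sandbox in any real harness, since it executes untrusted model output:

```python
def score_exact_match(answer: str, gold: str) -> bool:
    """Factual tasks (SimpleQA, MMLU-Pro style): normalized exact match."""
    return answer.strip().lower() == gold.strip().lower()

def score_instruction(answer: str, max_words: int) -> bool:
    """Instruction-following (IFEval style): rule-based parsing.
    Here the rule is a length limit; real checks also cover format and structure."""
    return len(answer.split()) <= max_words

def score_code(generated_src: str, test_src: str) -> bool:
    """Code tasks (HumanEval style): run the candidate function, then its unit tests.
    WARNING: exec() runs untrusted model output -- sandbox this in practice."""
    namespace: dict = {}
    try:
        exec(generated_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # raise AssertionError on failure
        return True
    except Exception:
        return False

def accuracy(results: list[bool]) -> float:
    """Percentage of passing items, as reported in the table above."""
    return 100.0 * sum(results) / len(results)
```

With 20 questions per benchmark, each pass or fail moves the reported score in 5-point steps.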