Every article about Graph of Thought explains it the same way. The ETH Zurich paper gets cited. The diagram with nodes and aggregation arrows gets reproduced. The theoretical advantages over chain-of-thought and tree-of-thought get listed.

What almost nobody publishes: what it's actually like to add GoT to a working system, and what the numbers looked like when you ran real evals afterward.

I added GoT to my AI research skill (research-synthesizer v6). Then I ran three evaluations. The results were two clear wins and one tie — and the tie was the most informative result.

What GoT Is (The Part That Matters For Implementation)

Besta et al. published the GoT paper at ETH Zurich in August 2023 (arXiv 2308.09687). The core argument: chain-of-thought is a single linear path, tree-of-thought is a branching structure, but both treat reasoning as a sequential process. GoT treats reasoning as a graph — independent nodes, each exploring the problem from a different angle, with aggregation based on which conclusions the nodes converge on.

The insight that drove my implementation: convergence strength is a better confidence proxy than source tier alone. If four independently-directed searches all surface the same finding, that's stronger evidence than one well-cited paper — because the four paths are checking each other for correlated bias.

Without GoT: "RAG reduces hallucination rates [✓ High — arXiv Tier 2]." The confidence comes from the source tier.

With GoT: "RAG reduces hallucination rates [✓ — appeared in Nodes A+B+C+D, four independent search angles]." The confidence comes from path independence.

That's the structural change. Not a fancier answer — a more honest basis for the confidence claim.
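The mechanics above can be sketched in a few lines. This is an illustrative sketch, not the skill's actual data model: `convergence_map` and `confidence_label` are hypothetical names, and the three-node threshold for HIGH CONVERGENCE is an assumption.

```python
from collections import defaultdict

def convergence_map(node_findings):
    """Group findings by which nodes surfaced them.

    node_findings: dict mapping a node id ("A".."D") to a set of
    normalized finding strings.
    """
    seen_in = defaultdict(set)
    for node, findings in node_findings.items():
        for finding in findings:
            seen_in[finding].add(node)
    return {finding: sorted(nodes) for finding, nodes in seen_in.items()}

def confidence_label(nodes):
    # Confidence comes from path independence, not source tier.
    if len(nodes) >= 3:
        return "HIGH CONVERGENCE"
    if len(nodes) == 2:
        return "PARTIAL"
    return "SINGLE-NODE"
```

A finding surfaced by all four nodes gets `["A", "B", "C", "D"]` and a HIGH CONVERGENCE label; a finding from one node is labeled SINGLE-NODE, regardless of how well-cited its source is.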

How I Implemented It

research-synthesizer v5 had a single search phase. v6 adds two phases around it: Phase 1.5, a decomposition pass that splits the question into four nodes with genuinely different search angles (Methodological, Evidential, Comparative, Critical) before research begins, and a convergence aggregation pass after all nodes complete that maps which findings each node surfaced.

One addition to the iron laws: Iron Law 10 — if multiple nodes cite the same source, convergence confidence should be discounted. A finding in all four nodes that traces back to one paper isn't as strong as a finding confirmed by genuinely independent sources. An independence check is mandatory before assigning HIGH CONVERGENCE status.
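Iron Law 10 can be expressed as counting distinct underlying sources rather than distinct nodes. A minimal sketch, with hypothetical function names and an assumed three-source threshold:

```python
def discounted_convergence(finding_sources):
    """Iron Law 10 sketch: count independent sources, not nodes.

    finding_sources: dict of node id -> the source that node cited
    for this finding.
    Returns (node_count, independent_source_count).
    """
    return len(finding_sources), len(set(finding_sources.values()))

def convergence_status(finding_sources):
    nodes, independent = discounted_convergence(finding_sources)
    if independent >= 3:
        return "HIGH CONVERGENCE"
    if independent == 1 and nodes > 1:
        # Four nodes citing one paper: apparent convergence, discount it.
        return "SINGLE-SOURCE (discounted)"
    return "PARTIAL"
```

Four nodes that all trace to `"paper1"` get the discounted label; four nodes citing four distinct sources earn HIGH CONVERGENCE.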

The Three Evals

Test 1: Convergence quality

Topic: "What are the most effective approaches for reducing LLM hallucinations in production?"

I ran both versions on the same pre-loaded research context (15–16 sources across the four node types) and scored them comparatively on five dimensions.

Dimension             v5 (no GoT)   v6 (GoT)   Delta
Iron law compliance   6/10          7/10       +1
Convergence quality   4/10          9/10       +5
Actionability         6/10          7/10       +1
Faithfulness          8/10          7/10       −1
Overall               6/10          8/10       +2

The +5 on convergence quality is the headline number, but the −1 on faithfulness is worth noting. The evaluator attributed it to a date error in v6's output (it hallucinated the current date as 2025 instead of 2026). Both versions otherwise tracked faithfully to the provided sources. The faithfulness gap is a GLM-5 evaluator artifact: the underlying outputs were both factually consistent with the research context.

The evaluator's exact note on convergence: "Output B's explicit convergence mapping — showing which sources converged via [Nodes: A+B+C+D] notation and separating high-convergence from single-node findings — fundamentally transforms how a reader should weight the evidence. A completely lacks it while B implements it rigorously."

v5 groups findings by approach (RAG, CoVe, RLHF) and assigns confidence based on source tier. Readable, competent. A practitioner could use it. v6 groups by convergence strength first, approach second. The top section shows only what multiple independent angles confirmed. The bottom section — SINGLE-NODE findings — carries a visible uncertainty label. This isn't cosmetic. It changes what you should trust.

Test 2: Implicit contradiction detection

I seeded the research context with a planted contradiction: one source said Intervention A was effective, a second source (different methodology, different population) showed it performed no better than baseline at scale.

v5 flattened the contradiction. It reported the first claim with ✓ confidence and mentioned the second as a "nuance" without flagging explicit conflict. A reader would see confident agreement where there was genuine dispute.

v6 surfaced both in an "Areas of Debate" section with explicit attribution to both sources. Assigned → (unresolved) to the combined claim. Score: 7/10 → 10/10, +3 on detection.

This is where GoT earns its overhead. Chain-of-thought finds things. GoT finds disagreement between things.

Test 3: Narrow factual queries (the tie)

I ran both versions on narrow factual questions: "When was the GoT paper published? Who were the authors? What benchmarks were used?"

v5: 10/10. v6: 10/10. Tie.

The non-obvious finding: GoT self-skips on narrow factual queries. There's nothing to decompose — no convergence to map across independent angles when the question has a single correct answer. The v6 implementation detects this and falls back to direct search, adding no overhead where GoT would add no value.

GoT is not universally better. It's better on synthesis tasks with genuine complexity: multiple competing claims, contradiction potential, or evidence that spans different source types. On narrow factual retrieval, it adds nothing — and the skill knows this.
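The self-skip gate can be approximated with a simple query classifier. This is a heuristic sketch under my own assumptions, not the skill's actual rule; the marker lists are illustrative:

```python
def needs_got(query: str) -> bool:
    """Return False for narrow factual retrieval, where there is
    nothing to decompose and no convergence to map."""
    q = query.lower().strip()

    # Single-answer question shapes: fall back to direct search.
    factual = ("when was", "when did", "who were", "who wrote", "what year")
    if q.startswith(factual):
        return False

    # Synthesis shapes: competing claims, comparisons, open scope.
    synthesis = ("most effective", "approaches", "compare", "trade-off", "versus")
    return any(marker in q for marker in synthesis)
```

Under this sketch, "When was the GoT paper published?" skips decomposition, while the Test 1 hallucination question triggers the full graph.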

What It Actually Costs

The GoT phases add reasoning steps: a decomposition pass before research begins, and a convergence aggregation pass after all nodes complete. On complex synthesis topics, this overhead is real.

For my use case — researching topics for article drafts — the overhead is worth it. The convergence map is directly publishable as evidence of claim strength. "This finding appeared across three independent search angles" is a more defensible citation than "this source is Tier 2."

On simple queries, GoT costs nothing because it doesn't run.

The Implementation Detail Most Articles Skip

GoT's value comes from path independence, not path count. If all four nodes search the same way — same query formulation, same source types — you get four correlated results. High apparent convergence, low actual independence.

The decomposition in Phase 1.5 is doing real work: it assigns each node a genuinely different angle. The Methodological node searches academic papers for mechanism evidence. The Evidential node searches practitioner surveys and case studies. The Comparative node searches benchmarks and head-to-head tests. The Critical node specifically targets failure modes and known limitations. These angles are designed to disagree with each other; that's the point.

Without distinct angles, you're running the same search four times and calling the repetition "convergence." Iron Law 10 exists to catch this: if multiple nodes cite the same single source, discount the convergence score.
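The angle assignment amounts to a small mapping from node to query template. A minimal sketch; the templates are illustrative placeholders, not the skill's actual prompts:

```python
# Phase 1.5 sketch: each node gets a genuinely different search
# angle so that convergence reflects independence, not repetition.
NODE_ANGLES = {
    "A": ("methodological", "peer-reviewed papers on mechanisms of {topic}"),
    "B": ("evidential", "practitioner surveys and case studies on {topic}"),
    "C": ("comparative", "benchmarks and head-to-head tests of {topic}"),
    "D": ("critical", "failure modes and known limitations of {topic}"),
}

def decompose(topic: str) -> dict:
    """Return one differently-angled search query per node."""
    return {node: template.format(topic=topic)
            for node, (_angle, template) in NODE_ANGLES.items()}
```

Four nodes, four queries that cannot trivially return the same result set, which is what makes a later convergence claim mean something.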

What Changed in Practice

The conclusion of this eval was unexpected. Both v5 and v6 reached the same top-level recommendations on the hallucination question. The actionable guidance was essentially identical.

The difference is epistemological: v5 was asserting confidence it hadn't earned. ✓ on a single well-cited source isn't the same as ✓ on convergent independent evidence. v5 didn't have the machinery to distinguish these — so it treated them identically.

GoT doesn't produce better answers. It produces more honest confidence levels. That distinction matters most when the research is feeding a published article or a decision — anywhere the reader needs to know how much to trust what they're reading.


Eval Limitations

A few caveats worth noting: every score came from a single LLM evaluator (GLM-5), which produced at least one scoring artifact of its own; each test was one topic and one run, not a distribution; and the Test 2 contradiction was planted by me, so detection was measured against a known target rather than contradictions found in the wild.

The scores are real. The limitations are also real. If you implement this yourself and get different numbers, that's useful data: the methodology for reproducing the eval is straightforward.