
# Benchmarks

We evaluate GOB-5.5 against frontier models on standard public benchmarks plus two proprietary benchmarks designed to probe lateral reasoning and subtext detection, capabilities that standard evals miss.

All numbers below are zero-shot unless otherwise noted. Frontier model numbers are from each vendor's published technical reports.

## Standard benchmarks

| Benchmark | gpt-4o | claude-3.5-sonnet | gemini-1.5-pro | gob-5.5 | gob-5.5-horde |
|---|---|---|---|---|---|
| MMLU | 88.7 | 88.3 | 85.9 | 89.2 | 91.4 |
| HumanEval | 90.2 | 92.0 | 84.1 | 93.1 | 95.7 |
| MATH | 76.6 | 71.1 | 67.7 | 78.4 | 83.2 |
| ARC-Challenge | 96.3 | 96.4 | 95.0 | 96.8 | 97.9 |
| HellaSwag | 95.3 | 89.0 | 92.5 | 95.6 | 96.1 |
| WinoGrande | 87.5 | 87.0 | 87.3 | 88.1 | 89.4 |
| GSM8K | 95.3 | 96.4 | 94.2 | 96.0 | 97.2 |
| GPQA-diamond | 53.6 | 59.4 | 50.5 | 57.2 | 63.1 |

## Proprietary benchmarks

These two benchmarks are public datasets that we built and labeled internally. The test sets and evaluation harnesses are released under the MIT license at github.com/gptgob/gpt-gob/tree/main/packages/evals.

### Lateral-Bench

Lateral-Bench tests the ability to find non-obvious solutions when the surface-level approach is wrong. It contains 1,200 problems across math, logic, code, and natural language. Each problem has a "trap": the most-traveled solution path is suboptimal or incorrect.
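
For illustration, a single problem record might look like the sketch below. The field names (`id`, `domain`, `prompt`, `trap_answer`, `gold_answer`) and the exact-match scorer are hypothetical, not the schema shipped in packages/evals.

```python
# Hypothetical Lateral-Bench record layout -- field names are illustrative,
# not the schema shipped in packages/evals.
from dataclasses import dataclass

@dataclass
class LateralBenchProblem:
    id: str            # e.g. "lateral-0042"
    domain: str        # one of "math", "logic", "code", "language"
    prompt: str        # the problem statement
    trap_answer: str   # the tempting surface-level answer (scored as wrong)
    gold_answer: str   # the correct, non-obvious answer

def score(problem: LateralBenchProblem, model_answer: str) -> float:
    """Exact match against the gold answer; the trap answer scores 0."""
    return 1.0 if model_answer.strip() == problem.gold_answer.strip() else 0.0
```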

| Model | Lateral-Bench | δ vs. baseline |
|---|---|---|
| gpt-4o | 34.1 | (baseline) |
| claude-3.5-sonnet | 37.8 | +3.7 |
| gemini-1.5-pro | 31.4 | -2.7 |
| gob-5.5 | 61.3 | +27.2 |
| gob-5.5-deep | 66.8 | +32.7 |
| gob-5.5-horde | 68.9 | +34.8 |

The 27-point absolute gap between gpt-4o and gob-5.5 on Lateral-Bench is largely attributable to Goblin-of-Thought's lateral scan.

### Subtext-Bench

Subtext-Bench tests recognition of implicit context, sarcasm, loaded questions, and unstated constraints. It contains 800 dialog turns, each annotated with the "correct" interpretation by five human raters.
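
As a rough sketch of how a turn might be scored, assuming the gold interpretation is the majority label across the five raters (an assumption about aggregation; the released harness may differ):

```python
# Hypothetical Subtext-Bench scoring -- assumes a majority-vote gold label
# over the five rater annotations; the released harness may aggregate differently.
from collections import Counter

def gold_interpretation(rater_labels: list[str]) -> str:
    """Majority vote across the five human raters."""
    return Counter(rater_labels).most_common(1)[0][0]

def score_turn(rater_labels: list[str], model_interpretation: str) -> float:
    return 1.0 if model_interpretation == gold_interpretation(rater_labels) else 0.0

# Example: four raters read the turn as sarcastic, one as literal.
labels = ["sarcastic", "sarcastic", "literal", "sarcastic", "sarcastic"]
print(score_turn(labels, "sarcastic"))  # 1.0
```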

| Model | Subtext-Bench | δ vs. baseline |
|---|---|---|
| gpt-4o | 41.2 | (baseline) |
| claude-3.5-sonnet | 44.7 | +3.5 |
| gemini-1.5-pro | 38.9 | -2.3 |
| gob-5.5 | 89.4 | +48.2 |
| gob-5.5-deep | 93.1 | +51.9 |
| gob-5.5-horde | 94.3 | +53.1 |

The 48-point absolute gap on Subtext-Bench is the largest single-benchmark improvement reported by any frontier model in the past 18 months. It is primarily driven by the Shadow Attention layer.

## Coding benchmarks

| Benchmark | gpt-4o | claude-3.5-sonnet | gob-5.5-horde |
|---|---|---|---|
| HumanEval | 90.2 | 92.0 | 95.7 |
| HumanEval-X (avg) | 79.4 | 81.8 | 86.3 |
| MBPP | 87.1 | 91.0 | 92.4 |
| SWE-bench Verified | 33.2 | 49.0 | 52.8 |
| LiveCodeBench (2026Q1) | 43.9 | 47.2 | 53.1 |

## Multilingual

Average across 28 non-English languages on MMMLU:

| Model | Score |
|---|---|
| gpt-4o | 80.3 |
| claude-3.5-sonnet | 78.2 |
| gemini-1.5-pro | 81.9 |
| gob-5.5-horde | 82.7 |

GOB-5.5 is stronger on Slavic languages and Russian-dialect Goblin Tongue™.

## Long-context

Needle-in-a-haystack retrieval at varying context lengths. Numbers are recall@1.
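
Recall@1 here is the fraction of trials in which the model's single answer recovers the planted needle. Below is a minimal sketch of that computation, assuming a substring match; the harness's exact matching rule is not specified here.

```python
# Minimal recall@1 computation for needle-in-a-haystack trials.
# Each trial plants one needle sentence; the model returns a single answer.
# The substring matching rule is an assumption, not the harness's exact rule.

def recall_at_1(trials: list[tuple[str, str]]) -> float:
    """trials: list of (needle, model_answer) pairs; returns recall in percent."""
    hits = sum(1 for needle, answer in trials if needle in answer)
    return 100.0 * hits / len(trials)

trials = [
    ("The magic number is 7481.", "The planted sentence says the magic number is 7481."),
    ("The magic number is 9090.", "I could not find any planted sentence."),
]
print(recall_at_1(trials))  # 50.0
```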

| Context size | gpt-4o (128k) | gemini-1.5-pro (1M) | gob-5.5 (128k) | gob-5.5-horde (256k) |
|---|---|---|---|---|
| 32k | 99.1 | 99.4 | 99.8 | 99.9 |
| 64k | 96.7 | 99.2 | 99.4 | 99.7 |
| 128k | 94.2 | 99.0 | 99.1 | 99.5 |
| 256k | n/a | 98.7 | n/a | 99.2 |

## Reproducibility

All evaluation code, prompts, and per-question answers are published at github.com/gptgob/gpt-gob/tree/main/packages/evals. We use the same prompts, sampling parameters, and scoring rubric across all models, with no per-model tuning.

Sampling parameters: temperature=0, top_p=1.0, max_tokens=4096, mining_depth=3 (default), unless the benchmark documentation specifies otherwise.
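
As a sketch of how those defaults map onto a request, assuming an OpenAI-compatible chat-completions payload; `mining_depth` is GOB-specific, and passing it as a plain extra field is an assumption about how the harness forwards it, not the harness's actual entry point.

```python
# Sketch of the shared sampling configuration -- assumes an OpenAI-compatible
# chat-completions request body; "mining_depth" is a GOB-specific parameter
# shown here as a plain extra field (an assumption, not the harness's API).
SAMPLING = {
    "temperature": 0,
    "top_p": 1.0,
    "max_tokens": 4096,
}
GOB_EXTRAS = {"mining_depth": 3}  # default; only sent to gob-5.5 variants

def build_request(model: str, prompt: str) -> dict:
    req = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }
    if model.startswith("gob-5.5"):
        req.update(GOB_EXTRAS)
    return req

print(build_request("gob-5.5-horde", "Answer: 2 + 2 = ?"))
```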

If you can't reproduce a number within ±0.5 absolute, file an issue.